Posted to dev@flink.apache.org by David Morávek <dm...@apache.org> on 2023/02/03 08:42:06 UTC

[DISCUSS] FLIP-291: Externalized Declarative Resource Management

Hi everyone,

This FLIP [1] introduces a new REST API for declaring resource requirements
for the Adaptive Scheduler. There seems to be a clear need for this API
based on the discussion on the "Reworking the Rescale API" [2] thread.

Before we get started, this work is heavily based on the prototype [3]
created by Till Rohrmann, and the FLIP is being published with his consent.
Big shoutout to him!

Last but not least, thanks to Chesnay and Roman for the initial reviews and
discussions.

The best place to start is the short demo [4] that I've recorded, which
illustrates the newly added capabilities (rescaling a running job, handing
resources back to the RM, and session cluster support).

The intuition behind the FLIP is to allow resource requirements ("resource
boundaries") to be defined externally, within which the AdaptiveScheduler can
navigate. This is a building block for higher-level efforts such as an
external Autoscaler. The natural extension of this work would be to allow
specifying per-vertex ResourceProfiles.
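To make this concrete, here is a minimal sketch of what declaring such resource boundaries could look like from a client's side. The payload shape, field names, and vertex IDs below are illustrative assumptions, not the final API from the FLIP:

```python
import json

def build_requirements(bounds_per_vertex):
    """Assemble a hypothetical payload that maps each JobVertexID to a
    parallelism range ("resource boundary") the scheduler may navigate within.

    `bounds_per_vertex` maps a vertex ID to a (lowerBound, upperBound) pair.
    """
    vertices = {}
    for vertex_id, (lower, upper) in bounds_per_vertex.items():
        if not 1 <= lower <= upper:
            raise ValueError(f"invalid bounds for {vertex_id}: [{lower}, {upper}]")
        vertices[vertex_id] = {
            "parallelism": {"lowerBound": lower, "upperBound": upper}
        }
    return json.dumps(vertices, indent=2)

# Example vertex IDs; real ones would come from the job's REST API / JobGraph.
print(build_requirements({
    "bc764cd8ddf7a0cff126f51c16239658": (1, 4),   # e.g. a source
    "ea632d67b7d595e5b851708ae9ad79d6": (1, 8),   # e.g. a stateful operator
}))
```

An external tool (e.g. an autoscaler) would send such a document to the JobManager's REST endpoint and, since the call is idempotent, could safely replay the same payload on failure.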

Looking forward to your thoughts; any feedback is appreciated!

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
[2] https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
[3] https://github.com/tillrohrmann/flink/tree/autoscaling
[4] https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view

Best,
D.

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by John Roesler <vv...@apache.org>.
Thanks for the answer, David!

It sounds like there is a race condition, but it’s a known issue not specific to this FLIP, and the failure case isn’t too bad. I’m satisfied with that.

Thanks,
John
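The retry contract discussed in this thread (the endpoint is idempotent, so the caller simply replays the same request until the new requirements are persisted) can be sketched as follows; `declare` stands in for the actual HTTP PUT call and all names here are assumptions, not Flink's client API:

```python
import time

def declare_with_retry(declare, payload, attempts=5, backoff_s=1.0):
    """Retry an idempotent declaration until it succeeds.

    Because the endpoint is idempotent, replaying the same payload after a
    transient failure (e.g. a Dispatcher failover mid-request) is safe: the
    worst case is that the job briefly keeps its previous requirements.
    """
    for attempt in range(1, attempts + 1):
        try:
            return declare(payload)
        except OSError:
            # Stand-in for a transient HTTP / HA-storage failure.
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)  # simple linear backoff
```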

On Thu, Feb 23, 2023, at 10:39, David Morávek wrote:
> Hi Everyone,
>
> @John
>
> This is a problem that we've spent some time trying to crack; in the end,
> we've decided against doing any updates to the JobGraphStore from the
> JobMaster, to avoid having multiple writers guarded by different leader
> election locks (the Dispatcher and the JobMaster might live in different
> processes). The contract we've chosen instead is to leverage the
> idempotency of the endpoint and having the user of the API retry in case
> we're unable to persist new requirements in the JobGraphStore [1]. We
> eventually need to move JobGraphStore out of the dispatcher, but that's way
> out of the scope of this FLIP. The solution is a deliberate trade-off. The
> worst scenario is that the Dispatcher fails over in between retries, which
> would simply rescale the job to meet the previous resource requirements
> (more extended unavailability of underlying HA storage would have worse
> consequences than this). Does that answer your question?
>
> @Matthias
>
> Good catch! I'm fixing it now, thanks!
>
> [1]
> https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9#diff-a4b690fb2c4975d25b05eb4161617af0d704a85ff7b1cad19d3c817c12f1e29cR1151
>
> Best,
> D.
>
> On Tue, Feb 21, 2023 at 12:24 AM John Roesler <vv...@apache.org> wrote:
>
>> Thanks for the FLIP, David!
>>
>> I just had one small question. IIUC, the REST API PUT request will go
>> through the new DispatcherGateway method to be handled. Then, after
>> validation, the dispatcher would call the new JobMasterGateway method to
>> actually update the job.
>>
>> Which component will write the updated JobGraph? I just wanted to make
>> sure it’s the JobMaster because it it were the dispatcher, there could be a
>> race condition with the async JobMaster method.
>>
>> Thanks!
>> -John
>>
>> On Mon, Feb 20, 2023, at 07:34, Matthias Pohl wrote:
>> > Thanks for your clarifications, David. I don't have any additional major
>> > points to add. One thing about the FLIP: The RPC layer API for updating
>> the
>> > JRR returns a future with a JRR? I don't see value in returning a JRR
>> here
>> > since it's an idempotent operation? Wouldn't it be enough to return
>> > CompletableFuture<Void> here? Or am I missing something?
>> >
>> > Matthias
>> >
>> > On Mon, Feb 20, 2023 at 1:48 PM Maximilian Michels <mx...@apache.org>
>> wrote:
>> >
>> >> Thanks David! If we could get the pre-allocation working as part of
>> >> the FLIP, that would be great.
>> >>
>> >> Concerning the downscale case, I agree this is a special case for the
>> >> (single-job) application mode where we could re-allocate slots in a
>> >> way that could leave entire task managers unoccupied which we would
>> >> then be able to release. The goal essentially is to reduce slot
>> >> fragmentation on scale down by packing the slots efficiently. The
>> >> easiest way to add this optimization when running in application mode
>> >> would be to drop as many task managers as needed during the restart such that
>> >> NUM_REQUIRED_SLOTS >= NUM_AVAILABLE_SLOTS stays true. We can look into
>> >> this independently of the FLIP.
>> >>
>> >> Feel free to start the vote.
>> >>
>> >> -Max
>> >>
>> >> On Mon, Feb 20, 2023 at 9:10 AM David Morávek <dm...@apache.org> wrote:
>> >> >
>> >> > Hi everyone,
>> >> >
>> >> > Thanks for the feedback! I've updated the FLIP to use idempotent PUT
>> API
>> >> instead of PATCH and to properly handle lower bound settings, to support
>> >> the "pre-allocation" of the resources.
>> >> >
>> >> > @Max
>> >> >
>> >> > > How hard would it be to address this issue in the FLIP?
>> >> >
>> >> > I've included this in the FLIP. It might not be too hard to implement
>> >> this in the end.
>> >> >
>> >> > > B) drop as many superfluous task managers as needed
>> >> >
>> >> > I've intentionally left this part out for now because this ultimately
>> >> needs to be the responsibility of the Resource Manager. After all, in
>> the
>> >> Session Cluster scenario, the Scheduler doesn't have the bigger picture
>> of
>> >> other tasks of other jobs running on those TMs. This will most likely
>> be a
>> >> topic for another FLIP.
>> >> >
>> >> > WDYT? If there are no other questions or concerns, I'd like to start
>> the
>> >> vote on Wednesday.
>> >> >
>> >> > Best,
>> >> > D.
>> >> >
>> >> > On Wed, Feb 15, 2023 at 3:34 PM Maximilian Michels <mx...@apache.org>
>> >> wrote:
>> >> >>
>> >> >> I missed that the FLIP states:
>> >> >>
>> >> >> > Currently, even though we’d expose the lower bound for clarity and
>> >> API completeness, we won’t allow setting it to any other value than one
>> >> until we have full support throughout the stack.
>> >> >>
>> >> >> How hard would it be to address this issue in the FLIP?
>> >> >>
>> >> >> There is not much value in offering a lower bound setting that won't be
>> >> >> respected or that throws an error when it is set. If we had support for a
>> >> >> lower bound, we could enforce a resource contract externally via
>> >> >> setting lowerBound == upperBound. That ties back to the Rescale API
>> >> >> discussion we had. I want to better understand what the major
>> concerns
>> >> >> would be around allowing this.
>> >> >>
>> >> >> Just to outline how I imagine the logic to work:
>> >> >>
>> >> >> A) The resource constraints are already met => Nothing changes
>> >> >> B) More resources available than required => Cancel the job, drop as
>> >> >> many superfluous task managers as needed, restart the job
>> >> >> C) Fewer resources available than required => Acquire new task
>> >> >> managers, wait for them to register, cancel and restart the job
>> >> >>
>> >> >> I'm open to helping out with the implementation.
>> >> >>
>> >> >> -Max
>> >> >>
>> >> >> On Mon, Feb 13, 2023 at 7:45 PM Maximilian Michels <mx...@apache.org>
>> >> wrote:
>> >> >> >
>> >> >> > Based on further discussion I had with Chesnay on this PR [1], I
>> think
>> >> >> > jobs would currently go into a restarting state after the resource
>> >> >> > requirements have changed. This wouldn't achieve what we had in
>> mind,
>> >> >> > i.e. sticking to the old resource requirements until enough slots
>> are
>> >> >> > available to fulfil the new resource requirements. So this may not
>> be
>> >> >> > 100% what we need but it could be extended to do what we want.
>> >> >> >
>> >> >> > -Max
>> >> >> >
>> >> >> > [1]
>> https://github.com/apache/flink/pull/21908#discussion_r1104792362
>> >> >> >
>> >> >> > On Mon, Feb 13, 2023 at 7:16 PM Maximilian Michels <mxm@apache.org
>> >
>> >> wrote:
>> >> >> > >
>> >> >> > > Hi David,
>> >> >> > >
>> >> >> > > This is awesome! Great writeup and demo. This is pretty much
>> what we
>> >> >> > > need for the autoscaler as part of the Flink Kubernetes operator
>> >> [1].
>> >> >> > > Scaling Flink jobs effectively is hard but fortunately we have
>> >> solved
>> >> >> > > the issue as part of the Flink Kubernetes operator. The only
>> >> critical
>> >> >> > > piece we are missing is a better way to execute scaling
>> decisions,
>> >> as
>> >> >> > > discussed in [2].
>> >> >> > >
>> >> >> > > Looking at your proposal, we would set lowerBound == upperBound
>> for
>> >> >> > > the parallelism because we want to fully determine the
>> parallelism
>> >> >> > > externally based on the scaling metrics. Does that sound right?
>> >> >> > >
>> >> >> > > What is the timeline for these changes? Is there a JIRA?
>> >> >> > >
>> >> >> > > Cheers,
>> >> >> > > Max
>> >> >> > >
>> >> >> > > [1]
>> >>
>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
>> >> >> > > [2]
>> >> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
>> >> >> > >
>> >> >> > > On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu <
>> xiangyu0xf@gmail.com>
>> >> wrote:
>> >> >> > > >
>> >> >> > > > Hi David,
>> >> >> > > >
>> >> >> > > > Thanks for your reply. I think your response totally makes
>> >> sense.  This
>> >> >> > > > FLIP targets declaring required resources to the ResourceManager
>> >> instead of
>> >> >> > > > using the ResourceManager to add/remove TMs directly.
>> >> >> > > >
>> >> >> > > > Best,
>> >> >> > > > Xiangyu
>> >> >> > > >
>> >> >> > > >
>> >> >> > > >
>> >> >> > > > David Morávek <da...@gmail.com> wrote on Mon, Feb 13, 2023 at 15:46:
>> >> >> > > >
>> >> >> > > > > Hi everyone,
>> >> >> > > > >
>> >> >> > > > > @Shammon
>> >> >> > > > >
>> >> >> > > > > I'm not entirely sure what "config file" you're referring to.
>> >> You can, of
>> >> >> > > > > course, override the default parallelism in
>> "flink-conf.yaml",
>> >> but for
>> >> >> > > > > sinks and sources, the parallelism needs to be tweaked on the
>> >> connector
>> >> >> > > > > level ("WITH" statement).
>> >> >> > > > >
>> >> >> > > > > This is something that should be achieved with tooling around
>> >> Flink. We
>> >> >> > > > > want to provide an API on the lowest level that generalizes
>> >> well. Achieving
>> >> >> > > > > what you're describing should be straightforward with this
>> API.
>> >> >> > > > >
>> >> >> > > > > @Xiangyu
>> >> >> > > > >
>> >> >> > > > > Is it possible for this REST API to declare TM resources in
>> the
>> >> future?
>> >> >> > > > >
>> >> >> > > > >
>> >> >> > > > > Would you like to add/remove TMs if you use an active
>> Resource
>> >> Manager?
>> >> >> > > > > This would be out of the scope of this effort since it
>> targets
>> >> the
>> >> >> > > > > scheduler component only (we make no assumptions about the
>> used
>> >> Resource
>> >> >> > > > > Manager). Also, the AdaptiveScheduler is only intended to be
>> >> used for
>> >> >> > > > > Streaming.
>> >> >> > > > >
>> >> >> > > > >  And for streaming jobs, I'm wondering if there is any
>> >> situation we need to
>> >> >> > > > > > rescale the TM resources of a flink cluster at first and
>> then
>> >> the
>> >> >> > > > > adaptive
>> >> >> > > > > > scheduler will rescale the per-vertex ResourceProfiles
>> >> accordingly.
>> >> >> > > > > >
>> >> >> > > > >
>> >> >> > > > > We plan on adding support for the ResourceProfiles (dynamic
>> slot
>> >> >> > > > > allocation) as the next step. Again we won't make any
>> >> assumptions about the
>> >> >> > > > > used Resource Manager. In other words, this effort ends by
>> >> declaring
>> >> >> > > > > desired resources to the Resource Manager.
>> >> >> > > > >
>> >> >> > > > > Does that make sense?
>> >> >> > > > >
>> >> >> > > > > @Matthias
>> >> >> > > > >
>> >> >> > > > > We've done another pass on the proposed API and currently
>> lean
>> >> towards
>> >> >> > > > > having an idempotent PUT API.
>> >> >> > > > > - We don't care too much about multiple writers' scenarios in
>> >> terms of who
>> >> >> > > > > can write an authoritative payload; this is up to the user of
>> >> the API to
>> >> >> > > > > figure out
>> >> >> > > > > - It's indeed tricky to achieve atomicity with PATCH API;
>> >> switching to PUT
>> >> >> > > > > API seems to do the trick
>> >> >> > > > > - We won't allow partial "payloads" anymore, meaning you need
>> >> to define
>> >> >> > > > > requirements for all vertices in the JobGraph; This is
>> >> completely fine for
>> >> >> > > > > the programmatic workflows. For DEBUG / DEMO purposes, you
>> can
>> >> use the GET
>> >> >> > > > > endpoint and tweak the response to avoid writing the whole
>> >> payload by hand.
>> >> >> > > > >
>> >> >> > > > > WDYT?
>> >> >> > > > >
>> >> >> > > > >
>> >> >> > > > > Best,
>> >> >> > > > > D.
>> >> >> > > > >
>> >> >> > > > > On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <
>> >> xiangyu0xf@gmail.com>
>> >> >> > > > > wrote:
>> >> >> > > > >
>> >> >> > > > > > Hi David,
>> >> >> > > > > >
>> >> >> > > > > > Thanks for creating this FLIP. I think this work is very
>> >> useful,
>> >> >> > > > > > especially in autoscaling scenario.  I would like to share
>> >> some questions
>> >> >> > > > > > from my view.
>> >> >> > > > > >
>> >> >> > > > > > 1, Is it possible for this REST API to declare TM resources
>> >> in the
>> >> >> > > > > future?
>> >> >> > > > > > I'm asking because we are building the autoscaling feature
>> >> for Flink OLAP
>> >> >> > > > > > Session Cluster in ByteDance. We need to rescale the
>> >> cluster's resource
>> >> >> > > > > on
>> >> >> > > > > > TM level instead of Job level. It would be very helpful if
>> we
>> >> have a REST
>> >> >> > > > > > API for our external Autoscaling service to use.
>> >> >> > > > > >
>> >> >> > > > > > 2, And for streaming jobs, I'm wondering if there is any
>> >> situation we
>> >> >> > > > > need
>> >> >> > > > > > to rescale the TM resources of a flink cluster at first and
>> >> then the
>> >> >> > > > > > adaptive scheduler will rescale the per-vertex
>> >> ResourceProfiles
>> >> >> > > > > > accordingly.
>> >> >> > > > > >
>> >> >> > > > > > best.
>> >> >> > > > > > Xiangyu
>> >> >> > > > > >
>> >> >> > > > > > Shammon FY <zj...@gmail.com> wrote on Thu, Feb 9, 2023 at 11:31:
>> >> >> > > > > >
>> >> >> > > > > > > Hi David
>> >> >> > > > > > >
>> >> >> > > > > > > Thanks for your answer.
>> >> >> > > > > > >
>> >> >> > > > > > > > Can you elaborate more about how you'd intend to use
>> the
>> >> endpoint? I
>> >> >> > > > > > > think we can ultimately introduce a way of re-declaring
>> >> "per-vertex
>> >> >> > > > > > > defaults," but I'd like to understand the use case bit
>> more
>> >> first.
>> >> >> > > > > > >
>> >> >> > > > > > > For this issue, I mainly consider the consistency of user
>> >> configuration
>> >> >> > > > > > and
>> >> >> > > > > > > job runtime. For sql jobs, users usually set specific
>> >> parallelism for
>> >> >> > > > > > > source and sink, and set a global parallelism for other
>> >> operators.
>> >> >> > > > > These
>> >> >> > > > > > > config items are stored in a config file. For some
>> >> high-priority jobs,
>> >> >> > > > > > > users may want to manage them manually.
>> >> >> > > > > > > 1. When users need to scale the parallelism, they should
>> >> update the
>> >> >> > > > > > config
>> >> >> > > > > > > file and restart flink job, which may take a long time.
>> >> >> > > > > > > 2. After providing the REST API, users can just send a
>> >> request to the
>> >> >> > > > > job
>> >> >> > > > > > > via REST API quickly after updating the config file.
>> >> >> > > > > > > The configuration in the running job and config file
>> should
>> >> be the
>> >> >> > > > > same.
>> >> >> > > > > > > What do you think of this?
>> >> >> > > > > > >
>> >> >> > > > > > > best.
>> >> >> > > > > > > Shammon
>> >> >> > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <
>> >> david.moravek@gmail.com>
>> >> >> > > > > > > wrote:
>> >> >> > > > > > >
>> >> >> > > > > > > > Hi everyone,
>> >> >> > > > > > > >
>> >> >> > > > > > > > Let's try to answer the questions one by one.
>> >> >> > > > > > > >
>> >> >> > > > > > > > *@ConradJam*
>> >> >> > > > > > > >
>> >> >> > > > > > > when the number of "slots" is insufficient, can we
>> >> stop users
>> >> >> > > > > > > rescaling
>> >> >> > > > > > > > > or throw something to tell the user "not enough available slots
>> >> to upgrade,
>> >> >> > > > > > > please
>> >> >> > > > > > > > > check your available slots"?
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > The main property of AdaptiveScheduler is that it can
>> >> adapt to
>> >> >> > > > > > "available
>> >> >> > > > > > > > resources," which means you're still able to make
>> >> progress even
>> >> >> > > > > though
>> >> >> > > > > > > you
>> >> >> > > > > > > > didn't get all the slots you've asked for. Let's break
>> >> down the pros
>> >> >> > > > > > and
>> >> >> > > > > > > > cons of this property.
>> >> >> > > > > > > >
>> >> >> > > > > > > > - (plus) If you lose a TM for some reason, you can
>> still
>> >> recover even
>> >> >> > > > > > if
>> >> >> > > > > > > it
>> >> >> > > > > > > > doesn't come back. We still need to give it some time
>> to
>> >> eliminate
>> >> >> > > > > > > > unnecessary rescaling, which can be controlled by
>> setting
>> >> >> > > > > > > > "resource-stabilization-timeout."
>> >> >> > > > > > > > - (plus) The resources can arrive with a significant
>> >> delay. For
>> >> >> > > > > > example,
>> >> >> > > > > > > > you're unable to spawn enough TMs on time because
>> you've
>> >> run out of
>> >> >> > > > > > > > resources in your k8s cluster, and you need to wait for
>> >> the cluster
>> >> >> > > > > > auto
>> >> >> > > > > > > > scaler to kick in and add new nodes to the cluster. In
>> >> this scenario,
>> >> >> > > > > > > > you'll be able to start making progress faster, at the
>> >> cost of
>> >> >> > > > > multiple
>> >> >> > > > > > > > rescalings (once the remaining resources arrive).
>> >> >> > > > > > > > - (plus) This plays well with the declarative manner of
>> >> today's
>> >> >> > > > > > > > infrastructure. For example, you tell k8s that you need
>> >> 10 TMs, and
>> >> >> > > > > > > you'll
>> >> >> > > > > > > > eventually get them.
>> >> >> > > > > > > > - (minus) In the case of large state jobs, the cost of
>> >> multiple
>> >> >> > > > > > > rescalings
>> >> >> > > > > > > > might outweigh the above.
>> >> >> > > > > > > >
>> >> >> > > > > > > > We've already touched on the solution to this problem
>> on
>> >> the FLIP.
>> >> >> > > > > > Please
>> >> >> > > > > > > > notice the parallelism knob being a range with a lower
>> >> and upper
>> >> >> > > > > bound.
>> >> >> > > > > > > > Setting both the lower and upper bound to the same
>> value
>> >> could give
>> >> >> > > > > the
>> >> >> > > > > > > > behavior you're describing at the cost of giving up
>> some
>> >> properties
>> >> >> > > > > > that
>> >> >> > > > > > > AS
>> >> >> > > > > > > > gives you (you'd be falling back to the
>> >> DefaultScheduler's behavior).
>> >> >> > > > > > > >
>> >> >> > > > > > > > when a user updates the job vertex parallelism, I want to
>> have
>> >> an interface
>> >> >> > > > > > to
>> >> >> > > > > > > > > query the current update parallel execution status,
>> so
>> >> that the
>> >> >> > > > > user
>> >> >> > > > > > or
>> >> >> > > > > > > > > program can understand the current status
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > This is a misunderstanding. We're not introducing the
>> >> RESCALE
>> >> >> > > > > endpoint.
>> >> >> > > > > > > > This endpoint allows you to re-declare the resources
>> >> needed to run
>> >> >> > > > > the
>> >> >> > > > > > > job.
>> >> >> > > > > > > > Once you reach the desired resources (you get more
>> >> resources than the
>> >> >> > > > > > > lower
>> >> >> > > > > > > > bound defines), your job will run.
>> >> >> > > > > > > >
>> >> >> > > > > > > > We can expose a similar endpoint to "resource
>> >> requirements" to give
>> >> >> > > > > you
>> >> >> > > > > > > an
>> >> >> > > > > > > > overview of the resources the vertices already have.
>> You
>> >> can already
>> >> >> > > > > > get
>> >> >> > > > > > > > this from the REST API, so exposing this in yet another
>> >> way should be
>> >> >> > > > > > > > considered carefully.
>> >> >> > > > > > > >
>> >> >> > > > > > > > *@Matthias*
>> >> >> > > > > > > >
>> >> >> > > > > > > > I'm wondering whether it makes sense to add some kind
>> of
>> >> resource ID
>> >> >> > > > > to
>> >> >> > > > > > > the
>> >> >> > > > > > > > > REST API.
>> >> >> > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > That's a good question. I want to think about that and
>> >> get back to
>> >> >> > > > > the
>> >> >> > > > > > > > question later. My main struggle when thinking about
>> this
>> >> is, "if
>> >> >> > > > > this
>> >> >> > > > > > > > would be an idempotent POST endpoint," would it be any
>> >> different?
>> >> >> > > > > > > >
>> >> >> > > > > > > > How often do we allow resource requirements to be
>> changed?
>> >> >> > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > There shall be no rate limiting on the FLINK side. If
>> >> this is
>> >> >> > > > > something
>> >> >> > > > > > > > your environment needs, you can achieve it on a
>> different
>> >> layer ("we
>> >> >> > > > > > > can't
>> >> >> > > > > > > > have FLINK to do everything").
>> >> >> > > > > > > >
>> >> >> > > > > > > > Versioning the JobGraph in the JobGraphStore rather
>> than
>> >> overwriting
>> >> >> > > > > it
>> >> >> > > > > > > > > might be an idea.
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > This sounds interesting since it would be closer to the
>> >> JobGraph
>> >> >> > > > > being
>> >> >> > > > > > > > immutable. The main problem I see here is that this
>> would
>> >> introduce a
>> >> >> > > > > > > > BW-incompatible change so it might be a topic for
>> >> follow-up FLIP.
>> >> >> > > > > > > >
>> >> >> > > > > > > > I'm just wondering whether we bundle two things
>> together
>> >> that are
>> >> >> > > > > > > actually
>> >> >> > > > > > > > > separate
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > Yup, this is how we think about it as well. The main
>> >> question is,
>> >> >> > > > > "who
>> >> >> > > > > > > > should be responsible for bookkeeping 1) the JobGraph
>> and
>> >> 2) the
>> >> >> > > > > > > > JobResourceRequirements". The JobMaster would be the
>> >> right place for
>> >> >> > > > > > > both,
>> >> >> > > > > > > > but it's currently not the case, and we're tightly
>> >> coupling the
>> >> >> > > > > > > dispatcher
>> >> >> > > > > > > > with the JobMaster.
>> >> >> > > > > > > >
>> >> >> > > > > > > > Initially, we tried to introduce a separate HA
>> component
>> >> in JobMaster
>> >> >> > > > > > for
>> >> >> > > > > > > > bookkeeping the JobResourceRequirements, but that
>> proved
>> >> to be a more
>> >> >> > > > > > > > significant effort adding additional mess to the
>> already
>> >> messy HA
>> >> >> > > > > > > > ecosystem. Another approach we've discussed was
>> mutating
>> >> the JobGraph
>> >> >> > > > > > and
>> >> >> > > > > > > > setting JRR into the JobGraph structure itself.
>> >> >> > > > > > > >
>> >> >> > > > > > > > The middle ground for keeping this effort reasonably
>> >> sized and not
>> >> >> > > > > > > > violating "we want to keep JG immutable" too much is
>> >> keeping the
>> >> >> > > > > > > > JobResourceRequirements separate as an internal config
>> >> option in
>> >> >> > > > > > > JobGraph's
>> >> >> > > > > > > > configuration.
>> >> >> > > > > > > >
>> >> >> > > > > > > > We ultimately need to rethink the tight coupling of
>> >> Dispatcher and
>> >> >> > > > > > > > JobMaster, but it needs to be a separate effort.
>> >> >> > > > > > > >
>> >> >> > > > > > > > ...also considering the amount of data that can be
>> stored
>> >> in a
>> >> >> > > > > > > > > ConfigMap/ZooKeeper node if versioning the resource
>> >> requirement
>> >> >> > > > > > change
>> >> >> > > > > > > as
>> >> >> > > > > > > > > proposed in my previous item is an option for us.
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > AFAIK we're only storing pointers to the S3 objects in
>> HA
>> >> metadata,
>> >> >> > > > > so
>> >> >> > > > > > we
>> >> >> > > > > > > > should be okay with having larger structures for now.
>> >> >> > > > > > > >
>> >> >> > > > > > > > Updating the JobGraphStore means adding more requests
>> to
>> >> the HA
>> >> >> > > > > backend
>> >> >> > > > > > > > API.
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > It's fine unless you intend to override the resource
>> >> requirements a
>> >> >> > > > > few
>> >> >> > > > > > > > times per second.
>> >> >> > > > > > > >
>> >> >> > > > > > > > *@Shammon*
>> >> >> > > > > > > >
>> >> >> > > > > > > > How about adding some more information such as vertex
>> type
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > Since it was intended as a "debug" endpoint, it makes
>> >> complete sense!
>> >> >> > > > > > > >
>> >> >> > > > > > > >  For sql jobs, we always use a unified parallelism for
>> >> most vertices.
>> >> >> > > > > > Can
>> >> >> > > > > > > > > we provide them with a more convenient setting method
>> >> instead of
>> >> >> > > > > each
>> >> >> > > > > > > > one?
>> >> >> > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > I completely feel with this. The main thoughts when
>> >> designing the API
>> >> >> > > > > > > were:
>> >> >> > > > > > > > - We want to keep it clean and easy to understand.
>> >> >> > > > > > > > - Global parallelism can be modeled using per-vertex
>> >> parallelism but
>> >> >> > > > > > not
>> >> >> > > > > > > > the other way around.
>> >> >> > > > > > > > - The API will be used by external tooling (operator,
>> >> auto scaler).
>> >> >> > > > > > > >
>> >> >> > > > > > > > Can you elaborate more about how you'd intend to use
>> the
>> >> endpoint? I
>> >> >> > > > > > > think
>> >> >> > > > > > > > we can ultimately introduce a way of re-declaring
>> >> "per-vertex
>> >> >> > > > > > defaults,"
>> >> >> > > > > > > > but I'd like to understand the use case a bit more first.
>> >> >> > > > > > > >
>> >> >> > > > > > > > *@Weijie*
>> >> >> > > > > > > >
>> >> >> > > > > > > > What is the default value here (based on what
>> >> configuration), or just
>> >> >> > > > > > > > > infinite?
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > Currently, for the lower bound, it's always one, and
>> for
>> >> the upper
>> >> >> > > > > > bound,
>> >> >> > > > > > > > it's either parallelism (if defined) or the
>> >> maxParallelism of the
>> >> >> > > > > > vertex
>> >> >> > > > > > > in
>> >> >> > > > > > > > JobGraph. This question might be another signal for
>> >> making the
>> >> >> > > > > defaults
>> >> >> > > > > > > > explicit (see the answer to Shammon's question above).
>> >> >> > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > Thanks, everyone, for your initial thoughts!
>> >> >> > > > > > > >
>> >> >> > > > > > > > Best,
>> >> >> > > > > > > > D.
>> >> >> > > > > > > >
>> >> >> > > > > > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <
>> >> guoweijiereswqa@gmail.com
>> >> >> > > > > >
>> >> >> > > > > > > > wrote:
>> >> >> > > > > > > >
>> >> >> > > > > > > > > Thanks David for driving this. This is a very
>> valuable
>> >> work,
>> >> >> > > > > > especially
>> >> >> > > > > > > > for
>> >> >> > > > > > > > > cloud native environment.
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > >> How about adding some more information such as
>> >> vertex type
>> >> >> > > > > > > > > (SOURCE/MAP/JOIN, etc.) in the response of `get jobs
>> jobs
>> >> >> > > > > > > > > resource-requirements`? For users, only vertex-id may
>> >> be difficult
>> >> >> > > > > to
>> >> >> > > > > > > > > understand.
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > +1 for this suggestion, including jobvertex's name in
>> >> the response
>> >> >> > > > > > body
>> >> >> > > > > > > > is
>> >> >> > > > > > > > > more
>> >> >> > > > > > > > > user-friendly.
>> >> >> > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > I saw this sentence in FLIP: "Setting the upper bound
>> >> to -1 will
>> >> >> > > > > > reset
>> >> >> > > > > > > > the
>> >> >> > > > > > > > > value to the default setting."  What is the default
>> >> value here
>> >> >> > > > > (based
>> >> >> > > > > > > on
>> >> >> > > > > > > > > what configuration), or just infinite?
>> >> >> > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > Best regards,
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > Weijie
>> >> >> > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > Shammon FY <zj...@gmail.com> wrote on Mon, Feb 6, 2023 at 18:06:
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > > Hi David
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > Thanks for initiating this discussion. I think
>> >> declaring job
>> >> >> > > > > > resource
>> >> >> > > > > > > > > > requirements by REST API is very valuable. I just
>> >> left some
>> >> >> > > > > > comments
>> >> >> > > > > > > as
>> >> >> > > > > > > > > > followed
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > 1) How about adding some more information such as
>> >> vertex type
>> >> >> > > > > > > > > > (SOURCE/MAP/JOIN, etc.) in the response of `get
>> >> jobs
>> >> >> > > > > > > > > > resource-requirements`? For users, only vertex-id
>> may
>> >> be
>> >> >> > > > > difficult
>> >> >> > > > > > to
>> >> >> > > > > > > > > > understand.
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > 2) For sql jobs, we always use a unified
>> parallelism
>> >> for most
>> >> >> > > > > > > vertices.
>> >> >> > > > > > > > > Can
>> >> >> > > > > > > > > > we provide them with a more convenient setting
>> method
>> >> instead of
>> >> >> > > > > > each
>> >> >> > > > > > > > > one?
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > Best,
>> >> >> > > > > > > > > > Shammon
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <
>> >> >> > > > > > matthias.pohl@aiven.io
>> >> >> > > > > > > > > > .invalid>
>> >> >> > > > > > > > > > wrote:
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > > Thanks David for creating this FLIP. It sounds promising and useful
>> >> >> > > > > > > > > > > to have. Here are some thoughts from my side (some of them might be
>> >> >> > > > > > > > > > > rather a follow-up and not necessarily part of this FLIP):
>> >> >> > > > > > > > > > > - I'm wondering whether it makes sense to add some kind of resource
>> >> >> > > > > > > > > > > ID to the REST API. This would give Flink a tool to verify the PATCH
>> >> >> > > > > > > > > > > request of the external system in a compare-and-set kind of manner.
>> >> >> > > > > > > > > > > AFAIU, the process requires the external system to retrieve the
>> >> >> > > > > > > > > > > resource requirements first (to retrieve the vertex IDs). A resource
>> >> >> > > > > > > > > > > ID <ABC> would be sent along as a unique identifier for the provided
>> >> >> > > > > > > > > > > setup. It's essentially the version ID of the currently deployed
>> >> >> > > > > > > > > > > resource requirement configuration. Flink doesn't know whether the
>> >> >> > > > > > > > > > > external system would use the provided information in some way to
>> >> >> > > > > > > > > > > derive a new set of resource requirements for this job. The
>> >> >> > > > > > > > > > > subsequent PATCH request with updated resource requirements would
>> >> >> > > > > > > > > > > include the previously retrieved resource ID <ABC>. The PATCH call
>> >> >> > > > > > > > > > > would fail if there was a concurrent PATCH call in between,
>> >> >> > > > > > > > > > > indicating to the external system that the resource requirements were
>> >> >> > > > > > > > > > > concurrently updated.
>> >> >> > > > > > > > > > > - How often do we allow resource requirements to be changed? That
>> >> >> > > > > > > > > > > question might make my previous comment on the resource ID obsolete
>> >> >> > > > > > > > > > > because we could just make any PATCH call fail if there was a
>> >> >> > > > > > > > > > > resource requirement update within a certain time frame before the
>> >> >> > > > > > > > > > > request. But such a time period is something we might want to make
>> >> >> > > > > > > > > > > configurable then, I guess.
>> >> >> > > > > > > > > > > - Versioning the JobGraph in the JobGraphStore rather than
>> >> >> > > > > > > > > > > overwriting it might be an idea. This would enable us to provide
>> >> >> > > > > > > > > > > resource requirement changes in the UI or through the REST API. It is
>> >> >> > > > > > > > > > > related to a problem around keeping track of the exception history
>> >> >> > > > > > > > > > > within the AdaptiveScheduler and also having to consider multiple
>> >> >> > > > > > > > > > > versions of a JobGraph. But for that one, we use the
>> >> >> > > > > > > > > > > ExecutionGraphInfoStore right now.
>> >> >> > > > > > > > > > > - Updating the JobGraph in the JobGraphStore makes sense. I'm just
>> >> >> > > > > > > > > > > wondering whether we bundle two things together that are actually
>> >> >> > > > > > > > > > > separate: the business logic and the execution configuration (the
>> >> >> > > > > > > > > > > resource requirements). I'm aware that this is not a flaw of the
>> >> >> > > > > > > > > > > current FLIP but rather something that was not necessary to address
>> >> >> > > > > > > > > > > in the past because the JobGraph was kind of static. I don't remember
>> >> >> > > > > > > > > > > whether that was already discussed while working on the
>> >> >> > > > > > > > > > > AdaptiveScheduler for FLIP-160 [1]. Maybe I'm missing some
>> >> >> > > > > > > > > > > functionality here that requires us to have everything in one place.
>> >> >> > > > > > > > > > > But it feels like updating the entire JobGraph for what could
>> >> >> > > > > > > > > > > actually be a "config change" is not reasonable. ...also considering
>> >> >> > > > > > > > > > > the amount of data that can be stored in a ConfigMap/ZooKeeper node
>> >> >> > > > > > > > > > > if versioning the resource requirement change as proposed in my
>> >> >> > > > > > > > > > > previous item is an option for us.
>> >> >> > > > > > > > > > > - Updating the JobGraphStore means adding more requests to the HA
>> >> >> > > > > > > > > > > backend API. There were some concerns shared in the discussion thread
>> >> >> > > > > > > > > > > [2] for FLIP-270 [3] on pressuring the k8s API server in the past
>> >> >> > > > > > > > > > > with too many calls. Even though it's more likely to be caused by
>> >> >> > > > > > > > > > > checkpointing, I still wanted to bring it up. We're working on a
>> >> >> > > > > > > > > > > standardized performance test to prepare going forward with FLIP-270
>> >> >> > > > > > > > > > > [3] right now.
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > Best,
>> >> >> > > > > > > > > > > Matthias
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
>> >> >> > > > > > > > > > > [2] https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
>> >> >> > > > > > > > > > > [3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <jam.gzczy@gmail.com> wrote:
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > > Hi David:
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > > Thank you for driving this FLIP, which helps reduce Flink downtime.
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > > For this FLIP, I would like to share a few ideas:
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >    - When the number of slots is insufficient, can we stop users from
>> >> >> > > > > > > > > > > >    rescaling, or report something like "not enough available slots to
>> >> >> > > > > > > > > > > >    upgrade, please check your available slots"? Or we could have a
>> >> >> > > > > > > > > > > >    request switch (true/false) to allow this behavior.
>> >> >> > > > > > > > > > > >    - When a user updates job-vertex parallelism, I want an interface
>> >> >> > > > > > > > > > > >    to query the status of the current parallelism update, so that the
>> >> >> > > > > > > > > > > >    user or a program can understand the current state. This would also
>> >> >> > > > > > > > > > > >    help tools like the Flink Kubernetes Operator [1] with management.
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > > {
>> >> >> > > > > > > > > > > >   status: Failed
>> >> >> > > > > > > > > > > >   reason: "not enough available slots to upgrade, please check your
>> >> >> > > > > > > > > > > > available slots"
>> >> >> > > > > > > > > > > > }
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >    - *Pending*: the job has joined the upgrade queue; it will be
>> >> >> > > > > > > > > > > >    updated later
>> >> >> > > > > > > > > > > >    - *Rescaling*: the job is currently rescaling; wait for it to finish
>> >> >> > > > > > > > > > > >    - *Finished*: the rescaling has completed
>> >> >> > > > > > > > > > > >    - *Failed*: something went wrong, so the job cannot be upgraded
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > > I would like to add the above content to the FLIP; what do you think?
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >    1. https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > > David Morávek <dm...@apache.org> wrote on Fri, Feb 3, 2023 at 16:42:
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > Hi everyone,
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > This FLIP [1] introduces a new REST API for declaring resource
>> >> >> > > > > > > > > > > > > requirements for the Adaptive Scheduler. There seems to be a clear
>> >> >> > > > > > > > > > > > > need for this API based on the discussion on the "Reworking the
>> >> >> > > > > > > > > > > > > Rescale API" [2] thread.
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > Before we get started, this work is heavily based on the prototype
>> >> >> > > > > > > > > > > > > [3] created by Till Rohrmann, and the FLIP is being published with
>> >> >> > > > > > > > > > > > > his consent. Big shoutout to him!
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > Last but not least, thanks to Chesnay and Roman for the initial
>> >> >> > > > > > > > > > > > > reviews and discussions.
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > The best start would be watching a short demo [4] that I've
>> >> >> > > > > > > > > > > > > recorded, which illustrates the newly added capabilities (rescaling
>> >> >> > > > > > > > > > > > > the running job, handing back resources to the RM, and session
>> >> >> > > > > > > > > > > > > cluster support).
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > The intuition behind the FLIP is being able to define resource
>> >> >> > > > > > > > > > > > > requirements ("resource boundaries") externally that the
>> >> >> > > > > > > > > > > > > AdaptiveScheduler can navigate within. This is a building block for
>> >> >> > > > > > > > > > > > > higher-level efforts such as an external Autoscaler. The natural
>> >> >> > > > > > > > > > > > > extension of this work would be to allow specifying per-vertex
>> >> >> > > > > > > > > > > > > ResourceProfiles.
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > Looking forward to your thoughts; any feedback is appreciated!
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
>> >> >> > > > > > > > > > > > > [2] https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
>> >> >> > > > > > > > > > > > > [3] https://github.com/tillrohrmann/flink/tree/autoscaling
>> >> >> > > > > > > > > > > > > [4] https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > Best,
>> >> >> > > > > > > > > > > > > D.
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > > --
>> >> >> > > > > > > > > > > > Best
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > > ConradJam
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > >
>> >> >> > > > >
>> >>
>>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by Maximilian Michels <mx...@apache.org>.
I agree that it is useful to have a configurable lower bound. Thanks
for looking into it as part of a follow-up!

No objections from my side to move forward with the vote.

-Max

On Tue, Feb 28, 2023 at 1:36 PM David Morávek <dm...@apache.org> wrote:
>
> > I suppose we could further remove the min because it would always be
> safer to scale down if resources are not available than not to run at
> all [1].
>
> Apart from what @Roman has already mentioned, there are still cases where
> we're certain that there is no point in running the jobs with resources
> lower than X; e.g., because the state is too large to be processed with
> parallelism of 1; this allows you not to waste resources if you're certain
> that the job would go into the restart loop / won't be able to checkpoint
>
> I believe that for most use cases, simply keeping the lower bound at 1 will
> be sufficient.
>
> > I saw that the minimum bound is currently not used in the code you posted
> above [2]. Is that still planned?
>
> Yes. We already allow setting the lower bound via API, but it's not
> considered by the scheduler. I'll address this limitation in a separate
> issue.
>
> > Note that originally we had assumed min == max but I think that would be
> a less safe scaling approach because we would get stuck waiting for
> resources when they are not available, e.g. k8s resource limits reached.
>
> 100% agreed; The above-mentioned knobs should allow you to balance the
> trade-off.
>
>
> Does that make sense?
>
> Best,
> D.
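To make the declarative convergence semantics discussed above concrete, here is a minimal illustrative sketch (plain Python, not Flink's actual scheduler code; the function name and shape are assumptions) of how a scheduler might pick a parallelism for a single vertex from declared [min, max] bounds and the slots currently available:

```python
# Hypothetical sketch of declarative parallelism resolution for one vertex.
# If fewer slots than the lower bound are available, the job cannot run and
# must transition back into the waiting-for-resources state.

def decide_parallelism(lower_bound, upper_bound, available_slots):
    """Return the parallelism to run with, or None if the job must wait
    for resources (fewer slots available than the declared lower bound)."""
    if lower_bound > upper_bound:
        raise ValueError("lower bound must not exceed upper bound")
    if available_slots < lower_bound:
        return None  # requirements cannot be satisfied -> wait for resources
    # otherwise run as close to the declared upper bound as the slots allow
    return min(upper_bound, available_slots)
```

With a lower bound of 1 (the default discussed above), the job runs with whatever slots exist; raising the lower bound expresses "do not run at all below X".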
>
>
>
> On Tue, Feb 28, 2023 at 1:14 PM Roman Khachatryan <ro...@apache.org> wrote:
>
> > Hi,
> >
> > Thanks for the update, I think distinguishing the rescaling behaviour and
> > the desired parallelism declaration is important.
> >
> > Having the ability to specify min parallelism might be useful in
> > environments with multiple jobs: Scheduler will then have an option to stop
> > the less suitable job.
> > In other setups, where the job should not be stopped at all, the user can
> > always set it to 0.
> >
> > Regards,
> > Roman
> >
> >
> > On Tue, Feb 28, 2023 at 12:58 PM Maximilian Michels <mx...@apache.org>
> > wrote:
> >
> >> Hi David,
> >>
> >> Thanks for the update! We consider using the new declarative resource
> >> API for autoscaling. Currently, we treat a scaling decision as a new
> >> deployment which means surrendering all resources to Kubernetes and
> >> subsequently reallocating them for the rescaled deployment. The
> >> declarative resource management API is a great step forward because it
> >> allows us to do faster and safer rescaling. Faster, because we can
> >> continue to run while resources are pre-allocated which minimizes
> >> downtime. Safer, because we can't get stuck when the desired resources
> >> are not available.
> >>
> >> An example with two vertices and their respective parallelisms:
> >>   v1: 50
> >>   v2: 10
> >> Let's assume slot sharing is disabled, so we need 60 task slots to run
> >> the vertices.
> >>
> >> If the autoscaler was to decide to scale up v1 and v2, it could do so
> >> in a safe way by using min/max configuration:
> >>   v1: [min: 50, max: 70]
> >>   v2: [min: 10, max: 20]
> >> This would then need 90 task slots to run at max capacity.
> >>
> >> I suppose we could further remove the min because it would always be
> >> safer to scale down if resources are not available than to not run at
> >> all [1]. In fact, I saw that the minimum bound is currently not used
> >> in the code you posted above [2]. Is that still planned?
> >>
> >> -Max
> >>
> >> PS: Note that originally we had assumed min == max but I think that
> >> would be a less safe scaling approach because we would get stuck
> >> waiting for resources when they are not available, e.g. k8s resource
> >> limits reached.
> >>
> >> [1] However, there might be costs involved with executing the
> >> rescaling, e.g. for using external storage like s3, especially without
> >> local recovery.
> >> [2]
> >> https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9
> >>
> >> On Tue, Feb 28, 2023 at 9:33 AM David Morávek <dm...@apache.org> wrote:
> >> >
> >> > Hi Everyone,
> >> >
> >> > We had some more talks about the pre-allocation of resources with @Max,
> >> and
> >> > here is the final state that we've converged to for now:
> >> >
> >> > The vital thing to note about the new API is that it's declarative,
> >> > meaning we're declaring the desired state to which we want our job to
> >> > converge; if, after a requirements update, the job no longer holds the
> >> > desired resources (fewer resources than the lower bound), it will be
> >> > canceled and transition back into the waiting-for-resources state.
> >> >
> >> > In some use cases, you might always want to rescale to the upper bound
> >> > (this goes along the lines of "preallocating resources" and minimizing
> >> the
> >> > number of rescales, which is especially useful with the large state).
> >> This
> >> > can be controlled by two knobs that already exist:
> >> >
> >> > 1) "jobmanager.adaptive-scheduler.min-parallelism-increase" - this
> >> > affects the minimal parallelism increase step of a running job; we'll
> >> > slightly change the semantics, and we'll trigger rescaling either once
> >> > this condition is met or when you hit the ceiling; setting this to a
> >> > high number will ensure that you always rescale to the upper bound
> >> >
> >> > 2) "jobmanager.adaptive-scheduler.resource-stabilization-timeout" - for
> >> new
> >> > and already restarting jobs, we'll always respect this timeout, which
> >> > allows you to wait for more resources even though you already have more
> >> > resources than defined in the lower bound; again, in the case we reach
> >> the
> >> > ceiling (the upper bound), we'll transition into the executing state.
> >> >
> >> >
> >> > We're still planning to dig deeper in this direction with other efforts,
> >> > but this is already good enough and should allow us to move the FLIP
> >> > forward.
> >> >
> >> > WDYT? Unless there are any objections to the above, I'd like to
> >> > proceed to a vote.
> >> >
> >> > Best,
> >> > D.
> >> >
> >> > On Thu, Feb 23, 2023 at 5:39 PM David Morávek <dm...@apache.org> wrote:
> >> >
> >> > > Hi Everyone,
> >> > >
> >> > > @John
> >> > >
> >> > > This is a problem that we've spent some time trying to crack; in the
> >> end,
> >> > > we've decided to go against doing any upgrades to JobGraphStore from
> >> > > JobMaster to avoid having multiple writers that are guarded by
> >> different
> >> > > leader election lock (Dispatcher and JobMaster might live in a
> >> different
> >> > > process). The contract we've decided to choose instead is leveraging
> >> the
> >> > > idempotency of the endpoint and having the user of the API retry in
> >> case
> >> > > we're unable to persist new requirements in the JobGraphStore [1]. We
> >> > > eventually need to move JobGraphStore out of the dispatcher, but
> >> that's way
> >> > > out of the scope of this FLIP. The solution is a deliberate
> >> trade-off. The
> >> > > worst scenario is that the Dispatcher fails over in between retries,
> >> which
> >> > > would simply rescale the job to meet the previous resource
> >> requirements
> >> > > (more extended unavailability of underlying HA storage would have
> >> worse
> >> > > consequences than this). Does that answer your question?
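Because the endpoint is idempotent, the retry contract described above is straightforward on the client side; a minimal sketch (illustrative Python, not the actual Flink REST client) could look like this:

```python
# Hypothetical retry loop for an idempotent requirements update: since
# repeating the same PUT is safe, the caller just retries with backoff
# until the new requirements are persisted or attempts run out.
import time

def put_with_retries(send_put, attempts=5, backoff_s=1.0):
    """send_put() returns True on success; retry on transient failure."""
    for attempt in range(attempts):
        if send_put():
            return True
        time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return False
```

In the worst case sketched above, a Dispatcher failover between retries simply leaves the previously persisted requirements in effect.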
> >> > >
> >> > > @Matthias
> >> > >
> >> > > Good catch! I'm fixing it now, thanks!
> >> > >
> >> > > [1]
> >> > >
> >> https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9#diff-a4b690fb2c4975d25b05eb4161617af0d704a85ff7b1cad19d3c817c12f1e29cR1151
> >> > >
> >> > > Best,
> >> > > D.
> >> > >
> >> > > On Tue, Feb 21, 2023 at 12:24 AM John Roesler <vv...@apache.org>
> >> wrote:
> >> > >
> >> > >> Thanks for the FLIP, David!
> >> > >>
> >> > >> I just had one small question. IIUC, the REST API PUT request will go
> >> > >> through the new DispatcherGateway method to be handled. Then, after
> >> > >> validation, the dispatcher would call the new JobMasterGateway
> >> method to
> >> > >> actually update the job.
> >> > >>
> >> > >> Which component will write the updated JobGraph? I just wanted to
> >> make
> >> > >> sure it’s the JobMaster because it it were the dispatcher, there
> >> could be a
> >> > >> race condition with the async JobMaster method.
> >> > >>
> >> > >> Thanks!
> >> > >> -John
> >> > >>
> >> > >> On Mon, Feb 20, 2023, at 07:34, Matthias Pohl wrote:
> >> > >> > Thanks for your clarifications, David. I don't have any additional
> >> major
> >> > >> > points to add. One thing about the FLIP: The RPC layer API for
> >> updating
> >> > >> the
> >> > >> > JRR returns a future with a JRR? I don't see value in returning a
> >> JRR
> >> > >> here
> >> > >> > since it's an idempotent operation? Wouldn't it be enough to return
> >> > >> > CompletableFuture<Void> here? Or am I missing something?
> >> > >> >
> >> > >> > Matthias
> >> > >> >
> >> > >> > On Mon, Feb 20, 2023 at 1:48 PM Maximilian Michels <mxm@apache.org
> >> >
> >> > >> wrote:
> >> > >> >
> >> > >> >> Thanks David! If we could get the pre-allocation working as part
> >> of
> >> > >> >> the FLIP, that would be great.
> >> > >> >>
> >> > >> >> Concerning the downscale case, I agree this is a special case for
> >> the
> >> > >> >> (single-job) application mode where we could re-allocate slots in
> >> a
> >> > >> >> way that could leave entire task managers unoccupied which we
> >> would
> >> > >> >> then be able to release. The goal essentially is to reduce slot
> >> > >> >> fragmentation on scale down by packing the slots efficiently. The
> >> > >> >> easiest way to add this optimization when running in application
> >> mode
> >> > >> >> would be to drop as many task managers during the restart such
> >> > >> >> that NUM_AVAILABLE_SLOTS >= NUM_REQUIRED_SLOTS stays true. We can
> >> > >> >> look into
> >> > >> >> this independently of the FLIP.
> >> > >> >>
> >> > >> >> Feel free to start the vote.
> >> > >> >>
> >> > >> >> -Max
> >> > >> >>
> >> > >> >> On Mon, Feb 20, 2023 at 9:10 AM David Morávek <dm...@apache.org>
> >> wrote:
> >> > >> >> >
> >> > >> >> > Hi everyone,
> >> > >> >> >
> >> > >> >> > Thanks for the feedback! I've updated the FLIP to use
> >> idempotent PUT
> >> > >> API
> >> > >> >> instead of PATCH and to properly handle lower bound settings, to
> >> > >> support
> >> > >> >> the "pre-allocation" of the resources.
> >> > >> >> >
> >> > >> >> > @Max
> >> > >> >> >
> >> > >> >> > > How hard would it be to address this issue in the FLIP?
> >> > >> >> >
> >> > >> >> > I've included this in the FLIP. It might not be too hard to
> >> implement
> >> > >> >> this in the end.
> >> > >> >> >
> >> > >> >> > > B) drop as many superfluous task managers as needed
> >> > >> >> >
> >> > >> >> > I've intentionally left this part out for now because this
> >> ultimately
> >> > >> >> needs to be the responsibility of the Resource Manager. After
> >> all, in
> >> > >> the
> >> > >> >> Session Cluster scenario, the Scheduler doesn't have the bigger
> >> > >> picture of
> >> > >> >> other tasks of other jobs running on those TMs. This will most
> >> likely
> >> > >> be a
> >> > >> >> topic for another FLIP.
> >> > >> >> >
> >> > >> >> > WDYT? If there are no other questions or concerns, I'd like to
> >> start
> >> > >> the
> >> > >> >> vote on Wednesday.
> >> > >> >> >
> >> > >> >> > Best,
> >> > >> >> > D.
> >> > >> >> >
> >> > >> >> > On Wed, Feb 15, 2023 at 3:34 PM Maximilian Michels <
> >> mxm@apache.org>
> >> > >> >> wrote:
> >> > >> >> >>
> >> > >> >> >> I missed that the FLIP states:
> >> > >> >> >>
> >> > >> >> >> > Currently, even though we’d expose the lower bound for
> >> clarity and
> >> > >> >> API completeness, we won’t allow setting it to any other value
> >> than one
> >> > >> >> until we have full support throughout the stack.
> >> > >> >> >>
> >> > >> >> >> How hard would it be to address this issue in the FLIP?
> >> > >> >> >>
> >> > >> >> >> There is not much value to offer setting a lower bound which
> >> won't
> >> > >> be
> >> > >> >> >> respected / throw an error when it is set. If we had support
> >> for a
> >> > >> >> >> lower bound, we could enforce a resource contract externally
> >> via
> >> > >> >> >> setting lowerBound == upperBound. That ties back to the
> >> Rescale API
> >> > >> >> >> discussion we had. I want to better understand what the major
> >> > >> concerns
> >> > >> >> >> would be around allowing this.
> >> > >> >> >>
> >> > >> >> >> Just to outline how I imagine the logic to work:
> >> > >> >> >>
> >> > >> >> >> A) The resource constraints are already met => Nothing changes
> >> > >> >> >> B) More resources available than required => Cancel the job,
> >> drop as
> >> > >> >> >> many superfluous task managers as needed, restart the job
> >> > >> >> >> C) Less resources available than required => Acquire new task
> >> > >> >> >> managers, wait for them to register, cancel and restart the job
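The three-case logic outlined above can be sketched as follows (illustrative Python only, assuming slot sharing is disabled so required slots equals the sum of parallelisms; the real decision logic lives inside Flink's scheduler):

```python
# Sketch of the A/B/C rescaling decision described above, comparing the
# slots required by the new resource requirements against what is available.

def plan_action(required_slots, available_slots):
    if available_slots == required_slots:
        return "NOTHING_CHANGES"          # A) constraints already met
    if available_slots > required_slots:
        return "DROP_SUPERFLUOUS_TMS"     # B) release the excess, restart
    return "ACQUIRE_TMS_AND_RESTART"      # C) wait for new slots, restart
```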
> >> > >> >> >>
> >> > >> >> >> I'm open to helping out with the implementation.
> >> > >> >> >>
> >> > >> >> >> -Max
> >> > >> >> >>
> >> > >> >> >> On Mon, Feb 13, 2023 at 7:45 PM Maximilian Michels <
> >> mxm@apache.org>
> >> > >> >> wrote:
> >> > >> >> >> >
> >> > >> >> >> > Based on further discussion I had with Chesnay on this PR
> >> [1], I
> >> > >> think
> >> > >> >> >> > jobs would currently go into a restarting state after the
> >> resource
> >> > >> >> >> > requirements have changed. This wouldn't achieve what we had
> >> in
> >> > >> mind,
> >> > >> >> >> > i.e. sticking to the old resource requirements until enough
> >> slots
> >> > >> are
> >> > >> >> >> > available to fulfil the new resource requirements. So this
> >> may
> >> > >> not be
> >> > >> >> >> > 100% what we need but it could be extended to do what we
> >> want.
> >> > >> >> >> >
> >> > >> >> >> > -Max
> >> > >> >> >> >
> >> > >> >> >> > [1]
> >> > >> https://github.com/apache/flink/pull/21908#discussion_r1104792362
> >> > >> >> >> >
> >> > >> >> >> > On Mon, Feb 13, 2023 at 7:16 PM Maximilian Michels <
> >> > >> mxm@apache.org>
> >> > >> >> wrote:
> >> > >> >> >> > >
> >> > >> >> >> > > Hi David,
> >> > >> >> >> > >
> >> > >> >> >> > > This is awesome! Great writeup and demo. This is pretty
> >> much
> >> > >> what we
> >> > >> >> >> > > need for the autoscaler as part of the Flink Kubernetes
> >> operator
> >> > >> >> [1].
> >> > >> >> >> > > Scaling Flink jobs effectively is hard but fortunately we
> >> have
> >> > >> >> solved
> >> > >> >> >> > > the issue as part of the Flink Kubernetes operator. The
> >> only
> >> > >> >> critical
> >> > >> >> >> > > piece we are missing is a better way to execute scaling
> >> > >> decisions,
> >> > >> >> as
> >> > >> >> >> > > discussed in [2].
> >> > >> >> >> > >
> >> > >> >> >> > > Looking at your proposal, we would set lowerBound ==
> >> upperBound
> >> > >> for
> >> > >> >> >> > > the parallelism because we want to fully determine the
> >> > >> parallelism
> >> > >> >> >> > > externally based on the scaling metrics. Does that sound
> >> right?
> >> > >> >> >> > >
> >> > >> >> >> > > What is the timeline for these changes? Is there a JIRA?
> >> > >> >> >> > >
> >> > >> >> >> > > Cheers,
> >> > >> >> >> > > Max
> >> > >> >> >> > >
> >> > >> >> >> > > [1]
> >> > >> >>
> >> > >>
> >> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
> >> > >> >> >> > > [2]
> >> > >> >> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> >> > >> >> >> > >
> >> > >> >> >> > > On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu <
> >> > >> xiangyu0xf@gmail.com>
> >> > >> >> wrote:
> >> > >> >> >> > > >
> >> > >> >> >> > > > Hi David,
> >> > >> >> >> > > >
> >> > >> >> >> > > > Thanks for your reply.  I think your response totally
> >> make
> >> > >> >> sense.  This
> >> > >> >> >> > > > flip targets on declaring required resource to
> >> ResourceManager
> >> > >> >> instead of
> >> > >> >> >> > > > using  ResourceManager to add/remove TMs directly.
> >> > >> >> >> > > >
> >> > >> >> >> > > > Best,
> >> > >> >> >> > > > Xiangyu
> >> > >> >> >> > > >
> >> > >> >> >> > > >
> >> > >> >> >> > > >
> >> > >> >> >> > > > David Morávek <da...@gmail.com> wrote on Mon, Feb 13, 2023 at 15:46:
> >> > >> >> >> > > >
> >> > >> >> >> > > > > Hi everyone,
> >> > >> >> >> > > > >
> >> > >> >> >> > > > > @Shammon
> >> > >> >> >> > > > >
> >> > >> >> >> > > > > I'm not entirely sure what "config file" you're
> >> referring
> >> > >> to.
> >> > >> >> You can, of
> >> > >> >> >> > > > > course, override the default parallelism in
> >> > >> "flink-conf.yaml",
> >> > >> >> but for
> >> > >> >> >> > > > > sinks and sources, the parallelism needs to be tweaked
> >> on
> >> > >> the
> >> > >> >> connector
> >> > >> >> >> > > > > level ("WITH" statement).
> >> > >> >> >> > > > >
> >> > >> >> >> > > > > This is something that should be achieved with tooling
> >> > >> around
> >> > >> >> Flink. We
> >> > >> >> >> > > > > want to provide an API on the lowest level that
> >> generalizes
> >> > >> >> well. Achieving
> >> > >> >> >> > > > > what you're describing should be straightforward with
> >> this
> >> > >> API.
> >> > >> >> >> > > > >
> >> > >> >> >> > > > > @Xiangyu
> >> > >> >> >> > > > >
> >> > >> >> >> > > > > Is it possible for this REST API to declare TM
> >> resources in
> >> > >> the
> >> > >> >> future?
> >> > >> >> >> > > > >
> >> > >> >> >> > > > >
> >> > >> >> >> > > > > Would you like to add/remove TMs if you use an active
> >> > >> Resource
> >> > >> >> Manager?
> >> > >> >> >> > > > > This would be out of the scope of this effort since it
> >> > >> targets
> >> > >> >> the
> >> > >> >> >> > > > > scheduler component only (we make no assumptions about
> >> the
> >> > >> used
> >> > >> >> Resource
> >> > >> >> >> > > > > Manager). Also, the AdaptiveScheduler is only intended
> >> to be
> >> > >> >> used for
> >> > >> >> >> > > > > Streaming.
> >> > >> >> >> > > > >
> >> > >> >> >> > > > >  And for streaming jobs, I'm wondering if there is any
> >> > >> >> situation we need to
> >> > >> >> >> > > > > > rescale the TM resources of a flink cluster at first
> >> and
> >> > >> then
> >> > >> >> the
> >> > >> >> >> > > > > adaptive
> >> > >> >> >> > > > > > scheduler will rescale the per-vertex
> >> ResourceProfiles
> >> > >> >> accordingly.
> >> > >> >> >> > > > > >
> >> > >> >> >> > > > >
> >> > >> >> >> > > > > We plan on adding support for the ResourceProfiles
> >> (dynamic
> >> > >> slot
> >> > >> >> >> > > > > allocation) as the next step. Again we won't make any
> >> > >> >> assumptions about the
> >> > >> >> >> > > > > used Resource Manager. In other words, this effort
> >> ends by
> >> > >> >> declaring
> >> > >> >> >> > > > > desired resources to the Resource Manager.
> >> > >> >> >> > > > >
> >> > >> >> >> > > > > Does that make sense?
> >> > >> >> >> > > > >
> >> > >> >> >> > > > > @Matthias
> >> > >> >> >> > > > >
> >> > >> >> >> > > > > We've done another pass on the proposed API and
> >> currently
> >> > >> lean
> >> > >> >> towards
> >> > >> >> >> > > > > having an idempotent PUT API.
> >> > >> >> >> > > > > - We don't care too much about multiple writers'
> >> scenarios
> >> > >> in
> >> > >> >> terms of who
> >> > >> >> >> > > > > can write an authoritative payload; this is up to the
> >> user
> >> > >> of
> >> > >> >> the API to
> >> > >> >> >> > > > > figure out
> >> > >> >> >> > > > > - It's indeed tricky to achieve atomicity with PATCH
> >> API;
> >> > >> >> switching to PUT
> >> > >> >> >> > > > > API seems to do the trick
> >> > >> >> >> > > > > - We won't allow partial "payloads" anymore, meaning
> >> you
> >> > >> need
> >> > >> >> to define
> >> > >> >> >> > > > > requirements for all vertices in the JobGraph; This is
> >> > >> >> completely fine for
> >> > >> >> >> > > > > the programmatic workflows. For DEBUG / DEMO purposes,
> >> you
> >> > >> can
> >> > >> >> use the GET
> >> > >> >> >> > > > > endpoint and tweak the response to avoid writing the
> >> whole
> >> > >> >> payload by hand.
> >> > >> >> >> > > > >
> >> > >> >> >> > > > > WDYT?
> >> > >> >> >> > > > >
> >> > >> >> >> > > > >
> >> > >> >> >> > > > > Best,
> >> > >> >> >> > > > > D.
> >> > >> >> >> > > > >
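The full-payload PUT semantics described above can be sketched in a few lines. This is a hedged illustration only: the JSON field names ("jobVertices", "vertexId", "lowerBound", "upperBound") and vertex IDs are assumptions for the example, not the final FLIP-291 wire format.

```python
# Sketch: an idempotent PUT must cover every vertex in the JobGraph, so a
# client-side check can reject partial payloads and malformed ranges early.
# Field names and IDs below are hypothetical, for illustration only.

def validate_put_payload(payload, all_vertex_ids):
    """Reject partial payloads and invalid parallelism ranges."""
    declared = {v["vertexId"] for v in payload["jobVertices"]}
    missing = set(all_vertex_ids) - declared
    if missing:
        raise ValueError(f"payload must cover all vertices; missing: {sorted(missing)}")
    for v in payload["jobVertices"]:
        lower = v["parallelism"]["lowerBound"]
        upper = v["parallelism"]["upperBound"]
        if not 1 <= lower <= upper:
            raise ValueError(f"invalid range [{lower}, {upper}] for {v['vertexId']}")

payload = {
    "jobVertices": [
        {"vertexId": "source-1", "parallelism": {"lowerBound": 1, "upperBound": 4}},
        {"vertexId": "map-1", "parallelism": {"lowerBound": 2, "upperBound": 8}},
    ]
}
validate_put_payload(payload, ["source-1", "map-1"])  # full payload passes
```

For debugging or demos, one could fetch the current requirements via the GET endpoint, tweak a single vertex, and PUT the whole structure back, as suggested above.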
> >> > >> >> >> > > > > On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <
> >> > >> >> xiangyu0xf@gmail.com>
> >> > >> >> >> > > > > wrote:
> >> > >> >> >> > > > >
> >> > >> >> >> > > > > > Hi David,
> >> > >> >> >> > > > > >
> >> > >> >> >> > > > > > Thanks for creating this flip. I think this work it
> >> is
> >> > >> very
> >> > >> >> useful,
> >> > >> >> >> > > > > > especially in autoscaling scenario.  I would like to
> >> share
> >> > >> >> some questions
> >> > >> >> >> > > > > > from my view.
> >> > >> >> >> > > > > >
> >> > >> >> >> > > > > > 1, Is it possible for this REST API to declare TM
> >> > >> resources
> >> > >> >> in the
> >> > >> >> >> > > > > future?
> >> > >> >> >> > > > > > I'm asking because we are building the autoscaling
> >> feature
> >> > >> >> for Flink OLAP
> >> > >> >> >> > > > > > Session Cluster in ByteDance. We need to rescale the
> >> > >> >> cluster's resource
> >> > >> >> >> > > > > on
> >> > >> >> >> > > > > > TM level instead of Job level. It would be very
> >> helpful
> >> > >> if we
> >> > >> >> have a REST
> >> > >> >> >> > > > > > API for our external Autoscaling service to use.
> >> > >> >> >> > > > > >
> >> > >> >> >> > > > > > 2, And for streaming jobs, I'm wondering if there is
> >> any
> >> > >> >> situation we
> >> > >> >> >> > > > > need
> >> > >> >> >> > > > > > to rescale the TM resources of a flink cluster at
> >> first
> >> > >> and
> >> > >> >> then the
> >> > >> >> >> > > > > > adaptive scheduler will rescale the per-vertex
> >> > >> >> ResourceProfiles
> >> > >> >> >> > > > > > accordingly.
> >> > >> >> >> > > > > >
> >> > >> >> >> > > > > > best.
> >> > >> >> >> > > > > > Xiangyu
> >> > >> >> >> > > > > >
> >> > >> >> >> > > > > > > Shammon FY <zj...@gmail.com> wrote on Thu, Feb 9, 2023 at 11:31:
> >> > >> >> >> > > > > >
> >> > >> >> >> > > > > > > Hi David
> >> > >> >> >> > > > > > >
> >> > >> >> >> > > > > > > Thanks for your answer.
> >> > >> >> >> > > > > > >
> >> > >> >> >> > > > > > > > Can you elaborate more about how you'd intend to use
> >> > >> >> >> > > > > > > > the endpoint? I think we can ultimately introduce a way
> >> > >> >> >> > > > > > > > of re-declaring "per-vertex defaults," but I'd like to
> >> > >> >> >> > > > > > > > understand the use case a bit more first.
> >> > >> >> >> > > > > > >
> >> > >> >> >> > > > > > > For this issue, I mainly consider the consistency
> >> of
> >> > >> user
> >> > >> >> configuration
> >> > >> >> >> > > > > > and
> >> > >> >> >> > > > > > > job runtime. For sql jobs, users usually set
> >> specific
> >> > >> >> parallelism for
> >> > >> >> >> > > > > > > source and sink, and set a global parallelism for
> >> other
> >> > >> >> operators.
> >> > >> >> >> > > > > These
> >> > >> >> >> > > > > > > config items are stored in a config file. For some
> >> > >> >> high-priority jobs,
> >> > >> >> >> > > > > > > users may want to manage them manually.
> >> > >> >> >> > > > > > > 1. When users need to scale the parallelism, they
> >> should
> >> > >> >> update the
> >> > >> >> >> > > > > > config
> >> > >> >> >> > > > > > > file and restart flink job, which may take a long
> >> time.
> >> > >> >> >> > > > > > > 2. After providing the REST API, users can just
> >> send a
> >> > >> >> request to the
> >> > >> >> >> > > > > job
> >> > >> >> >> > > > > > > via REST API quickly after updating the config
> >> file.
> >> > >> >> >> > > > > > > The configuration in the running job and config
> >> file
> >> > >> should
> >> > >> >> be the
> >> > >> >> >> > > > > same.
> >> > >> >> >> > > > > > > What do you think of this?
> >> > >> >> >> > > > > > >
> >> > >> >> >> > > > > > > best.
> >> > >> >> >> > > > > > > Shammon
> >> > >> >> >> > > > > > >
> >> > >> >> >> > > > > > >
> >> > >> >> >> > > > > > >
> >> > >> >> >> > > > > > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <
> >> > >> >> david.moravek@gmail.com>
> >> > >> >> >> > > > > > > wrote:
> >> > >> >> >> > > > > > >
> >> > >> >> >> > > > > > > > Hi everyone,
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > Let's try to answer the questions one by one.
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > *@ConradJam*
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > > when the number of "slots" is insufficient, can we
> >> > >> >> >> > > > > > > > > stop users from rescaling or throw something to tell
> >> > >> >> >> > > > > > > > > the user "not enough available slots to upgrade,
> >> > >> >> >> > > > > > > > > please check your available slots"?
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > The main property of AdaptiveScheduler is that
> >> it can
> >> > >> >> adapt to
> >> > >> >> >> > > > > > "available
> >> > >> >> >> > > > > > > > resources," which means you're still able to make
> >> > >> >> progress even
> >> > >> >> >> > > > > though
> >> > >> >> >> > > > > > > you
> >> > >> >> >> > > > > > > > didn't get all the slots you've asked for. Let's
> >> break
> >> > >> >> down the pros
> >> > >> >> >> > > > > > and
> >> > >> >> >> > > > > > > > cons of this property.
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > - (plus) If you lose a TM for some reason, you
> >> can
> >> > >> still
> >> > >> >> recover even
> >> > >> >> >> > > > > > if
> >> > >> >> >> > > > > > > it
> >> > >> >> >> > > > > > > > doesn't come back. We still need to give it some
> >> time
> >> > >> to
> >> > >> >> eliminate
> >> > >> >> >> > > > > > > > unnecessary rescaling, which can be controlled by
> >> > >> setting
> >> > >> >> >> > > > > > > > "resource-stabilization-timeout."
> >> > >> >> >> > > > > > > > - (plus) The resources can arrive with a
> >> significant
> >> > >> >> delay. For
> >> > >> >> >> > > > > > example,
> >> > >> >> >> > > > > > > > you're unable to spawn enough TMs on time because
> >> > >> you've
> >> > >> >> run out of
> >> > >> >> >> > > > > > > > resources in your k8s cluster, and you need to
> >> wait
> >> > >> for
> >> > >> >> the cluster
> >> > >> >> >> > > > > > auto
> >> > >> >> >> > > > > > > > scaler to kick in and add new nodes to the
> >> cluster. In
> >> > >> >> this scenario,
> >> > >> >> >> > > > > > > > you'll be able to start making progress faster,
> >> at the
> >> > >> >> cost of
> >> > >> >> >> > > > > multiple
> >> > >> >> >> > > > > > > > rescalings (once the remaining resources arrive).
> >> > >> >> >> > > > > > > > - (plus) This plays well with the declarative
> >> manner
> >> > >> of
> >> > >> >> today's
> >> > >> >> >> > > > > > > > infrastructure. For example, you tell k8s that
> >> you
> >> > >> need
> >> > >> >> 10 TMs, and
> >> > >> >> >> > > > > > > you'll
> >> > >> >> >> > > > > > > > eventually get them.
> >> > >> >> >> > > > > > > > - (minus) In the case of large state jobs, the
> >> cost of
> >> > >> >> multiple
> >> > >> >> >> > > > > > > rescalings
> >> > >> >> >> > > > > > > > might outweigh the above.
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > We've already touched on the solution to this
> >> problem
> >> > >> on
> >> > >> >> the FLIP.
> >> > >> >> >> > > > > > Please
> >> > >> >> >> > > > > > > > notice the parallelism knob being a range with a
> >> lower
> >> > >> >> and upper
> >> > >> >> >> > > > > bound.
> >> > >> >> >> > > > > > > > Setting both the lower and upper bound to the
> >> same
> >> > >> value
> >> > >> >> could give
> >> > >> >> >> > > > > the
> >> > >> >> >> > > > > > > > behavior you're describing at the cost of giving
> >> up
> >> > >> some
> >> > >> >> properties
> >> > >> >> >> > > > > > that
> >> > >> >> >> > > > > > > AS
> >> > >> >> >> > > > > > > > gives you (you'd be falling back to the
> >> > >> >> DefaultScheduler's behavior).
> >> > >> >> >> > > > > > > >
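The range semantics described above can be condensed into one decision rule. This is an illustration of the behavior as stated in the thread, not Flink's actual AdaptiveScheduler code.

```python
# Sketch: the scheduler runs as soon as the declared lower bound is
# satisfiable and adapts within [lower, upper]; setting lower == upper
# recovers the all-or-nothing behavior of the DefaultScheduler.

def decide_parallelism(available_slots, lower, upper):
    if available_slots < lower:
        return None  # not enough resources yet; keep waiting (or time out)
    return min(available_slots, upper)  # adapt within the declared range
```

With a wide range the job makes progress on partial resources and rescales as more slots arrive; with a collapsed range it waits for the exact amount.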
> >> > >> >> >> > > > > > > > > when the user updates the job-vertex parallelism, I
> >> > >> >> >> > > > > > > > > want to have an interface to query the current status
> >> > >> >> >> > > > > > > > > of the parallelism update, so that the user or
> >> > >> >> >> > > > > > > > > program can understand the current status
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > This is a misunderstanding. We're not
> >> introducing the
> >> > >> >> RESCALE
> >> > >> >> >> > > > > endpoint.
> >> > >> >> >> > > > > > > > This endpoint allows you to re-declare the
> >> resources
> >> > >> >> needed to run
> >> > >> >> >> > > > > the
> >> > >> >> >> > > > > > > job.
> >> > >> >> >> > > > > > > > Once you reach the desired resources (you get
> >> more
> >> > >> >> resources than the
> >> > >> >> >> > > > > > > lower
> >> > >> >> >> > > > > > > > bound defines), your job will run.
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > We can expose a similar endpoint to "resource
> >> > >> >> requirements" to give
> >> > >> >> >> > > > > you
> >> > >> >> >> > > > > > > an
> >> > >> >> >> > > > > > > > overview of the resources the vertices already
> >> have.
> >> > >> You
> >> > >> >> can already
> >> > >> >> >> > > > > > get
> >> > >> >> >> > > > > > > > this from the REST API, so exposing this in yet
> >> > >> another
> >> > >> >> way should be
> >> > >> >> >> > > > > > > > considered carefully.
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > *@Matthias*
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > I'm wondering whether it makes sense to add some
> >> kind
> >> > >> of
> >> > >> >> resource ID
> >> > >> >> >> > > > > to
> >> > >> >> >> > > > > > > the
> >> > >> >> >> > > > > > > > > REST API.
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > That's a good question. I want to think about
> >> that and
> >> > >> >> get back to
> >> > >> >> >> > > > > the
> >> > >> >> >> > > > > > > > question later. My main struggle when thinking
> >> about
> >> > >> this
> >> > >> >> is, "if
> >> > >> >> >> > > > > this
> >> > >> >> >> > > > > > > > would be an idempotent POST endpoint," would it
> >> be any
> >> > >> >> different?
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > How often do we allow resource requirements to be
> >> > >> changed?
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > There shall be no rate limiting on the FLINK
> >> side. If
> >> > >> >> this is
> >> > >> >> >> > > > > something
> >> > >> >> >> > > > > > > > your environment needs, you can achieve it on a
> >> > >> different
> >> > >> >> layer ("we
> >> > >> >> >> > > > > > > can't
> >> > >> >> >> > > > > > > > have Flink do everything").
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > Versioning the JobGraph in the JobGraphStore
> >> rather
> >> > >> than
> >> > >> >> overwriting
> >> > >> >> >> > > > > it
> >> > >> >> >> > > > > > > > > might be an idea.
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > This sounds interesting since it would be closer
> >> to
> >> > >> the
> >> > >> >> JobGraph
> >> > >> >> >> > > > > being
> >> > >> >> >> > > > > > > > immutable. The main problem I see here is that
> >> this
> >> > >> would
> >> > >> >> introduce a
> >> > >> >> >> > > > > > > > backward-incompatible change, so it might be a topic
> >> > >> >> >> > > > > > > > for a follow-up FLIP.
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > I'm just wondering whether we bundle two things
> >> > >> together
> >> > >> >> that are
> >> > >> >> >> > > > > > > actually
> >> > >> >> >> > > > > > > > > separate
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > Yup, this is how we think about it as well. The
> >> main
> >> > >> >> question is,
> >> > >> >> >> > > > > "who
> >> > >> >> >> > > > > > > > should be responsible for bookkeeping 1) the
> >> JobGraph
> >> > >> and
> >> > >> >> 2) the
> >> > >> >> >> > > > > > > > JobResourceRequirements". The JobMaster would be
> >> the
> >> > >> >> right place for
> >> > >> >> >> > > > > > > both,
> >> > >> >> >> > > > > > > > but it's currently not the case, and we're
> >> tightly
> >> > >> >> coupling the
> >> > >> >> >> > > > > > > dispatcher
> >> > >> >> >> > > > > > > > with the JobMaster.
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > Initially, we tried to introduce a separate HA
> >> > >> component
> >> > >> >> in JobMaster
> >> > >> >> >> > > > > > for
> >> > >> >> >> > > > > > > > bookkeeping the JobResourceRequirements, but that
> >> > >> proved
> >> > >> >> to be a more
> >> > >> >> >> > > > > > > > significant effort adding additional mess to the
> >> > >> already
> >> > >> >> messy HA
> >> > >> >> >> > > > > > > > ecosystem. Another approach we've discussed was
> >> > >> mutating
> >> > >> >> the JobGraph
> >> > >> >> >> > > > > > and
> >> > >> >> >> > > > > > > > setting JRR into the JobGraph structure itself.
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > The middle ground for keeping this effort
> >> reasonably
> >> > >> >> sized and not
> >> > >> >> >> > > > > > > > violating "we want to keep JG immutable" too
> >> much is
> >> > >> >> keeping the
> >> > >> >> >> > > > > > > > JobResourceRequirements separate as an internal
> >> config
> >> > >> >> option in
> >> > >> >> >> > > > > > > JobGraph's
> >> > >> >> >> > > > > > > > configuration.
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > We ultimately need to rethink the tight coupling
> >> of
> >> > >> >> Dispatcher and
> >> > >> >> >> > > > > > > > JobMaster, but it needs to be a separate effort.
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > ...also considering the amount of data that can
> >> be
> >> > >> stored
> >> > >> >> in a
> >> > >> >> >> > > > > > > > > ConfigMap/ZooKeeper node if versioning the
> >> resource
> >> > >> >> requirement
> >> > >> >> >> > > > > > change
> >> > >> >> >> > > > > > > as
> >> > >> >> >> > > > > > > > > proposed in my previous item is an option for
> >> us.
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > AFAIK we're only storing pointers to the S3
> >> objects
> >> > >> in HA
> >> > >> >> metadata,
> >> > >> >> >> > > > > so
> >> > >> >> >> > > > > > we
> >> > >> >> >> > > > > > > > should be okay with having larger structures for
> >> now.
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > Updating the JobGraphStore means adding more
> >> requests
> >> > >> to
> >> > >> >> the HA
> >> > >> >> >> > > > > backend
> >> > >> >> >> > > > > > > > API.
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > It's fine unless you intend to override the
> >> resource
> >> > >> >> requirements a
> >> > >> >> >> > > > > few
> >> > >> >> >> > > > > > > > times per second.
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > *@Shammon*
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > How about adding some more information such as
> >> vertex
> >> > >> type
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > Since it was intended as a "debug" endpoint, it
> >> makes
> >> > >> >> complete sense!
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > >  For sql jobs, we always use a unified
> >> parallelism for
> >> > >> >> most vertices.
> >> > >> >> >> > > > > > Can
> >> > >> >> >> > > > > > > > > we provide them with a more convenient setting
> >> > >> method
> >> > >> >> instead of
> >> > >> >> >> > > > > each
> >> > >> >> >> > > > > > > > one?
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > I completely sympathize with this. The main thoughts
> >> when
> >> > >> >> designing the API
> >> > >> >> >> > > > > > > were:
> >> > >> >> >> > > > > > > > - We want to keep it clean and easy to
> >> understand.
> >> > >> >> >> > > > > > > > - Global parallelism can be modeled using
> >> per-vertex
> >> > >> >> parallelism but
> >> > >> >> >> > > > > > not
> >> > >> >> >> > > > > > > > the other way around.
> >> > >> >> >> > > > > > > > - The API will be used by external tooling
> >> (operator,
> >> > >> >> auto scaler).
> >> > >> >> >> > > > > > > >
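The point above that global parallelism can be modeled with per-vertex settings (but not the other way around) can be sketched as a small expansion step in external tooling. The payload shape below is an assumption for illustration, not the final API.

```python
# Sketch: expand a single "global" parallelism into the per-vertex form the
# API expects, with optional per-vertex overrides (e.g. for sources/sinks).

def expand_global_parallelism(vertex_ids, global_parallelism, overrides=None):
    overrides = overrides or {}
    return {
        "jobVertices": [
            {"vertexId": vid,
             "parallelism": overrides.get(vid, global_parallelism)}
            for vid in vertex_ids
        ]
    }
```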
> >> > >> >> >> > > > > > > > Can you elaborate more about how you'd intend to
> >> use
> >> > >> the
> >> > >> >> endpoint? I
> >> > >> >> >> > > > > > > think
> >> > >> >> >> > > > > > > > we can ultimately introduce a way of re-declaring
> >> > >> >> "per-vertex
> >> > >> >> >> > > > > > defaults,"
> >> > >> >> >> > > > > > > > but I'd like to understand the use case a bit more
> >> > >> first.
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > *@Weijie*
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > What is the default value here (based on what
> >> > >> >> configuration), or just
> >> > >> >> >> > > > > > > > > infinite?
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > Currently, for the lower bound, it's always one,
> >> and
> >> > >> for
> >> > >> >> the upper
> >> > >> >> >> > > > > > bound,
> >> > >> >> >> > > > > > > > it's either parallelism (if defined) or the
> >> > >> >> maxParallelism of the
> >> > >> >> >> > > > > > vertex
> >> > >> >> >> > > > > > > in
> >> > >> >> >> > > > > > > > JobGraph. This question might be another signal
> >> for
> >> > >> >> making the
> >> > >> >> >> > > > > defaults
> >> > >> >> >> > > > > > > > explicit (see the answer to Shammon's question
> >> above).
> >> > >> >> >> > > > > > > >
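The default-bounds rule stated above can be written out explicitly. This is a sketch of the rule as described in the thread (lower bound always one; upper bound falls back to the configured parallelism, else maxParallelism; -1 resets to the default), not the actual implementation.

```python
# Sketch: resolving the effective parallelism bounds for one vertex.

def default_upper_bound(parallelism, max_parallelism):
    # the configured parallelism wins if set; otherwise fall back to
    # the vertex's maxParallelism from the JobGraph
    return parallelism if parallelism is not None else max_parallelism

def resolve_bounds(requested_upper, parallelism, max_parallelism):
    lower = 1  # the lower bound currently always defaults to one
    upper = (default_upper_bound(parallelism, max_parallelism)
             if requested_upper == -1 else requested_upper)
    return lower, upper
```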
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > Thanks, everyone, for your initial thoughts!
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > Best,
> >> > >> >> >> > > > > > > > D.
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <
> >> > >> >> guoweijiereswqa@gmail.com
> >> > >> >> >> > > > > >
> >> > >> >> >> > > > > > > > wrote:
> >> > >> >> >> > > > > > > >
> >> > >> >> >> > > > > > > > > Thanks David for driving this. This is a very
> >> > >> valuable
> >> > >> >> work,
> >> > >> >> >> > > > > > especially
> >> > >> >> >> > > > > > > > for
> >> > >> >> >> > > > > > > > > cloud native environment.
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > > > >> How about adding some more information such
> >> as
> >> > >> >> vertex type
> >> > >> >> >> > > > > > > > > (SOURCE/MAP/JOIN, etc.) in the response of
> >> `get
> >> > >> jobs
> >> > >> >> >> > > > > > > > > resource-requirements`? For users, only
> >> vertex-id
> >> > >> may
> >> > >> >> be difficult
> >> > >> >> >> > > > > to
> >> > >> >> >> > > > > > > > > understand.
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > > > +1 for this suggestion, including the JobVertex's
> >> name
> >> > >> in
> >> > >> >> the response
> >> > >> >> >> > > > > > body
> >> > >> >> >> > > > > > > > is
> >> > >> >> >> > > > > > > > > more
> >> > >> >> >> > > > > > > > > user-friendly.
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > > > I saw this sentence in FLIP: "Setting the upper
> >> > >> bound
> >> > >> >> to -1 will
> >> > >> >> >> > > > > > reset
> >> > >> >> >> > > > > > > > the
> >> > >> >> >> > > > > > > > > value to the default setting."  What is the
> >> default
> >> > >> >> value here
> >> > >> >> >> > > > > (based
> >> > >> >> >> > > > > > > on
> >> > >> >> >> > > > > > > > > what configuration), or just infinite?
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > > > Best regards,
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > > > Weijie
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > > > Shammon FY <zj...@gmail.com> wrote on Mon, Feb 6,
> >> > >> >> >> > > > > > > > > 2023 at 18:06:
> >> > >> >> >> > > > > > > > >
> >> > >> >> >> > > > > > > > > > Hi David
> >> > >> >> >> > > > > > > > > >
> >> > >> >> >> > > > > > > > > > Thanks for initiating this discussion. I
> >> think
> >> > >> >> declaring job
> >> > >> >> >> > > > > > resource
> >> > >> >> >> > > > > > > > > > requirements by REST API is very valuable. I
> >> just
> >> > >> >> left some
> >> > >> >> >> > > > > > comments
> >> > >> >> >> > > > > > > as
> >> > >> >> >> > > > > > > > > > followed
> >> > >> >> >> > > > > > > > > >
> >> > >> >> >> > > > > > > > > > 1) How about adding some more information
> >> such as
> >> > >> >> vertex type
> >> > >> >> >> > > > > > > > > > (SOURCE/MAP/JOIN, etc.) in the response
> >> of `get
> >> > >> >> jobs
> >> > >> >> >> > > > > > > > > > resource-requirements`? For users, only
> >> vertex-id
> >> > >> may
> >> > >> >> be
> >> > >> >> >> > > > > difficult
> >> > >> >> >> > > > > > to
> >> > >> >> >> > > > > > > > > > understand.
> >> > >> >> >> > > > > > > > > >
> >> > >> >> >> > > > > > > > > > 2) For sql jobs, we always use a unified
> >> > >> parallelism
> >> > >> >> for most
> >> > >> >> >> > > > > > > vertices.
> >> > >> >> >> > > > > > > > > Can
> >> > >> >> >> > > > > > > > > > we provide them with a more convenient
> >> setting
> >> > >> method
> >> > >> >> instead of
> >> > >> >> >> > > > > > each
> >> > >> >> >> > > > > > > > > one?
> >> > >> >> >> > > > > > > > > >
> >> > >> >> >> > > > > > > > > >
> >> > >> >> >> > > > > > > > > > Best,
> >> > >> >> >> > > > > > > > > > Shammon
> >> > >> >> >> > > > > > > > > >
> >> > >> >> >> > > > > > > > > >
> >> > >> >> >> > > > > > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl
> >> <
> >> > >> >> >> > > > > > matthias.pohl@aiven.io
> >> > >> >> >> > > > > > > > > > .invalid>
> >> > >> >> >> > > > > > > > > > wrote:
> >> > >> >> >> > > > > > > > > >
> >> > >> >> >> > > > > > > > > > > Thanks David for creating this FLIP. It
> >> sounds
> >> > >> >> promising and
> >> > >> >> >> > > > > > useful
> >> > >> >> >> > > > > > > > to
> >> > >> >> >> > > > > > > > > > > have. Here are some thoughts from my side
> >> (some
> >> > >> of
> >> > >> >> them might
> >> > >> >> >> > > > > be
> >> > >> >> >> > > > > > > > > rather a
> >> > >> >> >> > > > > > > > > > > follow-up and not necessarily part of this
> >> > >> FLIP):
> >> > >> >> >> > > > > > > > > > > - I'm wondering whether it makes sense to
> >> add
> >> > >> some
> >> > >> >> kind of
> >> > >> >> >> > > > > > resource
> >> > >> >> >> > > > > > > > ID
> >> > >> >> >> > > > > > > > > to
> >> > >> >> >> > > > > > > > > > > the REST API. This would give Flink a tool
> >> to
> >> > >> >> verify the PATCH
> >> > >> >> >> > > > > > > > request
> >> > >> >> >> > > > > > > > > of
> >> > >> >> >> > > > > > > > > > > the external system in a compare-and-set
> >> kind of
> >> > >> >> manner. AFAIU,
> >> > >> >> >> > > > > > the
> >> > >> >> >> > > > > > > > > > process
> >> > >> >> >> > > > > > > > > > > requires the external system to retrieve
> >> the
> >> > >> >> resource
> >> > >> >> >> > > > > > requirements
> >> > >> >> >> > > > > > > > > first
> >> > >> >> >> > > > > > > > > > > (to retrieve the vertex IDs). A resource ID
> >> > >> <ABC>
> >> > >> >> would be sent
> >> > >> >> >> > > > > > > along
> >> > >> >> >> > > > > > > > > as
> >> > >> >> >> > > > > > > > > > a
> >> > >> >> >> > > > > > > > > > > unique identifier for the provided setup.
> >> It's
> >> > >> >> essentially the
> >> > >> >> >> > > > > > > > version
> >> > >> >> >> > > > > > > > > ID
> >> > >> >> >> > > > > > > > > > > of the currently deployed resource
> >> requirement
> >> > >> >> configuration.
> >> > >> >> >> > > > > > Flink
> >> > >> >> >> > > > > > > > > > doesn't
> >> > >> >> >> > > > > > > > > > > know whether the external system would use
> >> the
> >> > >> >> provided
> >> > >> >> >> > > > > > information
> >> > >> >> >> > > > > > > > in
> >> > >> >> >> > > > > > > > > > some
> >> > >> >> >> > > > > > > > > > > way to derive a new set of resource
> >> requirements
> >> > >> >> for this job.
> >> > >> >> >> > > > > > The
> >> > >> >> >> > > > > > > > > > > subsequent PATCH request with updated
> >> resource
> >> > >> >> requirements
> >> > >> >> >> > > > > would
> >> > >> >> >> > > > > > > > > include
> >> > >> >> >> > > > > > > > > > > the previously retrieved resource ID
> >> <ABC>. The
> >> > >> >> PATCH call
> >> > >> >> >> > > > > would
> >> > >> >> >> > > > > > > fail
> >> > >> >> >> > > > > > > > > if
> >> > >> >> >> > > > > > > > > > > there was a concurrent PATCH call in
> >> between
> >> > >> >> indicating to the
> >> > >> >> >> > > > > > > > external
> >> > >> >> >> > > > > > > > > > > system that the resource requirements were
> >> > >> >> concurrently
> >> > >> >> >> > > > > updated.
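The compare-and-set scheme sketched above (a resource ID returned by GET that a subsequent update must echo back) can be illustrated with a tiny in-memory store. The field name "resourceId" and the store shape are hypothetical.

```python
# Sketch: optimistic concurrency control for resource-requirement updates.
# An update presenting a stale version ID fails, signalling to the external
# system that the requirements were concurrently changed.

class RequirementsStore:
    def __init__(self, requirements):
        self._version = 0
        self._requirements = dict(requirements)

    def get(self):
        return {"resourceId": self._version,
                "requirements": dict(self._requirements)}

    def update(self, expected_resource_id, new_requirements):
        if expected_resource_id != self._version:
            raise RuntimeError("conflict: requirements were concurrently updated")
        self._requirements = dict(new_requirements)
        self._version += 1
```

Two writers that both read version N would race: the first update bumps the version to N+1 and the second, still presenting N, is rejected.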
> >> > >> >> >> > > > > > > > > > > - How often do we allow resource
> >> requirements
> >> > >> to be
> >> > >> >> changed?
> >> > >> >> >> > > > > That
> >> > >> >> >> > > > > > > > > > question
> >> > >> >> >> > > > > > > > > > > might make my previous comment on the
> >> resource
> >> > >> ID
> >> > >> >> obsolete
> >> > >> >> >> > > > > > because
> >> > >> >> >> > > > > > > we
> >> > >> >> >> > > > > > > > > > could
> >> > >> >> >> > > > > > > > > > > just make any PATCH call fail if there was
> >> a
> >> > >> >> resource
> >> > >> >> >> > > > > requirement
> >> > >> >> >> > > > > > > > > update
> >> > >> >> >> > > > > > > > > > > within a certain time frame before the
> >> request.
> >> > >> But
> >> > >> >> such a time
> >> > >> >> >> > > > > > > > period
> >> > >> >> >> > > > > > > > > is
> >> > >> >> >> > > > > > > > > > > something we might want to make
> >> configurable
> >> > >> then,
> >> > >> >> I guess.
> >> > >> >> >> > > > > > > > > > > - Versioning the JobGraph in the
> >> JobGraphStore
> >> > >> >> rather than
> >> > >> >> >> > > > > > > > overwriting
> >> > >> >> >> > > > > > > > > it
> >> > >> >> >> > > > > > > > > > > might be an idea. This would enable us to
> >> > >> provide
> >> > >> >> resource
> >> > >> >> >> > > > > > > > requirement
> >> > >> >> >> > > > > > > > > > > changes in the UI or through the REST API.
> >> It is
> >> > >> >> related to a
> >> > >> >> >> > > > > > > problem
> >> > >> >> >> > > > > > > > > > > around keeping track of the exception
> >> history
> >> > >> >> within the
> >> > >> >> >> > > > > > > > > > AdaptiveScheduler
> >> > >> >> >> > > > > > > > > > > and also having to consider multiple
> >> versions
> >> > >> of a
> >> > >> >> JobGraph.
> >> > >> >> >> > > > > But
> >> > >> >> >> > > > > > > for
> >> > >> >> >> > > > > > > > > that
> >> > >> >> >> > > > > > > > > > > one, we use the ExecutionGraphInfoStore
> >> right
> >> > >> now.
> >> > >> >> >> > > > > > > > > > > - Updating the JobGraph in the
> >> JobGraphStore
> >> > >> makes
> >> > >> >> sense. I'm
> >> > >> >> >> > > > > > just
> >> > >> >> >> > > > > > > > > > > wondering whether we bundle two things
> >> together
> >> > >> >> that are
> >> > >> >> >> > > > > actually
> >> > >> >> >> > > > > > > > > > separate:
> >> > >> >> >> > > > > > > > > > > The business logic and the execution
> >> > >> configuration
> >> > >> >> (the
> >> > >> >> >> > > > > resource
> >> > >> >> >> > > > > > > > > > > requirements). I'm aware that this is not a
> >> > >> flaw of
> the current FLIP but rather something that was not necessary to address in
> the past because the JobGraph was kind of static. I don't remember whether
> that was already discussed while working on the AdaptiveScheduler for
> FLIP-160 [1]. Maybe I'm missing some functionality here that requires us
> to have everything in one place. But it feels like updating the entire
> JobGraph for what is actually a "config change" is not reasonable. ...also
> considering the amount of data that can be stored in a ConfigMap/ZooKeeper
> node, if versioning the resource requirement change as proposed in my
> previous item is an option for us.
> - Updating the JobGraphStore means adding more requests to the HA backend
> API. There were some concerns shared in the discussion thread [2] for
> FLIP-270 [3] about pressuring the k8s API server with too many calls in
> the past. Even though that is more likely to be caused by checkpointing, I
> still wanted to bring it up. We're working on a standardized performance
> test right now to prepare going forward with FLIP-270 [3].
>
> Best,
> Matthias
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> [2] https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> [3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
>
> On Fri, Feb 3, 2023 at 10:31 AM ConradJam <jam.gzczy@gmail.com> wrote:
>
>> Hi David:
>>
>> Thank you for driving this FLIP, which helps reduce Flink downtime.
>>
>> For this FLIP, I would like to share a few ideas:
>>
>>    - When the number of "slots" is insufficient, can we stop the user's
>>    rescaling, or report something like "fewer available slots than
>>    required for the upgrade, please check your available slots"? Or we
>>    could have a request switch (true/false) to allow this behavior.
>>
>>    - When a user updates the job-vertex parallelism, I would like to
>>    have an interface to query the status of the current parallelism
>>    update, so that the user or a program can understand the current
>>    state. This also helps tools similar to the Flink K8s Operator [1]:
>>
>> {
>>   status: Failed
>>   reason: "fewer available slots than required for the upgrade, please
>>   check your available slots"
>> }
>>
>>    - *Pending*: the job has joined the upgrade queue and will be
>>    updated later
>>    - *Rescaling*: the job is rescaling; wait for it to finish
>>    - *Finished*: the update has finished
>>    - *Failed*: something went wrong, so the job cannot be upgraded
>>
>> I would like to add the above to the FLIP; what do you think?
>>
>> [1] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
>>
>> David Morávek <dm...@apache.org> wrote on Fri, Feb 3, 2023 at 16:42:
>>
>>> [...]
>>
>> --
>> Best
>>
>> ConradJam

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by David Morávek <dm...@apache.org>.
> I suppose we could further remove the min because it would always be
> safer to scale down if resources are not available than not to run at
> all [1].

Apart from what @Roman has already mentioned, there are still cases where
we're certain that there is no point in running the job with resources
lower than X; e.g., because the state is too large to be processed with a
parallelism of 1. This allows you not to waste resources if you're certain
that the job would go into a restart loop or wouldn't be able to checkpoint.

I believe that for most use cases, simply keeping the lower bound at 1 will
be sufficient.
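To make these semantics concrete, here is a minimal sketch (plain Python,
not actual Flink code; the function name and return values are made up for
illustration) of how a scheduler could act on a job's declared
[lower, upper] parallelism bounds:

```python
# Illustrative model only -- not Flink's actual implementation.
def decide(available_slots: int, lower: int, upper: int) -> str:
    """Return the scheduler action for a job declaring [lower, upper] bounds."""
    if lower < 1 or upper < lower:
        raise ValueError("bounds must satisfy 1 <= lower <= upper")
    if available_slots < lower:
        # Not enough resources to run at all (e.g. the state is too large
        # for the achievable parallelism) -> wait instead of entering a
        # doomed restart loop.
        return "wait-for-resources"
    # Otherwise run with as much parallelism as the upper bound allows.
    return "run@%d" % min(available_slots, upper)
```

With the lower bound kept at 1, the job only waits when no slots are
available at all; raising it keeps the job from starting in configurations
that are known to fail.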

> I saw that the minimum bound is currently not used in the code you posted
above [2]. Is that still planned?

Yes. We already allow setting the lower bound via API, but it's not
considered by the scheduler. I'll address this limitation in a separate
issue.
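For illustration, using the v1/v2 example from Max's mail quoted below, a
per-vertex requirements declaration might look roughly like this (the keys
and field names are hypothetical, not the FLIP's final JSON schema):

```python
# Hypothetical payload shape for the declarative requirements endpoint;
# the vertex keys and field names are made up for illustration.
requirements = {
    "v1": {"parallelism": {"lowerBound": 50, "upperBound": 70}},
    "v2": {"parallelism": {"lowerBound": 10, "upperBound": 20}},
}

# With slot sharing disabled, running every vertex at full capacity needs
# the sum of the upper bounds; the sum of the lower bounds is the minimum
# the job insists on before running at all.
max_slots = sum(v["parallelism"]["upperBound"] for v in requirements.values())
min_slots = sum(v["parallelism"]["lowerBound"] for v in requirements.values())
```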

> Note that originally we had assumed min == max but I think that would be
a less safe scaling approach because we would get stuck waiting for
resources when they are not available, e.g. k8s resource limits reached.

100% agreed; The above-mentioned knobs should allow you to balance the
trade-off.
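As a rough sketch of how the two knobs mentioned in my earlier mail
interact (paraphrased semantics in plain Python, not the scheduler's
actual code):

```python
# Paraphrased semantics, not actual Flink code: a running job rescales
# only once the parallelism gain reaches the configured
# "jobmanager.adaptive-scheduler.min-parallelism-increase" step, or when
# the new parallelism would hit the declared upper bound (the ceiling).
def should_rescale(current: int, achievable: int, upper: int,
                   min_increase: int) -> bool:
    target = min(achievable, upper)
    return target - current >= min_increase or target == upper

# Setting min_increase very high means the job effectively only ever
# rescales straight to the upper bound, i.e. resources are
# "pre-allocated" and intermediate rescales are skipped.
```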


Does that make sense?

Best,
D.



On Tue, Feb 28, 2023 at 1:14 PM Roman Khachatryan <ro...@apache.org> wrote:

> Hi,
>
> Thanks for the update, I think distinguishing the rescaling behaviour and
> the desired parallelism declaration is important.
>
> Having the ability to specify min parallelism might be useful in
> environments with multiple jobs: Scheduler will then have an option to stop
> the less suitable job.
> In other setups, where the job should not be stopped at all, the user can
> always set it to 0.
>
> Regards,
> Roman
>
>
> On Tue, Feb 28, 2023 at 12:58 PM Maximilian Michels <mx...@apache.org>
> wrote:
>
>> Hi David,
>>
>> Thanks for the update! We consider using the new declarative resource
>> API for autoscaling. Currently, we treat a scaling decision as a new
>> deployment which means surrendering all resources to Kubernetes and
>> subsequently reallocating them for the rescaled deployment. The
>> declarative resource management API is a great step forward because it
>> allows us to do faster and safer rescaling. Faster, because we can
>> continue to run while resources are pre-allocated which minimizes
>> downtime. Safer, because we can't get stuck when the desired resources
>> are not available.
>>
>> An example with two vertices and their respective parallelisms:
>>   v1: 50
>>   v2: 10
>> Let's assume slot sharing is disabled, so we need 60 task slots to run
>> the vertices.
>>
>> If the autoscaler was to decide to scale up v1 and v2, it could do so
>> in a safe way by using min/max configuration:
>>   v1: [min: 50, max: 70]
>>   v2: [min: 10, max: 20]
>> This would then need 90 task slots to run at max capacity.
>>
>> I suppose we could further remove the min because it would always be
>> safer to scale down if resources are not available than to not run at
>> all [1]. In fact, I saw that the minimum bound is currently not used
>> in the code you posted above [2]. Is that still planned?
>>
>> -Max
>>
>> PS: Note that originally we had assumed min == max but I think that
>> would be a less safe scaling approach because we would get stuck
>> waiting for resources when they are not available, e.g. k8s resource
>> limits reached.
>>
>> [1] However, there might be costs involved with executing the
>> rescaling, e.g. for using external storage like s3, especially without
>> local recovery.
>> [2]
>> https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9
>>
>> On Tue, Feb 28, 2023 at 9:33 AM David Morávek <dm...@apache.org> wrote:
>> >
>> > Hi Everyone,
>> >
>> > We had some more talks about the pre-allocation of resources with
>> > @Max, and here is the final state that we've converged to for now:
>> >
>> > The vital thing to note about the new API is that it's declarative,
>> > meaning we're declaring the desired state to which we want our job to
>> > converge. If, after the requirements update, the job no longer holds
>> > the desired resources (fewer resources than the lower bound), it will
>> > be canceled and transition back into the waiting-for-resources state.
>> >
>> > In some use cases, you might always want to rescale to the upper
>> > bound (this goes along the lines of "preallocating resources" and
>> > minimizing the number of rescales, which is especially useful with
>> > large state). This can be controlled by two knobs that already exist:
>> >
>> > 1) "jobmanager.adaptive-scheduler.min-parallelism-increase" - this
>> > defines the minimal parallelism increase step of a running job; we'll
>> > slightly change the semantics and trigger rescaling either once this
>> > condition is met or when you hit the ceiling; setting this to a high
>> > number ensures that you always rescale to the upper bound.
>> >
>> > 2) "jobmanager.adaptive-scheduler.resource-stabilization-timeout" -
>> > for new and already restarting jobs, we'll always respect this
>> > timeout, which allows you to wait for more resources even though you
>> > already have more resources than defined in the lower bound; again,
>> > in case we reach the ceiling (the upper bound), we'll transition into
>> > the executing state.
>> >
>> >
>> > We're still planning to dig deeper in this direction with other efforts,
>> > but this is already good enough and should allow us to move the FLIP
>> > forward.
>> >
>> > WDYT? Unless there are any objections to the above, I'd like to
>> > proceed to a vote.
>> >
>> > Best,
>> > D.
>> >
>> > On Thu, Feb 23, 2023 at 5:39 PM David Morávek <dm...@apache.org> wrote:
>> >
>> > > Hi Everyone,
>> > >
>> > > @John
>> > >
>> > > This is a problem that we've spent some time trying to crack; in the
>> end,
>> > > we've decided to go against doing any upgrades to JobGraphStore from
>> > > JobMaster to avoid having multiple writers that are guarded by
>> different
>> > > leader election lock (Dispatcher and JobMaster might live in a
>> different
>> > > process). The contract we've decided to choose instead is leveraging
>> the
>> > > idempotency of the endpoint and having the user of the API retry in
>> case
>> > > we're unable to persist new requirements in the JobGraphStore [1]. We
>> > > eventually need to move JobGraphStore out of the dispatcher, but
>> that's way
>> > > out of the scope of this FLIP. The solution is a deliberate
>> trade-off. The
>> > > worst scenario is that the Dispatcher fails over in between retries,
>> which
>> > > would simply rescale the job to meet the previous resource
>> requirements
>> > > (more extended unavailability of underlying HA storage would have
>> worse
>> > > consequences than this). Does that answer your question?
>> > >
>> > > @Matthias
>> > >
>> > > Good catch! I'm fixing it now, thanks!
>> > >
>> > > [1]
>> > >
>> https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9#diff-a4b690fb2c4975d25b05eb4161617af0d704a85ff7b1cad19d3c817c12f1e29cR1151
>> > >
>> > > Best,
>> > > D.
>> > >
>> > > On Tue, Feb 21, 2023 at 12:24 AM John Roesler <vv...@apache.org>
>> wrote:
>> > >
>> > >> Thanks for the FLIP, David!
>> > >>
>> > >> I just had one small question. IIUC, the REST API PUT request will go
>> > >> through the new DispatcherGateway method to be handled. Then, after
>> > >> validation, the dispatcher would call the new JobMasterGateway
>> method to
>> > >> actually update the job.
>> > >>
>> > >> Which component will write the updated JobGraph? I just wanted to
>> > >> make sure it's the JobMaster, because if it were the dispatcher,
>> > >> there could be a race condition with the async JobMaster method.
>> > >>
>> > >> Thanks!
>> > >> -John
>> > >>
>> > >> On Mon, Feb 20, 2023, at 07:34, Matthias Pohl wrote:
>> > >> > Thanks for your clarifications, David. I don't have any additional
>> major
>> > >> > points to add. One thing about the FLIP: The RPC layer API for
>> updating
>> > >> the
>> > >> > JRR returns a future with a JRR? I don't see value in returning a
>> JRR
>> > >> here
>> > >> > since it's an idempotent operation? Wouldn't it be enough to return
>> > >> > CompletableFuture<Void> here? Or am I missing something?
>> > >> >
>> > >> > Matthias
>> > >> >
>> > >> > On Mon, Feb 20, 2023 at 1:48 PM Maximilian Michels <mxm@apache.org> wrote:
>> > >> >
>> > >> >> Thanks David! If we could get the pre-allocation working as part
>> of
>> > >> >> the FLIP, that would be great.
>> > >> >>
>> > >> >> Concerning the downscale case, I agree this is a special case for
>> the
>> > >> >> (single-job) application mode where we could re-allocate slots in
>> a
>> > >> >> way that could leave entire task managers unoccupied which we
>> would
>> > >> >> then be able to release. The goal essentially is to reduce slot
>> > >> >> fragmentation on scale down by packing the slots efficiently. The
>> > >> >> easiest way to add this optimization when running in application
>> mode
>> > >> >> would be to drop as many task managers during the restart such
>> that
>> > >> >> NUM_REQUIRED_SLOTS >= NUM_AVAILABLE_SLOTS stays true. We can look
>> into
>> > >> >> this independently of the FLIP.
>> > >> >>
>> > >> >> Feel free to start the vote.
>> > >> >>
>> > >> >> -Max
>> > >> >>
>> > >> >> On Mon, Feb 20, 2023 at 9:10 AM David Morávek <dm...@apache.org>
>> wrote:
>> > >> >> >
>> > >> >> > Hi everyone,
>> > >> >> >
>> > >> >> > Thanks for the feedback! I've updated the FLIP to use an
>> > >> >> > idempotent PUT API instead of PATCH and to properly handle lower
>> > >> >> > bound settings, to support the "pre-allocation" of resources.
>> > >> >> >
>> > >> >> > @Max
>> > >> >> >
>> > >> >> > > How hard would it be to address this issue in the FLIP?
>> > >> >> >
>> > >> >> > I've included this in the FLIP. It might not be too hard to
>> implement
>> > >> >> this in the end.
>> > >> >> >
>> > >> >> > > B) drop as many superfluous task managers as needed
>> > >> >> >
>> > >> >> > I've intentionally left this part out for now because this
>> ultimately
>> > >> >> needs to be the responsibility of the Resource Manager. After
>> all, in
>> > >> the
>> > >> >> Session Cluster scenario, the Scheduler doesn't have the bigger
>> > >> picture of
>> > >> >> other tasks of other jobs running on those TMs. This will most
>> likely
>> > >> be a
>> > >> >> topic for another FLIP.
>> > >> >> >
>> > >> >> > WDYT? If there are no other questions or concerns, I'd like to
>> start
>> > >> the
>> > >> >> vote on Wednesday.
>> > >> >> >
>> > >> >> > Best,
>> > >> >> > D.
>> > >> >> >
>> > >> >> > On Wed, Feb 15, 2023 at 3:34 PM Maximilian Michels <
>> mxm@apache.org>
>> > >> >> wrote:
>> > >> >> >>
>> > >> >> >> I missed that the FLIP states:
>> > >> >> >>
>> > >> >> >> > Currently, even though we’d expose the lower bound for
>> clarity and
>> > >> >> API completeness, we won’t allow setting it to any other value
>> than one
>> > >> >> until we have full support throughout the stack.
>> > >> >> >>
>> > >> >> >> How hard would it be to address this issue in the FLIP?
>> > >> >> >>
>> > >> >> >> There is not much value to offer setting a lower bound which
>> won't
>> > >> be
>> > >> >> >> respected / throw an error when it is set. If we had support
>> for a
>> > >> >> >> lower bound, we could enforce a resource contract externally
>> via
>> > >> >> >> setting lowerBound == upperBound. That ties back to the
>> Rescale API
>> > >> >> >> discussion we had. I want to better understand what the major
>> > >> concerns
>> > >> >> >> would be around allowing this.
>> > >> >> >>
>> > >> >> >> Just to outline how I imagine the logic to work:
>> > >> >> >>
>> > >> >> >> A) The resource constraints are already met => Nothing changes
>> > >> >> >> B) More resources available than required => Cancel the job,
>> drop as
>> > >> >> >> many superfluous task managers as needed, restart the job
>> > >> >> >> C) Less resources available than required => Acquire new task
>> > >> >> >> managers, wait for them to register, cancel and restart the job
>> > >> >> >>
>> > >> >> >> I'm open to helping out with the implementation.
>> > >> >> >>
>> > >> >> >> -Max
>> > >> >> >>
>> > >> >> >> On Mon, Feb 13, 2023 at 7:45 PM Maximilian Michels <
>> mxm@apache.org>
>> > >> >> wrote:
>> > >> >> >> >
>> > >> >> >> > Based on further discussion I had with Chesnay on this PR
>> [1], I
>> > >> think
>> > >> >> >> > jobs would currently go into a restarting state after the
>> resource
>> > >> >> >> > requirements have changed. This wouldn't achieve what we had
>> in
>> > >> mind,
>> > >> >> >> > i.e. sticking to the old resource requirements until enough
>> slots
>> > >> are
>> > >> >> >> > available to fulfil the new resource requirements. So this
>> may
>> > >> not be
>> > >> >> >> > 100% what we need but it could be extended to do what we
>> want.
>> > >> >> >> >
>> > >> >> >> > -Max
>> > >> >> >> >
>> > >> >> >> > [1]
>> > >> https://github.com/apache/flink/pull/21908#discussion_r1104792362
>> > >> >> >> >
>> > >> >> >> > On Mon, Feb 13, 2023 at 7:16 PM Maximilian Michels <
>> > >> mxm@apache.org>
>> > >> >> wrote:
>> > >> >> >> > >
>> > >> >> >> > > Hi David,
>> > >> >> >> > >
>> > >> >> >> > > This is awesome! Great writeup and demo. This is pretty
>> much
>> > >> what we
>> > >> >> >> > > need for the autoscaler as part of the Flink Kubernetes
>> operator
>> > >> >> [1].
>> > >> >> >> > > Scaling Flink jobs effectively is hard but fortunately we
>> have
>> > >> >> solved
>> > >> >> >> > > the issue as part of the Flink Kubernetes operator. The
>> only
>> > >> >> critical
>> > >> >> >> > > piece we are missing is a better way to execute scaling
>> > >> decisions,
>> > >> >> as
>> > >> >> >> > > discussed in [2].
>> > >> >> >> > >
>> > >> >> >> > > Looking at your proposal, we would set lowerBound ==
>> upperBound
>> > >> for
>> > >> >> >> > > the parallelism because we want to fully determine the
>> > >> parallelism
>> > >> >> >> > > externally based on the scaling metrics. Does that sound
>> right?
>> > >> >> >> > >
>> > >> >> >> > > What is the timeline for these changes? Is there a JIRA?
>> > >> >> >> > >
>> > >> >> >> > > Cheers,
>> > >> >> >> > > Max
>> > >> >> >> > >
>> > >> >> >> > > [1]
>> > >> >>
>> > >>
>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
>> > >> >> >> > > [2]
>> > >> >> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
>> > >> >> >> > >
>> > >> >> >> > > On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu <
>> > >> xiangyu0xf@gmail.com>
>> > >> >> wrote:
>> > >> >> >> > > >
>> > >> >> >> > > > Hi David,
>> > >> >> >> > > >
>> > >> >> >> > > > Thanks for your reply.  I think your response totally
>> make
>> > >> >> sense.  This
>> > >> >> >> > > > flip targets on declaring required resource to
>> ResourceManager
>> > >> >> instead of
>> > >> >> >> > > > using  ResourceManager to add/remove TMs directly.
>> > >> >> >> > > >
>> > >> >> >> > > > Best,
>> > >> >> >> > > > Xiangyu
>> > >> >> >> > > >
>> > >> >> >> > > >
>> > >> >> >> > > >
>> > >> >> >> > > > David Morávek <da...@gmail.com> 于2023年2月13日周一
>> > >> 15:46写道:
>> > >> >> >> > > >
>> > >> >> >> > > > > Hi everyone,
>> > >> >> >> > > > >
>> > >> >> >> > > > > @Shammon
>> > >> >> >> > > > >
>> > >> >> >> > > > > I'm not entirely sure what "config file" you're
>> referring
>> > >> to.
>> > >> >> You can, of
>> > >> >> >> > > > > course, override the default parallelism in
>> > >> "flink-conf.yaml",
>> > >> >> but for
>> > >> >> >> > > > > sinks and sources, the parallelism needs to be tweaked
>> on
>> > >> the
>> > >> >> connector
>> > >> >> >> > > > > level ("WITH" statement).
>> > >> >> >> > > > >
>> > >> >> >> > > > > This is something that should be achieved with tooling
>> > >> around
>> > >> >> Flink. We
>> > >> >> >> > > > > want to provide an API on the lowest level that
>> generalizes
>> > >> >> well. Achieving
>> > >> >> >> > > > > what you're describing should be straightforward with
>> this
>> > >> API.
>> > >> >> >> > > > >
>> > >> >> >> > > > > @Xiangyu
>> > >> >> >> > > > >
>> > >> >> >> > > > > Is it possible for this REST API to declare TM
>> resources in
>> > >> the
>> > >> >> future?
>> > >> >> >> > > > >
>> > >> >> >> > > > >
>> > >> >> >> > > > > Would you like to add/remove TMs if you use an active
>> > >> Resource
>> > >> >> Manager?
>> > >> >> >> > > > > This would be out of the scope of this effort since it
>> > >> targets
>> > >> >> the
>> > >> >> >> > > > > scheduler component only (we make no assumptions about
>> the
>> > >> used
>> > >> >> Resource
>> > >> >> >> > > > > Manager). Also, the AdaptiveScheduler is only intended
>> to be
>> > >> >> used for
>> > >> >> >> > > > > Streaming.
>> > >> >> >> > > > >
>> > >> >> >> > > > >  And for streaming jobs, I'm wondering if there is any
>> > >> >> situation we need to
>> > >> >> >> > > > > > rescale the TM resources of a flink cluster at first
>> and
>> > >> then
>> > >> >> the
>> > >> >> >> > > > > adaptive
>> > >> >> >> > > > > > scheduler will rescale the per-vertex
>> ResourceProfiles
>> > >> >> accordingly.
>> > >> >> >> > > > > >
>> > >> >> >> > > > >
>> > >> >> >> > > > > We plan on adding support for the ResourceProfiles
>> (dynamic
>> > >> slot
>> > >> >> >> > > > > allocation) as the next step. Again we won't make any
>> > >> >> assumptions about the
>> > >> >> >> > > > > used Resource Manager. In other words, this effort
>> ends by
>> > >> >> declaring
>> > >> >> >> > > > > desired resources to the Resource Manager.
>> > >> >> >> > > > >
>> > >> >> >> > > > > Does that make sense?
>> > >> >> >> > > > >
>> > >> >> >> > > > > @Matthias
>> > >> >> >> > > > >
>> > >> >> >> > > > > We've done another pass on the proposed API and
>> currently
>> > >> lean
>> > >> >> towards
>> > >> >> >> > > > > having an idempotent PUT API.
>> > >> >> >> > > > > - We don't care too much about multiple writers'
>> scenarios
>> > >> in
>> > >> >> terms of who
>> > >> >> >> > > > > can write an authoritative payload; this is up to the
>> user
>> > >> of
>> > >> >> the API to
>> > >> >> >> > > > > figure out
>> > >> >> >> > > > > - It's indeed tricky to achieve atomicity with PATCH
>> API;
>> > >> >> switching to PUT
>> > >> >> >> > > > > API seems to do the trick
>> > >> >> >> > > > > - We won't allow partial "payloads" anymore, meaning
>> you
>> > >> need
>> > >> >> to define
>> > >> >> >> > > > > requirements for all vertices in the JobGraph; This is
>> > >> >> completely fine for
>> > >> >> >> > > > > the programmatic workflows. For DEBUG / DEMO purposes,
>> you
>> > >> can
>> > >> >> use the GET
>> > >> >> >> > > > > endpoint and tweak the response to avoid writing the
>> whole
>> > >> >> payload by hand.
>> > >> >> >> > > > >
>> > >> >> >> > > > > WDYT?
>> > >> >> >> > > > >
>> > >> >> >> > > > >
>> > >> >> >> > > > > Best,
>> > >> >> >> > > > > D.
>> > >> >> >> > > > >
>> > >> >> >> > > > > On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <
>> > >> >> xiangyu0xf@gmail.com>
>> > >> >> >> > > > > wrote:
>> > >> >> >> > > > >
>> > >> >> >> > > > > > Hi David,
>> > >> >> >> > > > > >
>> > >> >> >> > > > > > Thanks for creating this flip. I think this work it
>> is
>> > >> very
>> > >> >> useful,
>> > >> >> >> > > > > > especially in autoscaling scenario.  I would like to
>> share
>> > >> >> some questions
>> > >> >> >> > > > > > from my view.
>> > >> >> >> > > > > >
>> > >> >> >> > > > > > 1, Is it possible for this REST API to declare TM
>> > >> resources
>> > >> >> in the
>> > >> >> >> > > > > future?
>> > >> >> >> > > > > > I'm asking because we are building the autoscaling
>> feature
>> > >> >> for Flink OLAP
>> > >> >> >> > > > > > Session Cluster in ByteDance. We need to rescale the
>> > >> >> cluster's resource
>> > >> >> >> > > > > on
>> > >> >> >> > > > > > TM level instead of Job level. It would be very
>> helpful
>> > >> if we
>> > >> >> have a REST
>> > >> >> >> > > > > > API for out external Autoscaling service to use.
>> > >> >> >> > > > > >
>> > >> >> >> > > > > > 2, And for streaming jobs, I'm wondering if there is
>> any
>> > >> >> situation we
>> > >> >> >> > > > > need
>> > >> >> >> > > > > > to rescale the TM resources of a flink cluster at
>> first
>> > >> and
>> > >> >> then the
>> > >> >> >> > > > > > adaptive scheduler will rescale the per-vertex
>> > >> >> ResourceProfiles
>> > >> >> >> > > > > > accordingly.
>> > >> >> >> > > > > >
>> > >> >> >> > > > > > best.
>> > >> >> >> > > > > > Xiangyu
>> > >> >> >> > > > > >
>> > >> >> >> > > > > > Shammon FY <zj...@gmail.com> wrote on Thu, Feb 9, 2023 at 11:31:
>> > >> >> >> > > > > >
>> > >> >> >> > > > > > > Hi David
>> > >> >> >> > > > > > >
>> > >> >> >> > > > > > > Thanks for your answer.
>> > >> >> >> > > > > > >
>> > >> >> >> > > > > > > > Can you elaborate more about how you'd intend to
>> use
>> > >> the
>> > >> >> endpoint? I
>> > >> >> >> > > > > > > think we can ultimately introduce a way of
>> re-declaring
>> > >> >> "per-vertex
>> > >> >> >> > > > > > > defaults," but I'd like to understand the use case a
>> bit
>> > >> more
>> > >> >> first.
>> > >> >> >> > > > > > >
>> > >> >> >> > > > > > > For this issue, I mainly consider the consistency
>> of
>> > >> user
>> > >> >> configuration
>> > >> >> >> > > > > > and
>> > >> >> >> > > > > > > job runtime. For SQL jobs, users usually set
>> specific
>> > >> >> parallelism for
>> > >> >> >> > > > > > > source and sink, and set a global parallelism for
>> other
>> > >> >> operators.
>> > >> >> >> > > > > These
>> > >> >> >> > > > > > > config items are stored in a config file. For some
>> > >> >> high-priority jobs,
>> > >> >> >> > > > > > > users may want to manage them manually.
>> > >> >> >> > > > > > > 1. When users need to scale the parallelism, they
>> should
>> > >> >> update the
>> > >> >> >> > > > > > config
>> > >> >> >> > > > > > > file and restart flink job, which may take a long
>> time.
>> > >> >> >> > > > > > > 2. After providing the REST API, users can just
>> send a
>> > >> >> request to the
>> > >> >> >> > > > > job
>> > >> >> >> > > > > > > via REST API quickly after updating the config
>> file.
>> > >> >> >> > > > > > > The configuration in the running job and config
>> file
>> > >> should
>> > >> >> be the
>> > >> >> >> > > > > same.
>> > >> >> >> > > > > > > What do you think of this?
>> > >> >> >> > > > > > >
>> > >> >> >> > > > > > > best.
>> > >> >> >> > > > > > > Shammon
>> > >> >> >> > > > > > >
>> > >> >> >> > > > > > >
>> > >> >> >> > > > > > >
>> > >> >> >> > > > > > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <
>> > >> >> david.moravek@gmail.com>
>> > >> >> >> > > > > > > wrote:
>> > >> >> >> > > > > > >
>> > >> >> >> > > > > > > > Hi everyone,
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > Let's try to answer the questions one by one.
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > *@ConradJam*
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > when the number of "slots" is insufficient, can we
>> > >> >> stop users
>> > >> >> >> > > > > > > rescaling
>> > >> >> >> > > > > > > > > or throw something to tell the user "fewer available
>> > >> slots
>> > >> >> to upgrade,
>> > >> >> >> > > > > > > please
>> > >> >> >> > > > > > > > > check out your available slots"?
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > The main property of AdaptiveScheduler is that
>> it can
>> > >> >> adapt to
>> > >> >> >> > > > > > "available
>> > >> >> >> > > > > > > > resources," which means you're still able to make
>> > >> >> progress even
>> > >> >> >> > > > > though
>> > >> >> >> > > > > > > you
>> > >> >> >> > > > > > > > didn't get all the slots you've asked for. Let's
>> break
>> > >> >> down the pros
>> > >> >> >> > > > > > and
>> > >> >> >> > > > > > > > cons of this property.
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > - (plus) If you lose a TM for some reason, you
>> can
>> > >> still
>> > >> >> recover even
>> > >> >> >> > > > > > if
>> > >> >> >> > > > > > > it
>> > >> >> >> > > > > > > > doesn't come back. We still need to give it some
>> time
>> > >> to
>> > >> >> eliminate
>> > >> >> >> > > > > > > > unnecessary rescaling, which can be controlled by
>> > >> setting
>> > >> >> >> > > > > > > > "resource-stabilization-timeout."
>> > >> >> >> > > > > > > > - (plus) The resources can arrive with a
>> significant
>> > >> >> delay. For
>> > >> >> >> > > > > > example,
>> > >> >> >> > > > > > > > you're unable to spawn enough TMs on time because
>> > >> you've
>> > >> >> run out of
>> > >> >> >> > > > > > > > resources in your k8s cluster, and you need to
>> wait
>> > >> for
>> > >> >> the cluster
>> > >> >> >> > > > > > auto
>> > >> >> >> > > > > > > > scaler to kick in and add new nodes to the
>> cluster. In
>> > >> >> this scenario,
>> > >> >> >> > > > > > > > you'll be able to start making progress faster,
>> at the
>> > >> >> cost of
>> > >> >> >> > > > > multiple
>> > >> >> >> > > > > > > > rescalings (once the remaining resources arrive).
>> > >> >> >> > > > > > > > - (plus) This plays well with the declarative
>> manner
>> > >> of
>> > >> >> today's
>> > >> >> >> > > > > > > > infrastructure. For example, you tell k8s that
>> you
>> > >> need
>> > >> >> 10 TMs, and
>> > >> >> >> > > > > > > you'll
>> > >> >> >> > > > > > > > eventually get them.
>> > >> >> >> > > > > > > > - (minus) In the case of large state jobs, the
>> cost of
>> > >> >> multiple
>> > >> >> >> > > > > > > rescalings
>> > >> >> >> > > > > > > > might outweigh the above.
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > We've already touched on the solution to this
>> problem
>> > >> in
>> > >> >> the FLIP.
>> > >> >> >> > > > > > Please
>> > >> >> >> > > > > > > > notice the parallelism knob being a range with a
>> lower
>> > >> >> and upper
>> > >> >> >> > > > > bound.
>> > >> >> >> > > > > > > > Setting both the lower and upper bound to the
>> same
>> > >> value
>> > >> >> could give
>> > >> >> >> > > > > the
>> > >> >> >> > > > > > > > behavior you're describing at the cost of giving
>> up
>> > >> some
>> > >> >> properties
>> > >> >> >> > > > > > that
>> > >> >> >> > > > > > > AS
>> > >> >> >> > > > > > > > gives you (you'd be falling back to the
>> > >> >> DefaultScheduler's behavior).
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > when a user upgrades job-vertex-parallelism, I want
>> to
>> > >> have
>> > >> >> an interface
>> > >> >> >> > > > > > to
>> > >> >> >> > > > > > > > > query the current update parallel execution
>> status,
>> > >> so
>> > >> >> that the
>> > >> >> >> > > > > user
>> > >> >> >> > > > > > or
>> > >> >> >> > > > > > > > > program can understand the current status
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > This is a misunderstanding. We're not
>> introducing the
>> > >> >> RESCALE
>> > >> >> >> > > > > endpoint.
>> > >> >> >> > > > > > > > This endpoint allows you to re-declare the
>> resources
>> > >> >> needed to run
>> > >> >> >> > > > > the
>> > >> >> >> > > > > > > job.
>> > >> >> >> > > > > > > > Once you reach the desired resources (you get
>> more
>> > >> >> resources than the
>> > >> >> >> > > > > > > lower
>> > >> >> >> > > > > > > > bound defines), your job will run.
>> > >> >> >> > > > > > > >
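To make the "declare requirements, then run once the lower bound is met" idea concrete, here is a small sketch of such a declaration and its invariants. The field names and vertex IDs below are illustrative assumptions based on this thread, not the FLIP's final REST API:

```python
# Hypothetical sketch of a per-vertex resource-requirements declaration.
# Field names ("lowerBound", "upperBound") and vertex IDs are assumptions,
# not the final API shape.

def validate_requirements(requirements: dict) -> None:
    """Reject malformed per-vertex parallelism ranges."""
    for vertex_id, bounds in requirements.items():
        lower, upper = bounds["lowerBound"], bounds["upperBound"]
        if lower < 1:
            raise ValueError(f"{vertex_id}: lower bound must be >= 1")
        if upper < lower:
            raise ValueError(f"{vertex_id}: upper bound must be >= lower bound")

# Two vertices with ranges the scheduler may navigate within; the job can
# start as soon as at least the lower bound of slots is available.
requirements = {
    "bc764cd8ddf7a0cff126f51c16239658": {"lowerBound": 1, "upperBound": 4},
    "0a448493b4782967b150582570326227": {"lowerBound": 2, "upperBound": 8},
}
validate_requirements(requirements)
```

Setting both bounds to the same value recovers the fixed-parallelism behavior discussed earlier in the thread.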
>> > >> >> >> > > > > > > > We can expose a similar endpoint to "resource
>> > >> >> requirements" to give
>> > >> >> >> > > > > you
>> > >> >> >> > > > > > > an
>> > >> >> >> > > > > > > > overview of the resources the vertices already
>> have.
>> > >> You
>> > >> >> can already
>> > >> >> >> > > > > > get
>> > >> >> >> > > > > > > > this from the REST API, so exposing this in yet
>> > >> another
>> > >> >> way should be
>> > >> >> >> > > > > > > > considered carefully.
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > *@Matthias*
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > I'm wondering whether it makes sense to add some
>> kind
>> > >> of
>> > >> >> resource ID
>> > >> >> >> > > > > to
>> > >> >> >> > > > > > > the
>> > >> >> >> > > > > > > > > REST API.
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > That's a good question. I want to think about
>> that and
>> > >> >> get back to
>> > >> >> >> > > > > the
>> > >> >> >> > > > > > > > question later. My main struggle when thinking
>> about
>> > >> this
>> > >> >> is, "if
>> > >> >> >> > > > > this
>> > >> >> >> > > > > > > > would be an idempotent POST endpoint," would it
>> be any
>> > >> >> different?
>> > >> >> >> > > > > > > >
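One way to picture the compare-and-set variant of an idempotent update being discussed here is a version token returned by GET and checked by PATCH. This is purely illustrative bookkeeping; none of the names below are Flink APIs:

```python
# Illustrative compare-and-set store for resource requirements, mirroring
# the "resource ID" idea from the thread. Not Flink code.

class RequirementsStore:
    def __init__(self):
        self._version = 0
        self._requirements = {}

    def get(self):
        # The caller receives the current version alongside the requirements.
        return self._version, dict(self._requirements)

    def patch(self, expected_version: int, requirements: dict) -> int:
        # The update fails if another update happened in between.
        if expected_version != self._version:
            raise RuntimeError("concurrent modification detected")
        self._requirements = dict(requirements)
        self._version += 1
        return self._version

store = RequirementsStore()
version, _ = store.get()
store.patch(version, {"v1": {"lowerBound": 1, "upperBound": 4}})
```

A plain idempotent POST would overwrite unconditionally; the version check is what lets an external system detect a concurrent update, which is the trade-off under discussion.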
>> > >> >> >> > > > > > > > How often do we allow resource requirements to be
>> > >> changed?
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > There shall be no rate limiting on the Flink
>> side. If
>> > >> >> this is
>> > >> >> >> > > > > something
>> > >> >> >> > > > > > > > your environment needs, you can achieve it on a
>> > >> different
>> > >> >> layer ("we
>> > >> >> >> > > > > > > can't
>> > >> >> >> > > > > > > > have Flink do everything").
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > Versioning the JobGraph in the JobGraphStore
>> rather
>> > >> than
>> > >> >> overwriting
>> > >> >> >> > > > > it
>> > >> >> >> > > > > > > > > might be an idea.
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > This sounds interesting since it would be closer
>> to
>> > >> the
>> > >> >> JobGraph
>> > >> >> >> > > > > being
>> > >> >> >> > > > > > > > immutable. The main problem I see here is that
>> this
>> > >> would
>> > >> >> introduce a
>> > >> >> >> > > > > > > > backward-incompatible change, so it might be a topic for a
>> > >> >> follow-up FLIP.
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > I'm just wondering whether we bundle two things
>> > >> together
>> > >> >> that are
>> > >> >> >> > > > > > > actually
>> > >> >> >> > > > > > > > > separate
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > Yup, this is how we think about it as well. The
>> main
>> > >> >> question is,
>> > >> >> >> > > > > "who
>> > >> >> >> > > > > > > > should be responsible for bookkeeping 1) the
>> JobGraph
>> > >> and
>> > >> >> 2) the
>> > >> >> >> > > > > > > > JobResourceRequirements". The JobMaster would be
>> the
>> > >> >> right place for
>> > >> >> >> > > > > > > both,
>> > >> >> >> > > > > > > > but it's currently not the case, and we're
>> tightly
>> > >> >> coupling the
>> > >> >> >> > > > > > > dispatcher
>> > >> >> >> > > > > > > > with the JobMaster.
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > Initially, we tried to introduce a separate HA
>> > >> component
>> > >> >> in JobMaster
>> > >> >> >> > > > > > for
>> > >> >> >> > > > > > > > bookkeeping the JobResourceRequirements, but that
>> > >> proved
>> > >> >> to be a more
>> > >> >> >> > > > > > > > significant effort adding additional mess to the
>> > >> already
>> > >> >> messy HA
>> > >> >> >> > > > > > > > ecosystem. Another approach we've discussed was
>> > >> mutating
>> > >> >> the JobGraph
>> > >> >> >> > > > > > and
>> > >> >> >> > > > > > > > setting JRR into the JobGraph structure itself.
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > The middle ground for keeping this effort
>> reasonably
>> > >> >> sized and not
>> > >> >> >> > > > > > > > violating "we want to keep JG immutable" too
>> much is
>> > >> >> keeping the
>> > >> >> >> > > > > > > > JobResourceRequirements separate as an internal
>> config
>> > >> >> option in
>> > >> >> >> > > > > > > JobGraph's
>> > >> >> >> > > > > > > > configuration.
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > We ultimately need to rethink the tight coupling
>> of
>> > >> >> Dispatcher and
>> > >> >> >> > > > > > > > JobMaster, but it needs to be a separate effort.
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > ...also considering the amount of data that can
>> be
>> > >> stored
>> > >> >> in a
>> > >> >> >> > > > > > > > > ConfigMap/ZooKeeper node if versioning the
>> resource
>> > >> >> requirement
>> > >> >> >> > > > > > change
>> > >> >> >> > > > > > > as
>> > >> >> >> > > > > > > > > proposed in my previous item is an option for
>> us.
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > AFAIK we're only storing pointers to the S3
>> objects
>> > >> in HA
>> > >> >> metadata,
>> > >> >> >> > > > > so
>> > >> >> >> > > > > > we
>> > >> >> >> > > > > > > > should be okay with having larger structures for
>> now.
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > Updating the JobGraphStore means adding more
>> requests
>> > >> to
>> > >> >> the HA
>> > >> >> >> > > > > backend
>> > >> >> >> > > > > > > > API.
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > It's fine unless you intend to override the
>> resource
>> > >> >> requirements a
>> > >> >> >> > > > > few
>> > >> >> >> > > > > > > > times per second.
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > *@Shammon*
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > How about adding some more information such as
>> vertex
>> > >> type
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > Since it was intended as a "debug" endpoint, it
>> makes
>> > >> >> complete sense!
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > >  For SQL jobs, we always use a unified parallelism for
>> parallelism for
>> > >> >> most vertices.
>> > >> >> >> > > > > > Can
>> > >> >> >> > > > > > > > > we provide them with a more convenient setting
>> > >> method
>> > >> >> instead of
>> > >> >> >> > > > > each
>> > >> >> >> > > > > > > > one?
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > I completely feel with this. The main thoughts
>> when
>> > >> >> designing the API
>> > >> >> >> > > > > > > were:
>> > >> >> >> > > > > > > > - We want to keep it clean and easy to
>> understand.
>> > >> >> >> > > > > > > > - Global parallelism can be modeled using
>> per-vertex
>> > >> >> parallelism but
>> > >> >> >> > > > > > not
>> > >> >> >> > > > > > > > the other way around.
>> > >> >> >> > > > > > > > - The API will be used by external tooling
>> (operator,
>> > >> >> auto scaler).
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > Can you elaborate more about how you'd intend to
>> use
>> > >> the
>> > >> >> endpoint? I
>> > >> >> >> > > > > > > think
>> > >> >> >> > > > > > > > we can ultimately introduce a way of re-declaring
>> > >> >> "per-vertex
>> > >> >> >> > > > > > defaults,"
>> > >> >> >> > > > > > > > but I'd like to understand the use case a bit more
>> > >> first.
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > *@Weijie*
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > What is the default value here (based on what
>> > >> >> configuration), or just
>> > >> >> >> > > > > > > > > infinite?
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > Currently, for the lower bound, it's always one,
>> and
>> > >> for
>> > >> >> the upper
>> > >> >> >> > > > > > bound,
>> > >> >> >> > > > > > > > it's either parallelism (if defined) or the
>> > >> >> maxParallelism of the
>> > >> >> >> > > > > > vertex
>> > >> >> >> > > > > > > in
>> > >> >> >> > > > > > > > JobGraph. This question might be another signal
>> for
>> > >> >> making the
>> > >> >> >> > > > > defaults
>> > >> >> >> > > > > > > > explicit (see the answer to Shammon's question
>> above).
>> > >> >> >> > > > > > > >
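The default-bounds rule described above (lower bound of one; upper bound from the configured parallelism, falling back to the vertex's maxParallelism) can be sketched as follows. The function name is illustrative only:

```python
# Sketch of the default-bounds rule: lower bound defaults to 1, and the
# upper bound to the vertex's configured parallelism, falling back to its
# maxParallelism when no parallelism is set. Names are illustrative.

def default_bounds(parallelism, max_parallelism):
    upper = parallelism if parallelism is not None else max_parallelism
    return 1, upper

assert default_bounds(4, 128) == (1, 4)      # explicit parallelism wins
assert default_bounds(None, 128) == (1, 128) # fall back to maxParallelism
```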
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > Thanks, everyone, for your initial thoughts!
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > Best,
>> > >> >> >> > > > > > > > D.
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <
>> > >> >> guoweijiereswqa@gmail.com
>> > >> >> >> > > > > >
>> > >> >> >> > > > > > > > wrote:
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > > > > Thanks David for driving this. This is a very
>> > >> valuable
>> > >> >> work,
>> > >> >> >> > > > > > especially
>> > >> >> >> > > > > > > > for
>> > >> >> >> > > > > > > > > cloud-native environments.
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > > > >> How about adding some more information such
>> as
>> > >> >> vertex type
>> > >> >> >> > > > > > > > > (SOURCE/MAP/JOIN, etc.) in the response of `get
>> `get
>> > >> jobs
>> > >> >> >> > > > > > > > > resource-requirements`? For users, only
>> vertex-id
>> > >> may
>> > >> >> be difficult
>> > >> >> >> > > > > to
>> > >> >> >> > > > > > > > > understand.
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > > > +1 for this suggestion, including jobvertex's
>> name
>> > >> in
>> > >> >> the response
>> > >> >> >> > > > > > body
>> > >> >> >> > > > > > > > is
>> > >> >> >> > > > > > > > > more
>> > >> >> >> > > > > > > > > user-friendly.
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > > > I saw this sentence in FLIP: "Setting the upper
>> > >> bound
>> > >> >> to -1 will
>> > >> >> >> > > > > > reset
>> > >> >> >> > > > > > > > the
>> > >> >> >> > > > > > > > > value to the default setting."  What is the
>> default
>> > >> >> value here
>> > >> >> >> > > > > (based
>> > >> >> >> > > > > > > on
>> > >> >> >> > > > > > > > > what configuration), or just infinite?
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > > > Best regards,
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > > > Weijie
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > > > Shammon FY <zj...@gmail.com> wrote on Mon, Feb 6,
>> > >> 2023 at 18:06:
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > > > > Hi David
>> > >> >> >> > > > > > > > > >
>> > >> >> >> > > > > > > > > > Thanks for initiating this discussion. I
>> think
>> > >> >> declaring job
>> > >> >> >> > > > > > resource
>> > >> >> >> > > > > > > > > > requirements by REST API is very valuable. I
>> just
>> > >> >> left some
>> > >> >> >> > > > > > comments
>> > >> >> >> > > > > > > as
>> > >> >> >> > > > > > > > > > followed
>> > >> >> >> > > > > > > > > >
>> > >> >> >> > > > > > > > > > 1) How about adding some more information
>> such as
>> > >> >> vertex type
>> > >> >> >> > > > > > > > > > (SOURCE/MAP/JOIN, etc.) in the response
>> of `get
>> > >> >> jobs
>> > >> >> >> > > > > > > > > > resource-requirements`? For users, only
>> vertex-id
>> > >> may
>> > >> >> be
>> > >> >> >> > > > > difficult
>> > >> >> >> > > > > > to
>> > >> >> >> > > > > > > > > > understand.
>> > >> >> >> > > > > > > > > >
>> > >> >> >> > > > > > > > > > 2) For SQL jobs, we always use a unified
>> > >> parallelism
>> > >> >> for most
>> > >> >> >> > > > > > > vertices.
>> > >> >> >> > > > > > > > > Can
>> > >> >> >> > > > > > > > > > we provide them with a more convenient
>> setting
>> > >> method
>> > >> >> instead of
>> > >> >> >> > > > > > each
>> > >> >> >> > > > > > > > > one?
>> > >> >> >> > > > > > > > > >
>> > >> >> >> > > > > > > > > >
>> > >> >> >> > > > > > > > > > Best,
>> > >> >> >> > > > > > > > > > Shammon
>> > >> >> >> > > > > > > > > >
>> > >> >> >> > > > > > > > > >
>> > >> >> >> > > > > > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl
>> <
>> > >> >> >> > > > > > matthias.pohl@aiven.io
>> > >> >> >> > > > > > > > > > .invalid>
>> > >> >> >> > > > > > > > > > wrote:
>> > >> >> >> > > > > > > > > >
>> > >> >> >> > > > > > > > > > > Thanks David for creating this FLIP. It
>> sounds
>> > >> >> promising and
>> > >> >> >> > > > > > useful
>> > >> >> >> > > > > > > > to
>> > >> >> >> > > > > > > > > > > have. Here are some thoughts from my side
>> (some
>> > >> of
>> > >> >> them might
>> > >> >> >> > > > > be
>> > >> >> >> > > > > > > > > rather a
>> > >> >> >> > > > > > > > > > > follow-up and not necessarily part of this
>> > >> FLIP):
>> > >> >> >> > > > > > > > > > > - I'm wondering whether it makes sense to
>> add
>> > >> some
>> > >> >> kind of
>> > >> >> >> > > > > > resource
>> > >> >> >> > > > > > > > ID
>> > >> >> >> > > > > > > > > to
>> > >> >> >> > > > > > > > > > > the REST API. This would give Flink a tool
>> to
>> > >> >> verify the PATCH
>> > >> >> >> > > > > > > > request
>> > >> >> >> > > > > > > > > of
>> > >> >> >> > > > > > > > > > > the external system in a compare-and-set
>> kind of
>> > >> >> manner. AFAIU,
>> > >> >> >> > > > > > the
>> > >> >> >> > > > > > > > > > process
>> > >> >> >> > > > > > > > > > > requires the external system to retrieve
>> the
>> > >> >> resource
>> > >> >> >> > > > > > requirements
>> > >> >> >> > > > > > > > > first
>> > >> >> >> > > > > > > > > > > (to retrieve the vertex IDs). A resource ID
>> > >> <ABC>
>> > >> >> would be sent
>> > >> >> >> > > > > > > along
>> > >> >> >> > > > > > > > > as
>> > >> >> >> > > > > > > > > > a
>> > >> >> >> > > > > > > > > > > unique identifier for the provided setup.
>> It's
>> > >> >> essentially the
>> > >> >> >> > > > > > > > version
>> > >> >> >> > > > > > > > > ID
>> > >> >> >> > > > > > > > > > > of the currently deployed resource
>> requirement
>> > >> >> configuration.
>> > >> >> >> > > > > > Flink
>> > >> >> >> > > > > > > > > > doesn't
>> > >> >> >> > > > > > > > > > > know whether the external system would use
>> the
>> > >> >> provided
>> > >> >> >> > > > > > information
>> > >> >> >> > > > > > > > in
>> > >> >> >> > > > > > > > > > some
>> > >> >> >> > > > > > > > > > > way to derive a new set of resource
>> requirements
>> > >> >> for this job.
>> > >> >> >> > > > > > The
>> > >> >> >> > > > > > > > > > > subsequent PATCH request with updated
>> resource
>> > >> >> requirements
>> > >> >> >> > > > > would
>> > >> >> >> > > > > > > > > include
>> > >> >> >> > > > > > > > > > > the previously retrieved resource ID
>> <ABC>. The
>> > >> >> PATCH call
>> > >> >> >> > > > > would
>> > >> >> >> > > > > > > fail
>> > >> >> >> > > > > > > > > if
>> > >> >> >> > > > > > > > > > > there was a concurrent PATCH call in
>> between
>> > >> >> indicating to the
>> > >> >> >> > > > > > > > external
>> > >> >> >> > > > > > > > > > > system that the resource requirements were
>> > >> >> concurrently
>> > >> >> >> > > > > updated.
>> > >> >> >> > > > > > > > > > > - How often do we allow resource
>> requirements
>> > >> to be
>> > >> >> changed?
>> > >> >> >> > > > > That
>> > >> >> >> > > > > > > > > > question
>> > >> >> >> > > > > > > > > > > might make my previous comment on the
>> resource
>> > >> ID
>> > >> >> obsolete
>> > >> >> >> > > > > > because
>> > >> >> >> > > > > > > we
>> > >> >> >> > > > > > > > > > could
>> > >> >> >> > > > > > > > > > > just make any PATCH call fail if there was
>> a
>> > >> >> resource
>> > >> >> >> > > > > requirement
>> > >> >> >> > > > > > > > > update
>> > >> >> >> > > > > > > > > > > within a certain time frame before the
>> request.
>> > >> But
>> > >> >> such a time
>> > >> >> >> > > > > > > > period
>> > >> >> >> > > > > > > > > is
>> > >> >> >> > > > > > > > > > > something we might want to make
>> configurable
>> > >> then,
>> > >> >> I guess.
>> > >> >> >> > > > > > > > > > > - Versioning the JobGraph in the
>> JobGraphStore
>> > >> >> rather than
>> > >> >> >> > > > > > > > overwriting
>> > >> >> >> > > > > > > > > it
>> > >> >> >> > > > > > > > > > > might be an idea. This would enable us to
>> > >> provide
>> > >> >> resource
>> > >> >> >> > > > > > > > requirement
>> > >> >> >> > > > > > > > > > > changes in the UI or through the REST API.
>> It is
>> > >> >> related to a
>> > >> >> >> > > > > > > problem
>> > >> >> >> > > > > > > > > > > around keeping track of the exception
>> history
>> > >> >> within the
>> > >> >> >> > > > > > > > > > AdaptiveScheduler
>> > >> >> >> > > > > > > > > > > and also having to consider multiple
>> versions
>> > >> of a
>> > >> >> JobGraph.
>> > >> >> >> > > > > But
>> > >> >> >> > > > > > > for
>> > >> >> >> > > > > > > > > that
>> > >> >> >> > > > > > > > > > > one, we use the ExecutionGraphInfoStore
>> right
>> > >> now.
>> > >> >> >> > > > > > > > > > > - Updating the JobGraph in the
>> JobGraphStore
>> > >> makes
>> > >> >> sense. I'm
>> > >> >> >> > > > > > just
>> > >> >> >> > > > > > > > > > > wondering whether we bundle two things
>> together
>> > >> >> that are
>> > >> >> >> > > > > actually
>> > >> >> >> > > > > > > > > > separate:
>> > >> >> >> > > > > > > > > > > The business logic and the execution
>> > >> configuration
>> > >> >> (the
>> > >> >> >> > > > > resource
>> > >> >> >> > > > > > > > > > > requirements). I'm aware that this is not a
>> > >> flaw of
>> > >> >> the current
>> > >> >> >> > > > > > > FLIP
>> > >> >> >> > > > > > > > > but
>> > >> >> >> > > > > > > > > > > rather something that was not necessary to
>> > >> address
>> > >> >> in the past
>> > >> >> >> > > > > > > > because
>> > >> >> >> > > > > > > > > > the
>> > >> >> >> > > > > > > > > > > JobGraph was kind of static. I don't
>> remember
>> > >> >> whether that was
>> > >> >> >> > > > > > > > already
>> > >> >> >> > > > > > > > > > > discussed while working on the
>> AdaptiveScheduler
>> > >> >> for FLIP-160
>> > >> >> >> > > > > > [1].
>> > >> >> >> > > > > > > > > Maybe,
>> > >> >> >> > > > > > > > > > > I'm missing some functionality here that
>> > >> requires
>> > >> >> us to have
>> > >> >> >> > > > > > > > everything
>> > >> >> >> > > > > > > > > > in
>> > >> >> >> > > > > > > > > > > one place. But it feels like updating the
>> entire
>> > >> >> JobGraph which
>> > >> >> >> > > > > > > could
>> > >> >> >> > > > > > > > > be
>> > >> >> >> > > > > > > > > > > actually a "config change" is not
>> reasonable.
>> > >> >> ...also
>> > >> >> >> > > > > considering
>> > >> >> >> > > > > > > the
>> > >> >> >> > > > > > > > > > > amount of data that can be stored in a
>> > >> >> ConfigMap/ZooKeeper node
>> > >> >> >> > > > > > if
>> > >> >> >> > > > > > > > > > > versioning the resource requirement change
>> as
>> > >> >> proposed in my
>> > >> >> >> > > > > > > previous
>> > >> >> >> > > > > > > > > > item
>> > >> >> >> > > > > > > > > > > is an option for us.
>> > >> >> >> > > > > > > > > > > - Updating the JobGraphStore means adding
>> more
>> > >> >> requests to the
>> > >> >> >> > > > > HA
>> > >> >> >> > > > > > > > > backend
>> > >> >> >> > > > > > > > > > > API. There were some concerns shared in the
>> > >> >> discussion thread
>> > >> >> >> > > > > [2]
>> > >> >> >> > > > > > > for
>> > >> >> >> > > > > > > > > > > FLIP-270 [3] on pressuring the k8s API
>> server in
>> > >> >> the past with
>> > >> >> >> > > > > > too
>> > >> >> >> > > > > > > > many
>> > >> >> >> > > > > > > > > > > calls. Even though it's more likely to be
>> > >> caused by
>> > >> >> >> > > > > > checkpointing,
>> > >> >> >> > > > > > > I
>> > >> >> >> > > > > > > > > > still
>> > >> >> >> > > > > > > > > > > wanted to bring it up. We're working on a
>> > >> >> standardized
>> > >> >> >> > > > > > performance
>> > >> >> >> > > > > > > > test
>> > >> >> >> > > > > > > > > > to
>> > >> >> >> > > > > > > > > > > prepare going forward with FLIP-270 [3]
>> right
>> > >> now.
>> > >> >> >> > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > Best,
>> > >> >> >> > > > > > > > > > > Matthias
>> > >> >> >> > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > [1]
>> > >> >> >> > > > > > > > > > >
>> > >> >> >> > > > > > > > > > >
>> > >> >> >> > > > > > > > > >
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > >
>> > >> >> >> > > > > >
>> > >> >> >> > > > >
>> > >> >>
>> > >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
>> > >> >> >> > > > > > > > > > > [2]
>> > >> >> >> > > > > > >
>> > >> >> https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
>> > >> >> >> > > > > > > > > > > [3]
>> > >> >> >> > > > > > > > > > >
>> > >> >> >> > > > > > > > > > >
>> > >> >> >> > > > > > > > > >
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > >
>> > >> >> >> > > > > >
>> > >> >> >> > > > >
>> > >> >>
>> > >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
>> > >> >> >> > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <
>> > >> >> jam.gzczy@gmail.com
>> > >> >> >> > > > > >
>> > >> >> >> > > > > > > > wrote:
>> > >> >> >> > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > > Hi David:
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > > Thank you for driving this FLIP, which
>> helps
>> > >> reduce
>> > >> >> Flink's
>> > >> >> >> > > > > shutdown
>> > >> >> >> > > > > > > time
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > > For this FLIP, I would like to share a few
>> > >> ideas
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > >    - when the number of "slots" is
>> > >> insufficient,
>> > >> >> can we
>> > >> >> >> > > > > > stop
>> > >> >> >> > > > > > > > > users
>> > >> >> >> > > > > > > > > > > >    rescaling or throw something to tell
>> user
>> > >> >> "fewer available
>> > >> >> >> > > > > > > slots
>> > >> >> >> > > > > > > > to
>> > >> >> >> > > > > > > > > > > > upgrade,
>> > >> >> >> > > > > > > > > > > >    please check out your available slots"
>> ? Or
>> > >> we
>> > >> >> could have a
>> > >> >> >> > > > > > > > request
>> > >> >> >> > > > > > > > > > > >    switch(true/false) to allow this
>> behavior
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > >    - when a user upgrades
>> job-vertex-parallelism.
>> > >> I
>> > >> >> want to have
>> > >> >> >> > > > > > an
>> > >> >> >> > > > > > > > > > > interface
>> > >> >> >> > > > > > > > > > > >    to query the current status of the
>> > >> parallelism update,
>> > >> >> so
>> > >> >> >> > > > > > that
>> > >> >> >> > > > > > > > the
>> > >> >> >> > > > > > > > > > > user
>> > >> >> >> > > > > > > > > > > > or
>> > >> >> >> > > > > > > > > > > >    program can understand the current
>> status
>> > >> >> >> > > > > > > > > > > >    - I want to have an interface to
>> query the
>> > >> >> current update
>> > >> >> >> > > > > > > > > > parallelism
>> > >> >> >> > > > > > > > > > > >    execution status. This also helps
>> similar
>> > >> to
>> > >> >> *[1] Flink
>> > >> >> >> > > > > K8S
>> > >> >> >> > > > > > > > > > Operator*
>> > >> >> >> > > > > > > > > > > >    management
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > > {
>> > >> >> >> > > > > > > > > > > >   status: Failed
>> > >> >> >> > > > > > > > > > > >   reason: "fewer available slots to
>> upgrade,
>> > >> >> please check out
>> > >> >> >> > > > > > your
>> > >> >> >> > > > > > > > > > available
>> > >> >> >> > > > > > > > > > > > slots"
>> > >> >> >> > > > > > > > > > > > }
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > >    - *Pending*: this job has joined the
>> > >> upgrade
>> > >> >> queue; it
>> > >> >> >> > > > > will
>> > >> >> >> > > > > > be
>> > >> >> >> > > > > > > > > > updated
>> > >> >> >> > > > > > > > > > > >    later
>> > >> >> >> > > > > > > > > > > >    - *Rescaling*: the job is now
>> rescaling; wait for it to
>> > >> >> finish
>> > >> >> >> > > > > > > > > > > >    - *Finished*: the upgrade has finished
>> > >> >> >> > > > > > > > > > > >    - *Failed*: something went wrong, so
>> this
>> > >> job
>> > >> >> is not
>> > >> >> >> > > > > > available to
>> > >> >> >> > > > > > > > > > upgrade
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > > I want to add the above content to the
>> FLIP;
>> > >> >> what do you
>> > >> >> >> > > > > > think?
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > >    1.
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > >
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > >
>> > >> >> >> > > > >
>> > >> >>
>> > >>
>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > > David Morávek <dm...@apache.org>
>> wrote on Fri, Feb 3, 2023
>> > >> >> at 16:42:
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > > > Hi everyone,
>> > >> >> >> > > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > > > This FLIP [1] introduces a new REST
>> API for
>> > >> >> declaring
>> > >> >> >> > > > > > resource
>> > >> >> >> > > > > > > > > > > > requirements
>> > >> >> >> > > > > > > > > > > > > for the Adaptive Scheduler. There
>> seems to
>> > >> be a
>> > >> >> clear need
>> > >> >> >> > > > > > for
>> > >> >> >> > > > > > > > this
>> > >> >> >> > > > > > > > > > API
>> > >> >> >> > > > > > > > > > > > > based on the discussion on the
>> "Reworking
>> > >> the
>> > >> >> Rescale API"
>> > >> >> >> > > > > > [2]
>> > >> >> >> > > > > > > > > > thread.
>> > >> >> >> > > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > > > Before we get started, this work is
>> heavily
>> > >> >> based on the
>> > >> >> >> > > > > > > > prototype
>> > >> >> >> > > > > > > > > > [3]
>> > >> >> >> > > > > > > > > > > > > created by Till Rohrmann, and the FLIP
>> is
>> > >> being
>> > >> >> published
>> > >> >> >> > > > > > with
>> > >> >> >> > > > > > > > his
>> > >> >> >> > > > > > > > > > > > consent.
>> > >> >> >> > > > > > > > > > > > > Big shoutout to him!
>> > >> >> >> > > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > > > Last and not least, thanks to Chesnay
>> and
>> > >> Roman
>> > >> >> for the
>> > >> >> >> > > > > > initial
>> > >> >> >> > > > > > > > > > reviews
>> > >> >> >> > > > > > > > > > > > and
>> > >> >> >> > > > > > > > > > > > > discussions.
>> > >> >> >> > > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > > > The best start would be watching a
>> short
>> > >> demo
>> > >> >> [4] that I've
>> > >> >> >> > > > > > > > > recorded,
>> > >> >> >> > > > > > > > > > > > which
>> > >> >> >> > > > > > > > > > > > > illustrates newly added capabilities
>> > >> (rescaling
>> > >> >> the running
>> > >> >> >> > > > > > > job,
>> > >> >> >> > > > > > > > > > > handing
>> > >> >> >> > > > > > > > > > > > > back resources to the RM, and session
>> > >> cluster
>> > >> >> support).
>> > >> >> >> > > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > > > The intuition behind the FLIP is being
>> able
>> > >> to
>> > >> >> define
>> > >> >> >> > > > > > resource
>> > >> >> >> > > > > > > > > > > > requirements
>> > >> >> >> > > > > > > > > > > > > ("resource boundaries") externally
>> that the
>> > >> >> >> > > > > AdaptiveScheduler
>> > >> >> >> > > > > > > can
>> > >> >> >> > > > > > > > > > > > navigate
>> > >> >> >> > > > > > > > > > > > > within. This is a building block for
>> > >> >> higher-level efforts
>> > >> >> >> > > > > > such
>> > >> >> >> > > > > > > as
>> > >> >> >> > > > > > > > > an
>> > >> >> >> > > > > > > > > > > > > external Autoscaler. The natural
>> extension
>> > >> of
>> > >> >> this work
>> > >> >> >> > > > > would
>> > >> >> >> > > > > > > be
>> > >> >> >> > > > > > > > to
>> > >> >> >> > > > > > > > > > > allow
>> > >> >> >> > > > > > > > > > > > > to specify per-vertex ResourceProfiles.
>> > >> >> >> > > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > > > Looking forward to your thoughts; any
>> > >> feedback
>> > >> >> is
>> > >> >> >> > > > > > appreciated!
>> > >> >> >> > > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > > > [1]
>> > >> >> >> > > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > >
>> > >> >> >> > > > > > > > > >
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > > > >
>> > >> >> >> > > > > > >
>> > >> >> >> > > > > >
>> > >> >> >> > > > >
>> > >> >>
>> > >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
>> > >> >> >> > > > > > > > > > > > > [2]
>> > >> >> >> > > > > > > > >
>> > >> >> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
>> > >> >> >> > > > > > > > > > > > > [3]
>> > >> >> https://github.com/tillrohrmann/flink/tree/autoscaling
>> > >> >> >> > > > > > > > > > > > > [4]
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > >
>> > >> >> >> > > > > >
>> > >> >>
>> https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
>> > >> >> >> > > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > > > Best,
>> > >> >> >> > > > > > > > > > > > > D.
>> > >> >> >> > > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > >
>> > >> >> >> > > > > > > > > > > >
> --
> Best
>
> ConradJam

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by Roman Khachatryan <ro...@apache.org>.
Hi,

Thanks for the update, I think distinguishing the rescaling behaviour and
the desired parallelism declaration is important.

Having the ability to specify a minimum parallelism might be useful in
environments with multiple jobs: the scheduler will then have the option to
stop the less suitable job.
In other setups, where the job should not be stopped at all, the user can
always set it to 0.

Regards,
Roman


On Tue, Feb 28, 2023 at 12:58 PM Maximilian Michels <mx...@apache.org> wrote:

> Hi David,
>
> Thanks for the update! We are considering using the new declarative resource
> API for autoscaling. Currently, we treat a scaling decision as a new
> deployment which means surrendering all resources to Kubernetes and
> subsequently reallocating them for the rescaled deployment. The
> declarative resource management API is a great step forward because it
> allows us to do faster and safer rescaling. Faster, because we can
> continue to run while resources are pre-allocated which minimizes
> downtime. Safer, because we can't get stuck when the desired resources
> are not available.
>
> An example with two vertices and their respective parallelisms:
>   v1: 50
>   v2: 10
> Let's assume slot sharing is disabled, so we need 60 task slots to run
> the vertices.
>
> If the autoscaler were to decide to scale up v1 and v2, it could do so
> in a safe way by using a min/max configuration:
>   v1: [min: 50, max: 70]
>   v2: [min: 10, max: 20]
> This would then need 90 task slots to run at max capacity.
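The slot arithmetic above can be sketched as follows (an illustration only; it assumes slot sharing is disabled, so every vertex needs one dedicated slot per parallel subtask):

```python
def required_slots(parallelisms):
    """Total task slots needed with slot sharing disabled:
    one slot per parallel subtask of every vertex."""
    return sum(parallelisms)

# Current deployment: v1=50, v2=10 -> 60 task slots.
print(required_slots([50, 10]))  # -> 60

# Scaling both vertices to their upper bounds: v1=70, v2=20 -> 90 slots.
print(required_slots([70, 20]))  # -> 90
```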
>
> I suppose we could further remove the min because it would always be
> safer to scale down if resources are not available than to not run at
> all [1]. In fact, I saw that the minimum bound is currently not used
> in the code you posted above [2]. Is that still planned?
>
> -Max
>
> PS: Note that originally we had assumed min == max but I think that
> would be a less safe scaling approach because we would get stuck
> waiting for resources when they are not available, e.g. k8s resource
> limits reached.
>
> [1] However, there might be costs involved with executing the
> rescaling, e.g. for using external storage like s3, especially without
> local recovery.
> [2]
> https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9
>
> On Tue, Feb 28, 2023 at 9:33 AM David Morávek <dm...@apache.org> wrote:
> >
> > Hi Everyone,
> >
> > We had some more talks about the pre-allocation of resources with @Max,
> and
> > here is the final state that we've converged to for now:
> >
> > The vital thing to note about the new API is that it's declarative,
> > meaning we're declaring the desired state to which we want our job to
> > converge. If, after the requirements update, the job no longer holds the
> > desired resources (fewer resources than the lower bound), it will be
> > canceled and transition back into the waiting-for-resources state.
> >
> > In some use cases, you might always want to rescale to the upper bound
> > (this goes along the lines of "preallocating resources" and minimizing
> the
> > number of rescales, which is especially useful with the large state).
> This
> > can be controlled by two knobs that already exist:
> >
> > 1) "jobmanager.adaptive-scheduler.min-parallelism-increase" - this
> > affects the minimal parallelism increase step of a running job; we'll
> > slightly change the semantics, and we'll trigger rescaling either once
> > this condition is met or when you hit the ceiling; setting this to a
> > high number will ensure that you always rescale to the upper bound
> >
> > 2) "jobmanager.adaptive-scheduler.resource-stabilization-timeout" - for
> new
> > and already restarting jobs, we'll always respect this timeout, which
> > allows you to wait for more resources even though you already have more
> > resources than defined in the lower bound; again, in the case we reach
> the
> > ceiling (the upper bound), we'll transition into the executing state.
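For illustration, the two existing knobs mentioned above would sit in flink-conf.yaml roughly like this (the values are made up for the sketch, not recommendations):

```yaml
# Only rescale a running job once its parallelism can grow by at least
# this step, or once the upper bound ("the ceiling") is reached; a very
# high value effectively means "only rescale to the upper bound".
jobmanager.adaptive-scheduler.min-parallelism-increase: 32

# How long new and restarting jobs keep waiting for resources beyond the
# lower bound before they start executing (reaching the upper bound ends
# the wait early).
jobmanager.adaptive-scheduler.resource-stabilization-timeout: 60s
```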
> >
> >
> > We're still planning to dig deeper in this direction with other efforts,
> > but this is already good enough and should allow us to move the FLIP
> > forward.
> >
> > WDYT? Unless there are any objections to the above, I'd like to
> > proceed to a vote.
> >
> > Best,
> > D.
> >
> > On Thu, Feb 23, 2023 at 5:39 PM David Morávek <dm...@apache.org> wrote:
> >
> > > Hi Everyone,
> > >
> > > @John
> > >
> > > This is a problem that we've spent some time trying to crack; in the
> end,
> > > we've decided to go against doing any upgrades to JobGraphStore from
> > > JobMaster to avoid having multiple writers that are guarded by
> > > different leader election locks (the Dispatcher and the JobMaster might
> > > live in different processes). The contract we've decided to choose
> > > instead is leveraging the
> > > idempotency of the endpoint and having the user of the API retry in
> case
> > > we're unable to persist new requirements in the JobGraphStore [1]. We
> > > eventually need to move JobGraphStore out of the dispatcher, but
> that's way
> > > out of the scope of this FLIP. The solution is a deliberate trade-off.
> The
> > > worst scenario is that the Dispatcher fails over in between retries,
> which
> > > would simply rescale the job to meet the previous resource requirements
> > > (more extended unavailability of underlying HA storage would have worse
> > > consequences than this). Does that answer your question?
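The retry contract described here can be sketched generically. This is a minimal illustration of why idempotency makes client-side retries safe, not the actual client code; the "persist" callable is a stand-in for the real REST call:

```python
import time

def retry_idempotent(call, attempts=5, backoff_s=0.0):
    """Retry an idempotent operation until it succeeds or attempts run out.

    This is only safe because the operation is idempotent: re-sending the
    same declared requirements after a partial failure converges to the
    same state in the JobGraphStore.
    """
    last_error = None
    for _ in range(attempts):
        try:
            return call()
        except IOError as exc:  # transient storage hiccups only
            last_error = exc
            time.sleep(backoff_s)
    raise last_error

# Illustration: a flaky "persist requirements" call that fails twice.
state = {"fails_left": 2}

def persist_requirements():
    if state["fails_left"] > 0:
        state["fails_left"] -= 1
        raise IOError("transient HA-storage hiccup")
    return "persisted"

print(retry_idempotent(persist_requirements))  # -> persisted
```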
> > >
> > > @Matthias
> > >
> > > Good catch! I'm fixing it now, thanks!
> > >
> > > [1]
> > >
> https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9#diff-a4b690fb2c4975d25b05eb4161617af0d704a85ff7b1cad19d3c817c12f1e29cR1151
> > >
> > > Best,
> > > D.
> > >
> > > On Tue, Feb 21, 2023 at 12:24 AM John Roesler <vv...@apache.org>
> wrote:
> > >
> > >> Thanks for the FLIP, David!
> > >>
> > >> I just had one small question. IIUC, the REST API PUT request will go
> > >> through the new DispatcherGateway method to be handled. Then, after
> > >> validation, the dispatcher would call the new JobMasterGateway method
> to
> > >> actually update the job.
> > >>
> > >> Which component will write the updated JobGraph? I just wanted to make
> > >> sure it's the JobMaster, because if it were the dispatcher, there
> > >> could be a race condition with the async JobMaster method.
> > >>
> > >> Thanks!
> > >> -John
> > >>
> > >> On Mon, Feb 20, 2023, at 07:34, Matthias Pohl wrote:
> > >> > Thanks for your clarifications, David. I don't have any additional
> major
> > >> > points to add. One thing about the FLIP: The RPC layer API for
> updating
> > >> the
> > >> > JRR returns a future with a JRR? I don't see value in returning a
> JRR
> > >> here
> > >> > since it's an idempotent operation? Wouldn't it be enough to return
> > >> > CompletableFuture<Void> here? Or am I missing something?
> > >> >
> > >> > Matthias
> > >> >
> > >> > On Mon, Feb 20, 2023 at 1:48 PM Maximilian Michels <mx...@apache.org>
> > >> wrote:
> > >> >
> > >> >> Thanks David! If we could get the pre-allocation working as part of
> > >> >> the FLIP, that would be great.
> > >> >>
> > >> >> Concerning the downscale case, I agree this is a special case for
> the
> > >> >> (single-job) application mode where we could re-allocate slots in a
> > >> >> way that could leave entire task managers unoccupied which we would
> > >> >> then be able to release. The goal essentially is to reduce slot
> > >> >> fragmentation on scale down by packing the slots efficiently. The
> > >> >> easiest way to add this optimization when running in application
> mode
> > >> >> would be to drop as many task managers during the restart as possible
> > >> >> while NUM_AVAILABLE_SLOTS >= NUM_REQUIRED_SLOTS stays true. We can
> > >> >> look into this independently of the FLIP.
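The packing idea can be sketched with simple arithmetic (illustrative only; a real allocator would also have to account for which slots sit on which task manager):

```python
import math

def releasable_task_managers(required_slots, num_task_managers, slots_per_tm):
    """How many TMs could be handed back while the remaining slots still
    cover the job's requirements (assuming slots can be packed freely)."""
    needed_tms = math.ceil(required_slots / slots_per_tm)
    return max(0, num_task_managers - needed_tms)

# 60 required slots on 18 TMs with 4 slots each (72 slots total):
# 15 TMs suffice, so 3 TMs could be released on the restart.
print(releasable_task_managers(60, 18, 4))  # -> 3
```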
> > >> >>
> > >> >> Feel free to start the vote.
> > >> >>
> > >> >> -Max
> > >> >>
> > >> >> On Mon, Feb 20, 2023 at 9:10 AM David Morávek <dm...@apache.org>
> wrote:
> > >> >> >
> > >> >> > Hi everyone,
> > >> >> >
> > >> >> > Thanks for the feedback! I've updated the FLIP to use idempotent
> PUT
> > >> API
> > >> >> instead of PATCH and to properly handle lower bound settings, to
> > >> support
> > >> >> the "pre-allocation" of the resources.
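The difference between the two verbs can be illustrated with plain dictionaries (the vertex IDs and field names below are made up for the sketch, not the FLIP's final payload):

```python
def put(_current, payload):
    # PUT: the payload fully replaces the declared requirements,
    # which makes the operation idempotent.
    return dict(payload)

def patch(current, payload):
    # PATCH: a partial merge; the result depends on the previous state.
    merged = dict(current)
    merged.update(payload)
    return merged

full = {
    "v1": {"lowerBound": 1, "upperBound": 4},
    "v2": {"lowerBound": 1, "upperBound": 2},
}

# PUT converges no matter how often the same payload is applied.
assert put(put({}, full), full) == put({}, full)

# A partial PATCH silently keeps v2's old requirements around, which is
# one reason to require the whole payload on every update.
assert patch(full, {"v1": {"lowerBound": 2, "upperBound": 8}})["v2"] == full["v2"]
print("ok")
```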
> > >> >> >
> > >> >> > @Max
> > >> >> >
> > >> >> > > How hard would it be to address this issue in the FLIP?
> > >> >> >
> > >> >> > I've included this in the FLIP. It might not be too hard to
> implement
> > >> >> this in the end.
> > >> >> >
> > >> >> > > B) drop as many superfluous task managers as needed
> > >> >> >
> > >> >> > I've intentionally left this part out for now because this
> ultimately
> > >> >> needs to be the responsibility of the Resource Manager. After all,
> in
> > >> the
> > >> >> Session Cluster scenario, the Scheduler doesn't have the bigger
> > >> picture of
> > >> >> other tasks of other jobs running on those TMs. This will most
> likely
> > >> be a
> > >> >> topic for another FLIP.
> > >> >> >
> > >> >> > WDYT? If there are no other questions or concerns, I'd like to
> start
> > >> the
> > >> >> vote on Wednesday.
> > >> >> >
> > >> >> > Best,
> > >> >> > D.
> > >> >> >
> > >> >> > On Wed, Feb 15, 2023 at 3:34 PM Maximilian Michels <
> mxm@apache.org>
> > >> >> wrote:
> > >> >> >>
> > >> >> >> I missed that the FLIP states:
> > >> >> >>
> > >> >> >> > Currently, even though we’d expose the lower bound for
> clarity and
> > >> >> API completeness, we won’t allow setting it to any other value
> than one
> > >> >> until we have full support throughout the stack.
> > >> >> >>
> > >> >> >> How hard would it be to address this issue in the FLIP?
> > >> >> >>
> > >> >> >> There is not much value in offering a lower bound that won't be
> > >> >> >> respected / throws an error when it is set. If we had support
> for a
> > >> >> >> lower bound, we could enforce a resource contract externally via
> > >> >> >> setting lowerBound == upperBound. That ties back to the Rescale
> API
> > >> >> >> discussion we had. I want to better understand what the major
> > >> concerns
> > >> >> >> would be around allowing this.
> > >> >> >>
> > >> >> >> Just to outline how I imagine the logic to work:
> > >> >> >>
> > >> >> >> A) The resource constraints are already met => Nothing changes
> > >> >> >> B) More resources available than required => Cancel the job,
> drop as
> > >> >> >> many superfluous task managers as needed, restart the job
> > >> >> >> C) Less resources available than required => Acquire new task
> > >> >> >> managers, wait for them to register, cancel and restart the job
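The outlined A/B/C logic can be sketched as a simple decision function (a simplification that treats "constraints met" as an exact slot match):

```python
def rescale_action(available_slots, required_slots):
    """Map the available/required slot counts to the outlined cases."""
    if available_slots == required_slots:
        return "A: constraints met, nothing changes"
    if available_slots > required_slots:
        return "B: cancel, drop superfluous task managers, restart"
    return "C: acquire task managers, wait for registration, cancel and restart"

print(rescale_action(60, 60))  # -> A: constraints met, nothing changes
```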
> > >> >> >>
> > >> >> >> I'm open to helping out with the implementation.
> > >> >> >>
> > >> >> >> -Max
> > >> >> >>
> > >> >> >> On Mon, Feb 13, 2023 at 7:45 PM Maximilian Michels <
> mxm@apache.org>
> > >> >> wrote:
> > >> >> >> >
> > >> >> >> > Based on further discussion I had with Chesnay on this PR
> [1], I
> > >> think
> > >> >> >> > jobs would currently go into a restarting state after the
> resource
> > >> >> >> > requirements have changed. This wouldn't achieve what we had
> in
> > >> mind,
> > >> >> >> > i.e. sticking to the old resource requirements until enough
> slots
> > >> are
> > >> >> >> > available to fulfil the new resource requirements. So this may
> > >> not be
> > >> >> >> > 100% what we need but it could be extended to do what we want.
> > >> >> >> >
> > >> >> >> > -Max
> > >> >> >> >
> > >> >> >> > [1]
> > >> https://github.com/apache/flink/pull/21908#discussion_r1104792362
> > >> >> >> >
> > >> >> >> > On Mon, Feb 13, 2023 at 7:16 PM Maximilian Michels <
> > >> mxm@apache.org>
> > >> >> wrote:
> > >> >> >> > >
> > >> >> >> > > Hi David,
> > >> >> >> > >
> > >> >> >> > > This is awesome! Great writeup and demo. This is pretty much
> > >> what we
> > >> >> >> > > need for the autoscaler as part of the Flink Kubernetes
> operator
> > >> >> [1].
> > >> >> >> > > Scaling Flink jobs effectively is hard but fortunately we
> have
> > >> >> solved
> > >> >> >> > > the issue as part of the Flink Kubernetes operator. The only
> > >> >> critical
> > >> >> >> > > piece we are missing is a better way to execute scaling
> > >> decisions,
> > >> >> as
> > >> >> >> > > discussed in [2].
> > >> >> >> > >
> > >> >> >> > > Looking at your proposal, we would set lowerBound ==
> upperBound
> > >> for
> > >> >> >> > > the parallelism because we want to fully determine the
> > >> parallelism
> > >> >> >> > > externally based on the scaling metrics. Does that sound
> right?
> > >> >> >> > >
> > >> >> >> > > What is the timeline for these changes? Is there a JIRA?
> > >> >> >> > >
> > >> >> >> > > Cheers,
> > >> >> >> > > Max
> > >> >> >> > >
> > >> >> >> > > [1]
> > >> >>
> > >>
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
> > >> >> >> > > [2]
> > >> >> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > >> >> >> > >
> > >> >> >> > > On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu <
> > >> xiangyu0xf@gmail.com>
> > >> >> wrote:
> > >> >> >> > > >
> > >> >> >> > > > Hi David,
> > >> >> >> > > >
> > >> >> > > > Thanks for your reply. I think your response totally makes
> > >> >> > > > sense. This FLIP targets declaring the required resources to the
> > >> >> > > > ResourceManager instead of using the ResourceManager to
> > >> >> > > > add/remove TMs directly.
> > >> >> >> > > >
> > >> >> >> > > > Best,
> > >> >> >> > > > Xiangyu
> > >> >> >> > > >
> > >> >> >> > > >
> > >> >> >> > > >
> > >> >> >> > > > David Morávek <da...@gmail.com> 于2023年2月13日周一
> > >> 15:46写道:
> > >> >> >> > > >
> > >> >> >> > > > > Hi everyone,
> > >> >> >> > > > >
> > >> >> >> > > > > @Shammon
> > >> >> >> > > > >
> > >> >> >> > > > > I'm not entirely sure what "config file" you're
> referring
> > >> to.
> > >> >> You can, of
> > >> >> >> > > > > course, override the default parallelism in
> > >> "flink-conf.yaml",
> > >> >> but for
> > >> >> >> > > > > sinks and sources, the parallelism needs to be tweaked
> on
> > >> the
> > >> >> connector
> > >> >> >> > > > > level ("WITH" statement).
> > >> >> >> > > > >
> > >> >> >> > > > > This is something that should be achieved with tooling
> > >> around
> > >> >> Flink. We
> > >> >> >> > > > > want to provide an API on the lowest level that
> generalizes
> > >> >> well. Achieving
> > >> >> >> > > > > what you're describing should be straightforward with
> this
> > >> API.
> > >> >> >> > > > >
> > >> >> >> > > > > @Xiangyu
> > >> >> >> > > > >
> > >> >> >> > > > > Is it possible for this REST API to declare TM
> resources in
> > >> the
> > >> >> future?
> > >> >> >> > > > >
> > >> >> >> > > > >
> > >> >> >> > > > > Would you like to add/remove TMs if you use an active
> > >> Resource
> > >> >> Manager?
> > >> >> >> > > > > This would be out of the scope of this effort since it
> > >> targets
> > >> >> the
> > >> >> >> > > > > scheduler component only (we make no assumptions about
> the
> > >> used
> > >> >> Resource
> > >> >> >> > > > > Manager). Also, the AdaptiveScheduler is only intended
> to be
> > >> >> used for
> > >> >> >> > > > > Streaming.
> > >> >> >> > > > >
> > >> >> >> > > > >  And for streaming jobs, I'm wondering if there is any
> > >> >> situation we need to
> > >> >> >> > > > > > rescale the TM resources of a flink cluster at first
> and
> > >> then
> > >> >> the
> > >> >> >> > > > > adaptive
> > >> >> >> > > > > > scheduler will rescale the per-vertex ResourceProfiles
> > >> >> accordingly.
> > >> >> >> > > > > >
> > >> >> >> > > > >
> > >> >> >> > > > > We plan on adding support for the ResourceProfiles
> (dynamic
> > >> slot
> > >> >> >> > > > > allocation) as the next step. Again we won't make any
> > >> >> assumptions about the
> > >> >> >> > > > > used Resource Manager. In other words, this effort ends
> by
> > >> >> declaring
> > >> >> >> > > > > desired resources to the Resource Manager.
> > >> >> >> > > > >
> > >> >> >> > > > > Does that make sense?
> > >> >> >> > > > >
> > >> >> >> > > > > @Matthias
> > >> >> >> > > > >
> > >> >> >> > > > > We've done another pass on the proposed API and
> currently
> > >> lean
> > >> >> towards
> > >> >> >> > > > > having an idempotent PUT API.
> > >> >> >> > > > > - We don't care too much about multiple writers'
> scenarios
> > >> in
> > >> >> terms of who
> > >> >> >> > > > > can write an authoritative payload; this is up to the
> user
> > >> of
> > >> >> the API to
> > >> >> >> > > > > figure out
> > >> >> >> > > > > - It's indeed tricky to achieve atomicity with PATCH
> API;
> > >> >> switching to PUT
> > >> >> >> > > > > API seems to do the trick
> > >> >> >> > > > > - We won't allow partial "payloads" anymore, meaning you
> > >> need
> > >> >> to define
> > >> >> >> > > > > requirements for all vertices in the JobGraph; This is
> > >> >> completely fine for
> > >> >> >> > > > > the programmatic workflows. For DEBUG / DEMO purposes,
> you
> > >> can
> > >> >> use the GET
> > >> >> >> > > > > endpoint and tweak the response to avoid writing the
> whole
> > >> >> payload by hand.
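The GET-tweak-PUT round trip could look like this as pure data manipulation (the vertex IDs and field names are assumptions for illustration; the point is only that the PUT body must always carry all vertices):

```python
import copy

# Hypothetical response of the GET endpoint for a job with two vertices.
current = {
    "v1": {"parallelism": {"lowerBound": 1, "upperBound": 4}},
    "v2": {"parallelism": {"lowerBound": 1, "upperBound": 2}},
}

def with_upper_bound(requirements, vertex_id, upper):
    """Return a full payload with one vertex's upper bound changed;
    partial payloads are not allowed, so everything else is kept."""
    updated = copy.deepcopy(requirements)
    updated[vertex_id]["parallelism"]["upperBound"] = upper
    return updated

payload = with_upper_bound(current, "v1", 8)
print(payload["v1"]["parallelism"]["upperBound"])  # -> 8
```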
> > >> >> >> > > > >
> > >> >> >> > > > > WDYT?
> > >> >> >> > > > >
> > >> >> >> > > > >
> > >> >> >> > > > > Best,
> > >> >> >> > > > > D.
> > >> >> >> > > > >
> > >> >> >> > > > > On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <
> > >> >> xiangyu0xf@gmail.com>
> > >> >> >> > > > > wrote:
> > >> >> >> > > > >
> > >> >> >> > > > > > Hi David,
> > >> >> >> > > > > >
> > >> >> >> > > > > > Thanks for creating this flip. I think this work it is
> > >> very
> > >> >> useful,
> > >> >> >> > > > > > especially in autoscaling scenario.  I would like to
> share
> > >> >> some questions
> > >> >> >> > > > > > from my view.
> > >> >> >> > > > > >
> > >> >> >> > > > > > 1, Is it possible for this REST API to declare TM
> > >> resources
> > >> >> in the
> > >> >> >> > > > > future?
> > >> >> >> > > > > > I'm asking because we are building the autoscaling
> feature
> > >> >> for Flink OLAP
> > >> >> >> > > > > > Session Cluster in ByteDance. We need to rescale the
> > >> >> cluster's resource
> > >> >> >> > > > > on
> > >> >> >> > > > > > TM level instead of Job level. It would be very
> helpful
> > >> if we
> > >> >> have a REST
> > >> >> >> > > > > > API for out external Autoscaling service to use.
> > >> >> >> > > > > >
> > >> >> >> > > > > > 2, And for streaming jobs, I'm wondering if there is
> any
> > >> >> situation we
> > >> >> >> > > > > need
> > >> >> >> > > > > > to rescale the TM resources of a flink cluster at
> first
> > >> and
> > >> >> then the
> > >> >> >> > > > > > adaptive scheduler will rescale the per-vertex
> > >> >> ResourceProfiles
> > >> >> >> > > > > > accordingly.
> > >> >> >> > > > > >
> > >> >> >> > > > > > best.
> > >> >> >> > > > > > Xiangyu
> > >> >> >> > > > > >
> > >> >> >> > > > > > Shammon FY <zj...@gmail.com> 于2023年2月9日周四 11:31写道:
> > >> >> >> > > > > >
> > >> >> >> > > > > > > Hi David
> > >> >> >> > > > > > >
> > >> >> >> > > > > > > Thanks for your answer.
> > >> >> >> > > > > > >
> > >> >> >> > > > > > > > Can you elaborate more about how you'd intend to
> use
> > >> the
> > >> >> endpoint? I
> > >> >> >> > > > > > > think we can ultimately introduce a way of
> re-declaring
> > >> >> "per-vertex
> > >> >> >> > > > > > > defaults," but I'd like to understand the use case
> bit
> > >> more
> > >> >> first.
> > >> >> >> > > > > > >
> > >> >> >> > > > > > > For this issue, I mainly consider the consistency of
> > >> user
> > >> >> configuration
> > >> >> >> > > > > > and
> > >> >> >> > > > > > > job runtime. For sql jobs, users usually set
> specific
> > >> >> parallelism for
> > >> >> >> > > > > > > source and sink, and set a global parallelism for
> other
> > >> >> operators.
> > >> >> >> > > > > These
> > >> >> >> > > > > > > config items are stored in a config file. For some
> > >> >> high-priority jobs,
> > >> >> >> > > > > > > users may want to manage them manually.
> > >> >> >> > > > > > > 1. When users need to scale the parallelism, they
> should
> > >> >> update the
> > >> >> >> > > > > > config
> > >> >> >> > > > > > > file and restart flink job, which may take a long
> time.
> > >> >> >> > > > > > > 2. After providing the REST API, users can just
> send a
> > >> >> request to the
> > >> >> >> > > > > job
> > >> >> >> > > > > > > via REST API quickly after updating the config file.
> > >> >> >> > > > > > > The configuration in the running job and config file
> > >> should
> > >> >> be the
> > >> >> >> > > > > same.
> > >> >> >> > > > > > > What do you think of this?
> > >> >> >> > > > > > >
> > >> >> >> > > > > > > best.
> > >> >> >> > > > > > > Shammon
> > >> >> >> > > > > > >
> > >> >> >> > > > > > >
> > >> >> >> > > > > > >
> > >> >> >> > > > > > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <
> > >> >> david.moravek@gmail.com>
> > >> >> >> > > > > > > wrote:
> > >> >> >> > > > > > >
> > >> >> >> > > > > > > > Hi everyone,
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > Let's try to answer the questions one by one.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > *@ConradJam*
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > when the number of "slots" is insufficient, can
> we can
> > >> >> stop users
> > >> >> >> > > > > > > rescaling
> > >> >> >> > > > > > > > > or throw something to tell user "less avaliable
> > >> slots
> > >> >> to upgrade,
> > >> >> >> > > > > > > please
> > >> >> >> > > > > > > > > checkout your alivalbe slots" ?
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > The main property of AdaptiveScheduler is that it
> can
> > >> >> adapt to
> > >> >> >> > > > > > "available
> > >> >> >> > > > > > > > resources," which means you're still able to make
> > >> >> progress even
> > >> >> >> > > > > though
> > >> >> >> > > > > > > you
> > >> >> >> > > > > > > > didn't get all the slots you've asked for. Let's
> break
> > >> >> down the pros
> > >> >> >> > > > > > and
> > >> >> >> > > > > > > > cons of this property.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > - (plus) If you lose a TM for some reason, you can
> > >> still
> > >> >> recover even
> > >> >> >> > > > > > if
> > >> >> >> > > > > > > it
> > >> >> >> > > > > > > > doesn't come back. We still need to give it some
> time
> > >> to
> > >> >> eliminate
> > >> >> >> > > > > > > > unnecessary rescaling, which can be controlled by
> > >> setting
> > >> >> >> > > > > > > > "resource-stabilization-timeout."
> > >> >> >> > > > > > > > - (plus) The resources can arrive with a
> significant
> > >> >> delay. For
> > >> >> >> > > > > > example,
> > >> >> >> > > > > > > > you're unable to spawn enough TMs on time because
> > >> you've
> > >> >> run out of
> > >> >> >> > > > > > > > resources in your k8s cluster, and you need to
> wait
> > >> for
> > >> >> the cluster
> > >> >> >> > > > > > auto
> > >> >> >> > > > > > > > scaler to kick in and add new nodes to the
> cluster. In
> > >> >> this scenario,
> > >> >> >> > > > > > > > you'll be able to start making progress faster,
> at the
> > >> >> cost of
> > >> >> >> > > > > multiple
> > >> >> >> > > > > > > > rescalings (once the remaining resources arrive).
> > >> >> >> > > > > > > > - (plus) This plays well with the declarative
> manner
> > >> of
> > >> >> today's
> > >> >> >> > > > > > > > infrastructure. For example, you tell k8s that you
> > >> need
> > >> >> 10 TMs, and
> > >> >> >> > > > > > > you'll
> > >> >> >> > > > > > > > eventually get them.
> > >> >> >> > > > > > > > - (minus) In the case of large state jobs, the
> cost of
> > >> >> multiple
> > >> >> >> > > > > > > rescalings
> > >> >> >> > > > > > > > might outweigh the above.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > We've already touched on the solution to this
> problem
> > >> on
> > >> >> the FLIP.
> > >> >> >> > > > > > Please
> > >> >> >> > > > > > > > notice the parallelism knob being a range with a
> lower
> > >> >> and upper
> > >> >> >> > > > > bound.
> > >> >> >> > > > > > > > Setting both the lower and upper bound to the same
> > >> value
> > >> >> could give
> > >> >> >> > > > > the
> > >> >> >> > > > > > > > behavior you're describing at the cost of giving
> up
> > >> some
> > >> >> properties
> > >> >> >> > > > > > that
> > >> >> >> > > > > > > AS
> > >> >> >> > > > > > > > gives you (you'd be falling back to the
> > >> >> DefaultScheduler's behavior).
> > >> >> >> > > > > > > >
> > >> >> > > > > > > > > when the user upgrades the job vertex parallelism, I
> > >> >> > > > > > > > > want to have an interface to query the status of the
> > >> >> > > > > > > > > parallelism update, so that the user or a program can
> > >> >> > > > > > > > > understand the current status
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > This is a misunderstanding. We're not introducing
> the
> > >> >> RESCALE
> > >> >> >> > > > > endpoint.
> > >> >> >> > > > > > > > This endpoint allows you to re-declare the
> resources
> > >> >> needed to run
> > >> >> >> > > > > the
> > >> >> >> > > > > > > job.
> > >> >> >> > > > > > > > Once you reach the desired resources (you get more
> > >> >> resources than the
> > >> >> >> > > > > > > lower
> > >> >> >> > > > > > > > bound defines), your job will run.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > We can expose a similar endpoint to "resource
> > >> >> requirements" to give
> > >> >> >> > > > > you
> > >> >> >> > > > > > > an
> > >> >> >> > > > > > > > overview of the resources the vertices already
> have.
> > >> You
> > >> >> can already
> > >> >> >> > > > > > get
> > >> >> >> > > > > > > > this from the REST API, so exposing this in yet
> > >> another
> > >> >> way should be
> > >> >> >> > > > > > > > considered carefully.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > *@Matthias*
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > I'm wondering whether it makes sense to add some
> kind
> > >> of
> > >> >> resource ID
> > >> >> >> > > > > to
> > >> >> >> > > > > > > the
> > >> >> >> > > > > > > > > REST API.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > That's a good question. I want to think about
> that and
> > >> >> get back to
> > >> >> >> > > > > the
> > >> >> >> > > > > > > > question later. My main struggle when thinking
> about
> > >> this
> > >> >> is, "if
> > >> >> >> > > > > this
> > >> >> >> > > > > > > > would be an idempotent POST endpoint," would it
> be any
> > >> >> different?
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > How often do we allow resource requirements to be
> > >> changed?
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > There shall be no rate limiting on the FLINK
> side. If
> > >> >> this is
> > >> >> >> > > > > something
> > >> >> >> > > > > > > > your environment needs, you can achieve it on a
> > >> different
> > >> >> layer ("we
> > >> >> >> > > > > > > can't
> > >> >> >> > > > > > > > have FLINK to do everything").
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > Versioning the JobGraph in the JobGraphStore
> rather
> > >> than
> > >> >> overwriting
> > >> >> >> > > > > it
> > >> >> >> > > > > > > > > might be an idea.
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > This sounds interesting since it would be closer
> to
> > >> the
> > >> >> JobGraph
> > >> >> >> > > > > being
> > >> >> >> > > > > > > > immutable. The main problem I see here is that
> this
> > >> would
> > >> >> introduce a
> > >> >> >> > > > > > > > BW-incompatible change so it might be a topic for
> > >> >> follow-up FLIP.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > I'm just wondering whether we bundle two things
> > >> together
> > >> >> that are
> > >> >> >> > > > > > > actually
> > >> >> >> > > > > > > > > separate
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > Yup, this is how we think about it as well. The
> main
> > >> >> question is,
> > >> >> >> > > > > "who
> > >> >> >> > > > > > > > should be responsible for bookkeeping 1) the
> JobGraph
> > >> and
> > >> >> 2) the
> > >> >> >> > > > > > > > JobResourceRequirements". The JobMaster would be
> the
> > >> >> right place for
> > >> >> >> > > > > > > both,
> > >> >> >> > > > > > > > but it's currently not the case, and we're tightly
> > >> >> coupling the
> > >> >> >> > > > > > > dispatcher
> > >> >> >> > > > > > > > with the JobMaster.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > Initially, we tried to introduce a separate HA
> > >> component
> > >> >> in JobMaster
> > >> >> >> > > > > > for
> > >> >> >> > > > > > > > bookkeeping the JobResourceRequirements, but that
> > >> proved
> > >> >> to be a more
> > >> >> >> > > > > > > > significant effort adding additional mess to the
> > >> already
> > >> >> messy HA
> > >> >> >> > > > > > > > ecosystem. Another approach we've discussed was
> > >> mutating
> > >> >> the JobGraph
> > >> >> >> > > > > > and
> > >> >> >> > > > > > > > setting JRR into the JobGraph structure itself.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > The middle ground for keeping this effort
> reasonably
> > >> >> sized and not
> > >> >> >> > > > > > > > violating "we want to keep JG immutable" too much
> is
> > >> >> keeping the
> > >> >> >> > > > > > > > JobResourceRequirements separate as an internal
> config
> > >> >> option in
> > >> >> >> > > > > > > JobGraph's
> > >> >> >> > > > > > > > configuration.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > We ultimately need to rethink the tight coupling
> of
> > >> >> Dispatcher and
> > >> >> >> > > > > > > > JobMaster, but it needs to be a separate effort.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > ...also considering the amount of data that can be
> > >> stored
> > >> >> in a
> > >> >> >> > > > > > > > > ConfigMap/ZooKeeper node if versioning the
> resource
> > >> >> requirement
> > >> >> >> > > > > > change
> > >> >> >> > > > > > > as
> > >> >> >> > > > > > > > > proposed in my previous item is an option for
> us.
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > AFAIK we're only storing pointers to the S3
> objects
> > >> in HA
> > >> >> metadata,
> > >> >> >> > > > > so
> > >> >> >> > > > > > we
> > >> >> >> > > > > > > > should be okay with having larger structures for
> now.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > Updating the JobGraphStore means adding more
> requests
> > >> to
> > >> >> the HA
> > >> >> >> > > > > backend
> > >> >> >> > > > > > > > API.
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > It's fine unless you intend to override the
> resource
> > >> >> requirements a
> > >> >> >> > > > > few
> > >> >> >> > > > > > > > times per second.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > *@Shammon*
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > How about adding some more information such as
> vertex
> > >> type
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > Since it was intended as a "debug" endpoint, it
> makes
> > >> >> complete sense!
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >  For sql jobs, we always use a unified
> parallelism for
> > >> >> most vertices.
> > >> >> >> > > > > > Can
> > >> >> >> > > > > > > > > we provide them with a more convenient setting
> > >> method
> > >> >> instead of
> > >> >> >> > > > > each
> > >> >> >> > > > > > > > one?
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > I completely feel with this. The main thoughts
> when
> > >> >> designing the API
> > >> >> >> > > > > > > were:
> > >> >> >> > > > > > > > - We want to keep it clean and easy to understand.
> > >> >> >> > > > > > > > - Global parallelism can be modeled using
> per-vertex
> > >> >> parallelism but
> > >> >> >> > > > > > not
> > >> >> >> > > > > > > > the other way around.
> > >> >> >> > > > > > > > - The API will be used by external tooling
> (operator,
> > >> >> auto scaler).
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > Can you elaborate more about how you'd intend to
> use
> > >> the
> > >> >> endpoint? I
> > >> >> >> > > > > > > think
> > >> >> >> > > > > > > > we can ultimately introduce a way of re-declaring
> > >> >> "per-vertex
> > >> >> >> > > > > > defaults,"
> > >> >> >> > > > > > > > but I'd like to understand the use case bit more
> > >> first.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > *@Weijie*
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > What is the default value here (based on what
> > >> >> configuration), or just
> > >> >> >> > > > > > > > > infinite?
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > Currently, for the lower bound, it's always one,
> and
> > >> for
> > >> >> the upper
> > >> >> >> > > > > > bound,
> > >> >> >> > > > > > > > it's either parallelism (if defined) or the
> > >> >> maxParallelism of the
> > >> >> >> > > > > > vertex
> > >> >> >> > > > > > > in
> > >> >> >> > > > > > > > JobGraph. This question might be another signal
> for
> > >> >> making the
> > >> >> >> > > > > defaults
> > >> >> >> > > > > > > > explicit (see the answer to Shammon's question
> above).
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > Thanks, everyone, for your initial thoughts!
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > Best,
> > >> >> >> > > > > > > > D.
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <
> > >> >> guoweijiereswqa@gmail.com
> > >> >> >> > > > > >
> > >> >> >> > > > > > > > wrote:
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > > > > Thanks David for driving this. This is a very
> > >> valuable
> > >> >> work,
> > >> >> >> > > > > > especially
> > >> >> >> > > > > > > > for
> > >> >> >> > > > > > > > > cloud native environment.
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > >> How about adding some more information such
> as
> > >> >> vertex type
> > >> >> >> > > > > > > > > (SOURCE/MAP/JOIN and .etc) in the response of
> `get
> > >> jobs
> > >> >> >> > > > > > > > > resource-requirements`? For users, only
> vertex-id
> > >> may
> > >> >> be difficult
> > >> >> >> > > > > to
> > >> >> >> > > > > > > > > understand.
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > +1 for this suggestion, including jobvertex's
> name
> > >> in
> > >> >> the response
> > >> >> >> > > > > > body
> > >> >> >> > > > > > > > is
> > >> >> >> > > > > > > > > more
> > >> >> >> > > > > > > > > user-friendly.
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > I saw this sentence in FLIP: "Setting the upper
> > >> bound
> > >> >> to -1 will
> > >> >> >> > > > > > reset
> > >> >> >> > > > > > > > the
> > >> >> >> > > > > > > > > value to the default setting."  What is the
> default
> > >> >> value here
> > >> >> >> > > > > (based
> > >> >> >> > > > > > > on
> > >> >> >> > > > > > > > > what configuration), or just infinite?
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > Best regards,
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > Weijie
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > Shammon FY <zj...@gmail.com> 于2023年2月6日周一
> > >> 18:06写道:
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > > > > Hi David
> > >> >> >> > > > > > > > > >
> > >> >> >> > > > > > > > > > Thanks for initiating this discussion. I think
> > >> >> declaring job
> > >> >> >> > > > > > resource
> > >> >> >> > > > > > > > > > requirements by REST API is very valuable. I
> just
> > >> >> left some
> > >> >> >> > > > > > comments
> > >> >> >> > > > > > > as
> > >> >> >> > > > > > > > > > followed
> > >> >> >> > > > > > > > > >
> > >> >> >> > > > > > > > > > 1) How about adding some more information
> such as
> > >> >> vertex type
> > >> >> >> > > > > > > > > > (SOURCE/MAP/JOIN and .etc) in the response of
> `get
> > >> >> jobs
> > >> >> >> > > > > > > > > > resource-requirements`? For users, only
> vertex-id
> > >> may
> > >> >> be
> > >> >> >> > > > > difficult
> > >> >> >> > > > > > to
> > >> >> >> > > > > > > > > > understand.
> > >> >> >> > > > > > > > > >
> > >> >> >> > > > > > > > > > 2) For sql jobs, we always use a unified
> > >> parallelism
> > >> >> for most
> > >> >> >> > > > > > > vertices.
> > >> >> >> > > > > > > > > Can
> > >> >> >> > > > > > > > > > we provide them with a more convenient setting
> > >> method
> > >> >> instead of
> > >> >> >> > > > > > each
> > >> >> >> > > > > > > > > one?
> > >> >> >> > > > > > > > > >
> > >> >> >> > > > > > > > > >
> > >> >> >> > > > > > > > > > Best,
> > >> >> >> > > > > > > > > > Shammon
> > >> >> >> > > > > > > > > >
> > >> >> >> > > > > > > > > >
> > >> >> >> > > > > > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <
> > >> >> >> > > > > > matthias.pohl@aiven.io
> > >> >> >> > > > > > > > > > .invalid>
> > >> >> >> > > > > > > > > > wrote:
> > >> >> >> > > > > > > > > >
> > >> >> >> > > > > > > > > > > Thanks David for creating this FLIP. It
> sounds
> > >> >> promising and
> > >> >> >> > > > > > useful
> > >> >> >> > > > > > > > to
> > >> >> >> > > > > > > > > > > have. Here are some thoughts from my side
> (some
> > >> of
> > >> >> them might
> > >> >> >> > > > > be
> > >> >> >> > > > > > > > > rather a
> > >> >> >> > > > > > > > > > > follow-up and not necessarily part of this
> > >> FLIP):
> > >> >> >> > > > > > > > > > > - I'm wondering whether it makes sense to
> add
> > >> some
> > >> >> kind of
> > >> >> >> > > > > > resource
> > >> >> >> > > > > > > > ID
> > >> >> >> > > > > > > > > to
> > >> >> >> > > > > > > > > > > the REST API. This would give Flink a tool
> to
> > >> >> verify the PATCH
> > >> >> >> > > > > > > > request
> > >> >> >> > > > > > > > > of
> > >> >> >> > > > > > > > > > > the external system in a compare-and-set
> kind of
> > >> >> manner. AFAIU,
> > >> >> >> > > > > > the
> > >> >> >> > > > > > > > > > process
> > >> >> >> > > > > > > > > > > requires the external system to retrieve the
> > >> >> resource
> > >> >> >> > > > > > requirements
> > >> >> >> > > > > > > > > first
> > >> >> >> > > > > > > > > > > (to retrieve the vertex IDs). A resource ID
> > >> <ABC>
> > >> >> would be sent
> > >> >> >> > > > > > > along
> > >> >> >> > > > > > > > > as
> > >> >> >> > > > > > > > > > a
> > >> >> >> > > > > > > > > > > unique identifier for the provided setup.
> It's
> > >> >> essentially the
> > >> >> >> > > > > > > > version
> > >> >> >> > > > > > > > > ID
> > >> >> >> > > > > > > > > > > of the currently deployed resource
> requirement
> > >> >> configuration.
> > >> >> >> > > > > > Flink
> > >> >> >> > > > > > > > > > doesn't
> > >> >> >> > > > > > > > > > > know whether the external system would use
> the
> > >> >> provided
> > >> >> >> > > > > > information
> > >> >> >> > > > > > > > in
> > >> >> >> > > > > > > > > > some
> > >> >> >> > > > > > > > > > > way to derive a new set of resource
> requirements
> > >> >> for this job.
> > >> >> >> > > > > > The
> > >> >> >> > > > > > > > > > > subsequent PATCH request with updated
> resource
> > >> >> requirements
> > >> >> >> > > > > would
> > >> >> >> > > > > > > > > include
> > >> >> >> > > > > > > > > > > the previously retrieved resource ID <ABC>.
> The
> > >> >> PATCH call
> > >> >> >> > > > > would
> > >> >> >> > > > > > > fail
> > >> >> >> > > > > > > > > if
> > >> >> >> > > > > > > > > > > there was a concurrent PATCH call in between
> > >> >> indicating to the
> > >> >> >> > > > > > > > external
> > >> >> >> > > > > > > > > > > system that the resource requirements were
> > >> >> concurrently
> > >> >> >> > > > > updated.
> > >> >> >> > > > > > > > > > > - How often do we allow resource
> requirements
> > >> to be
> > >> >> changed?
> > >> >> >> > > > > That
> > >> >> >> > > > > > > > > > question
> > >> >> >> > > > > > > > > > > might make my previous comment on the
> resource
> > >> ID
> > >> >> obsolete
> > >> >> >> > > > > > because
> > >> >> >> > > > > > > we
> > >> >> >> > > > > > > > > > could
> > >> >> >> > > > > > > > > > > just make any PATCH call fail if there was a
> > >> >> resource
> > >> >> >> > > > > requirement
> > >> >> >> > > > > > > > > update
> > >> >> >> > > > > > > > > > > within a certain time frame before the
> request.
> > >> But
> > >> >> such a time
> > >> >> >> > > > > > > > period
> > >> >> >> > > > > > > > > is
> > >> >> >> > > > > > > > > > > something we might want to make configurable
> > >> then,
> > >> >> I guess.
> > >> >> >> > > > > > > > > > > - Versioning the JobGraph in the
> JobGraphStore
> > >> >> rather than
> > >> >> >> > > > > > > > overwriting
> > >> >> >> > > > > > > > > it
> > >> >> >> > > > > > > > > > > might be an idea. This would enable us to
> > >> provide
> > >> >> resource
> > >> >> >> > > > > > > > requirement
> > >> >> >> > > > > > > > > > > changes in the UI or through the REST API.
> It is
> > >> >> related to a
> > >> >> >> > > > > > > problem
> > >> >> >> > > > > > > > > > > around keeping track of the exception
> history
> > >> >> within the
> > >> >> >> > > > > > > > > > AdaptiveScheduler
> > >> >> >> > > > > > > > > > > and also having to consider multiple
> versions
> > >> of a
> > >> >> JobGraph.
> > >> >> >> > > > > But
> > >> >> >> > > > > > > for
> > >> >> >> > > > > > > > > that
> > >> >> >> > > > > > > > > > > one, we use the ExecutionGraphInfoStore
> right
> > >> now.
> > >> >> >> > > > > > > > > > > - Updating the JobGraph in the JobGraphStore
> > >> makes
> > >> >> sense. I'm
> > >> >> >> > > > > > just
> > >> >> >> > > > > > > > > > > wondering whether we bundle two things
> together
> > >> >> that are
> > >> >> >> > > > > actually
> > >> >> >> > > > > > > > > > separate:
> > >> >> >> > > > > > > > > > > The business logic and the execution
> > >> configuration
> > >> >> (the
> > >> >> >> > > > > resource
> > >> >> >> > > > > > > > > > > requirements). I'm aware that this is not a
> > >> flaw of
> > >> >> the current
> > >> >> >> > > > > > > FLIP
> > >> >> >> > > > > > > > > but
> > >> >> >> > > > > > > > > > > rather something that was not necessary to
> > >> address
> > >> >> in the past
> > >> >> >> > > > > > > > because
> > >> >> >> > > > > > > > > > the
> > >> >> >> > > > > > > > > > > JobGraph was kind of static. I don't
> remember
> > >> >> whether that was
> > >> >> >> > > > > > > > already
> > >> >> >> > > > > > > > > > > discussed while working on the
> AdaptiveScheduler
> > >> >> for FLIP-160
> > >> >> >> > > > > > [1].
> > >> >> >> > > > > > > > > Maybe,
> > >> >> >> > > > > > > > > > > I'm missing some functionality here that
> > >> requires
> > >> >> us to have
> > >> >> >> > > > > > > > everything
> > >> >> >> > > > > > > > > > in
> > >> >> >> > > > > > > > > > > one place. But it feels like updating the
> entire
> > >> >> JobGraph which
> > >> >> >> > > > > > > could
> > >> >> >> > > > > > > > > be
> > >> >> >> > > > > > > > > > > actually a "config change" is not
> reasonable.
> > >> >> ...also
> > >> >> >> > > > > considering
> > >> >> >> > > > > > > the
> > >> >> >> > > > > > > > > > > amount of data that can be stored in a
> > >> >> ConfigMap/ZooKeeper node
> > >> >> >> > > > > > if
> > >> >> >> > > > > > > > > > > versioning the resource requirement change
> as
> > >> >> proposed in my
> > >> >> >> > > > > > > previous
> > >> >> >> > > > > > > > > > item
> > >> >> >> > > > > > > > > > > is an option for us.
> > >> >> >> > > > > > > > > > > - Updating the JobGraphStore means adding
> more
> > >> >> requests to the
> > >> >> >> > > > > HA
> > >> >> >> > > > > > > > > backend
> > >> >> >> > > > > > > > > > > API. There were some concerns shared in the
> > >> >> discussion thread
> > >> >> >> > > > > [2]
> > >> >> >> > > > > > > for
> > >> >> >> > > > > > > > > > > FLIP-270 [3] on pressuring the k8s API
> server in
> > >> >> the past with
> > >> >> >> > > > > > too
> > >> >> >> > > > > > > > many
> > >> >> >> > > > > > > > > > > calls. Eventhough, it's more likely to be
> > >> caused by
> > >> >> >> > > > > > checkpointing,
> > >> >> >> > > > > > > I
> > >> >> >> > > > > > > > > > still
> > >> >> >> > > > > > > > > > > wanted to bring it up. We're working on a
> > >> >> standardized
> > >> >> >> > > > > > performance
> > >> >> >> > > > > > > > test
> > >> >> >> > > > > > > > > > to
> > >> >> >> > > > > > > > > > > prepare going forward with FLIP-270 [3]
> right
> > >> now.
> > >> >> >> > > > > > > > > > >
> > >> >> >> > > > > > > > > > > Best,
> > >> >> >> > > > > > > > > > > Matthias
> > >> >> >> > > > > > > > > > >
> > >> >> >> > > > > > > > > > > [1]
> > >> >> >> > > > > > > > > > >
> > >> >> >> > > > > > > > > > >
> > >> >> >> > > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > >
> > >> >> >> > > > > >
> > >> >> >> > > > >
> > >> >>
> > >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> > >> >> >> > > > > > > > > > > [2]
> > >> >> >> > > > > > >
> > >> >> https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> > >> >> >> > > > > > > > > > > [3]
> > >> >> >> > > > > > > > > > >
> > >> >> >> > > > > > > > > > >
> > >> >> >> > > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > >
> > >> >> >> > > > > >
> > >> >> >> > > > >
> > >> >>
> > >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> > >> >> >> > > > > > > > > > >
> > >> >> >> > > > > > > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <
> > >> >> jam.gzczy@gmail.com
> > >> >> >> > > > > >
> > >> >> >> > > > > > > > wrote:
> > >> >> >> > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > Hi David:
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > Thank you for drive this flip, which helps
> > >> less
> > >> >> flink
> > >> >> >> > > > > shutdown
> > >> >> >> > > > > > > time
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > for this flip, I would like to make a few
> > >> idea on
> > >> >> share
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > >    - when the number of "slots" is
> > >> insufficient,
> > >> >> can we can
> > >> >> >> > > > > > stop
> > >> >> >> > > > > > > > > users
> > >> >> >> > > > > > > > > > > >    rescaling or throw something to tell
> user
> > >> >> "less avaliable
> > >> >> >> > > > > > > slots
> > >> >> >> > > > > > > > to
> > >> >> >> > > > > > > > > > > > upgrade,
> > >> >> >> > > > > > > > > > > >    please checkout your alivalbe slots" ?
> Or
> > >> we
> > >> >> could have a
> > >> >> >> > > > > > > > request
> > >> >> >> > > > > > > > > > > >    switch(true/false) to allow this
> behavior
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > >    - when user upgrade
> job-vertx-parallelism
> > >> . I
> > >> >> want to have
> > >> >> >> > > > > > an
> > >> >> >> > > > > > > > > > > interface
> > >> >> >> > > > > > > > > > > >    to query the current update parallel
> > >> execution
> > >> >> status, so
> > >> >> >> > > > > > that
> > >> >> >> > > > > > > > the
> > >> >> >> > > > > > > > > > > user
> > >> >> >> > > > > > > > > > > > or
> > >> >> >> > > > > > > > > > > >    program can understand the current
> status
> > >> >> >> > > > > > > > > > > >    - I want to have an interface to query
> the
> > >> >> current update
> > >> >> >> > > > > > > > > > parallelism
> > >> >> >> > > > > > > > > > > >    execution status. This also helps
> similar
> > >> to
> > >> >> *[1] Flink
> > >> >> >> > > > > K8S
> > >> >> >> > > > > > > > > > Operator*
> > >> >> >> > > > > > > > > > > >    management
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > {
> > >> >> >> > > > > > > > > > > >   status: Failed
> > >> >> >> > > > > > > > > > > >   reason: "less avaliable slots to
> upgrade,
> > >> >> please checkout
> > >> >> >> > > > > > your
> > >> >> >> > > > > > > > > > alivalbe
> > >> >> >> > > > > > > > > > > > slots"
> > >> >> >> > > > > > > > > > > > }
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > >    - *Pending*: this job now is join the
> > >> upgrade
> > >> >> queue,it
> > >> >> >> > > > > will
> > >> >> >> > > > > > be
> > >> >> >> > > > > > > > > > update
> > >> >> >> > > > > > > > > > > >    later
> > >> >> >> > > > > > > > > > > >    - *Rescaling*: job now is
> rescaling,wait it
> > >> >> finish
> > >> >> >> > > > > > > > > > > >    - *Finished*: finish do it
> > >> >> >> > > > > > > > > > > >    - *Failed* : something have wrong,so
> this
> > >> job
> > >> >> is not
> > >> >> >> > > > > > alivable
> > >> >> >> > > > > > > > > > upgrade
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > I want to supplement my above content in
> flip,
> > >> >> what do you
> > >> >> >> > > > > > think
> > >> >> >> > > > > > > ?
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > >    1.
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > >
> > >> >> >> > > > >
> > >> >>
> > >>
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > David Morávek <dm...@apache.org>
> 于2023年2月3日周五
> > >> >> 16:42写道:
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > > Hi everyone,
> > >> >> >> > > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > > This FLIP [1] introduces a new REST API
> for
> > >> >> declaring
> > >> >> >> > > > > > resource
> > >> >> >> > > > > > > > > > > > requirements
> > >> >> >> > > > > > > > > > > > > for the Adaptive Scheduler. There seems
> to
> > >> be a
> > >> >> clear need
> > >> >> >> > > > > > for
> > >> >> >> > > > > > > > this
> > >> >> >> > > > > > > > > > API
> > >> >> >> > > > > > > > > > > > > based on the discussion on the
> "Reworking
> > >> the
> > >> >> Rescale API"
> > >> >> >> > > > > > [2]
> > >> >> >> > > > > > > > > > thread.
> > >> >> >> > > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > > Before we get started, this work is
> heavily
> > >> >> based on the
> > >> >> >> > > > > > > > prototype
> > >> >> >> > > > > > > > > > [3]
> > >> >> >> > > > > > > > > > > > > created by Till Rohrmann, and the FLIP
> is
> > >> being
> > >> >> published
> > >> >> >> > > > > > with
> > >> >> >> > > > > > > > his
> > >> >> >> > > > > > > > > > > > consent.
> > >> >> >> > > > > > > > > > > > > Big shoutout to him!
> > >> >> >> > > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > > Last and not least, thanks to Chesnay
> and
> > >> Roman
> > >> >> for the
> > >> >> >> > > > > > initial
> > >> >> >> > > > > > > > > > reviews
> > >> >> >> > > > > > > > > > > > and
> > >> >> >> > > > > > > > > > > > > discussions.
> > >> >> >> > > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > > The best start would be watching a short
> > >> demo
> > >> >> [4] that I've
> > >> >> >> > > > > > > > > recorded,
> > >> >> >> > > > > > > > > > > > which
> > >> >> >> > > > > > > > > > > > > illustrates newly added capabilities
> > >> (rescaling
> > >> >> the running
> > >> >> >> > > > > > > job,
> > >> >> >> > > > > > > > > > > handing
> > >> >> >> > > > > > > > > > > > > back resources to the RM, and session
> > >> cluster
> > >> >> support).
> > >> >> >> > > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > > The intuition behind the FLIP is being
> able
> > >> to
> > >> >> define
> > >> >> >> > > > > > resource
> > >> >> >> > > > > > > > > > > > requirements
> > >> >> >> > > > > > > > > > > > > ("resource boundaries") externally that
> the
> > >> >> >> > > > > AdaptiveScheduler
> > >> >> >> > > > > > > can
> > >> >> >> > > > > > > > > > > > navigate
> > >> >> >> > > > > > > > > > > > > within. This is a building block for
> > >> >> higher-level efforts
> > >> >> >> > > > > > such
> > >> >> >> > > > > > > as
> > >> >> >> > > > > > > > > an
> > >> >> >> > > > > > > > > > > > > external Autoscaler. The natural
> extension
> > >> of
> > >> >> this work
> > >> >> >> > > > > would
> > >> >> >> > > > > > > be
> > >> >> >> > > > > > > > to
> > >> >> >> > > > > > > > > > > allow
> > >> >> >> > > > > > > > > > > > > to specify per-vertex ResourceProfiles.
> > >> >> >> > > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > > Looking forward to your thoughts; any
> > >> feedback
> > >> >> is
> > >> >> >> > > > > > appreciated!
> > >> >> >> > > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > > [1]
> > >> >> >> > > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > >
> > >> >> >> > > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > >
> > >> >> >> > > > > >
> > >> >> >> > > > >
> > >> >>
> > >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > >> >> >> > > > > > > > > > > > > [2]
> > >> >> >> > > > > > > > >
> > >> >> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > >> >> >> > > > > > > > > > > > > [3]
> > >> >> https://github.com/tillrohrmann/flink/tree/autoscaling
> > >> >> >> > > > > > > > > > > > > [4]
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > >
> > >> >>
> https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
> > >> >> >> > > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > > Best,
> > >> >> >> > > > > > > > > > > > > D.
> > >> >> >> > > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > --
> > >> >> >> > > > > > > > > > > > Best
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > > > ConradJam
> > >> >> >> > > > > > > > > > > >
> > >> >> >> > > > > > > > > > >
> > >> >> >> > > > > > > > > >
> > >> >> >> > > > > > > > >
> > >> >> >> > > > > > > >
> > >> >> >> > > > > > >
> > >> >> >> > > > > >
> > >> >> >> > > > >
> > >> >>
> > >>
> > >
>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by Maximilian Michels <mx...@apache.org>.
Hi David,

Thanks for the update! We are considering using the new declarative
resource API for autoscaling. Currently, we treat a scaling decision as a new
deployment which means surrendering all resources to Kubernetes and
subsequently reallocating them for the rescaled deployment. The
declarative resource management API is a great step forward because it
allows us to do faster and safer rescaling. Faster, because we can
continue to run while resources are pre-allocated, which minimizes
downtime. Safer, because we can't get stuck when the desired resources
are not available.

An example with two vertices and their respective parallelisms:
  v1: 50
  v2: 10
Let's assume slot sharing is disabled, so we need 60 task slots to run
the vertices.

If the autoscaler was to decide to scale up v1 and v2, it could do so
in a safe way by using min/max configuration:
  v1: [min: 50, max: 70]
  v2: [min: 10, max: 20]
This would then need 90 task slots to run at max capacity.
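The slot math above can be sketched as a toy calculation (illustrative
only, not Flink code; the vertex names and bounds come from the example):

```python
# With slot sharing disabled, the slots needed are simply the sum of
# per-vertex parallelisms.
def slots_needed(vertices):
    return sum(vertices.values())

current = {"v1": 50, "v2": 10}
upper = {"v1": 70, "v2": 20}

print(slots_needed(current))  # 60 slots at the current parallelisms
print(slots_needed(upper))    # 90 slots at max capacity
```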

I suppose we could further remove the min because it would always be
safer to scale down if resources are not available than to not run at
all [1]. In fact, I saw that the minimum bound is currently not used
in the code you posted above [2]. Is that still planned?
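To illustrate why pinning the min to 1 makes scaling safer, here is a toy
model of the bound semantics as I understand them (illustrative only, not
the actual AdaptiveScheduler implementation; assumes a single vertex and
no slot sharing):

```python
def effective_parallelism(lower, upper, available_slots):
    # The job can only run once the lower bound is satisfiable,
    # and it never exceeds the upper bound.
    if available_slots < lower:
        return None  # stay in (or fall back to) waiting-for-resources
    return min(upper, available_slots)

print(effective_parallelism(50, 70, 60))  # 60: runs, below the ceiling
print(effective_parallelism(50, 70, 90))  # 70: capped at the upper bound
print(effective_parallelism(50, 70, 40))  # None: blocked by the min bound
print(effective_parallelism(1, 70, 40))   # 40: a min of 1 keeps the job running
```

With a min of 1, the job degrades gracefully instead of refusing to run.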

-Max

PS: Note that originally we had assumed min == max but I think that
would be a less safe scaling approach because we would get stuck
waiting for resources when they are not available, e.g. when k8s
resource limits are reached.

[1] However, there might be costs involved with executing the
rescaling, e.g. for using external storage like s3, especially without
local recovery.
[2] https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9

On Tue, Feb 28, 2023 at 9:33 AM David Morávek <dm...@apache.org> wrote:
>
> Hi Everyone,
>
> We had some more talks about the pre-allocation of resources with @Max, and
> here is the final state that we've converged to for now:
>
> The vital thing to note about the new API is that it's declarative, meaning
> we're declaring the desired state to which we want our job to converge. If,
> after the requirements update, the job no longer holds the desired resources
> (fewer resources than the lower bound), it will be canceled and transition
> back into the waiting-for-resources state.
>
> In some use cases, you might always want to rescale to the upper bound
> (this goes along the lines of "preallocating resources" and minimizing the
> number of rescales, which is especially useful with the large state). This
> can be controlled by two knobs that already exist:
>
> 1) "jobmanager.adaptive-scheduler.min-parallelism-increase" - this affects
> a minimal parallelism increase step of a running job; we'll slightly change
> the semantics, and we'll trigger rescaling either once this condition is
> met or when you hit the ceiling; setting this to a high number will
> ensure that you always rescale to the upper bound
>
> 2) "jobmanager.adaptive-scheduler.resource-stabilization-timeout" - for new
> and already restarting jobs, we'll always respect this timeout, which
> allows you to wait for more resources even though you already have more
> resources than defined in the lower bound; again, in the case we reach the
> ceiling (the upper bound), we'll transition into the executing state.
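>
> For example, an operator that always wants to jump straight to the upper
> bound could set (illustrative values, not recommendations):
>
>   jobmanager.adaptive-scheduler.min-parallelism-increase: 1000
>   jobmanager.adaptive-scheduler.resource-stabilization-timeout: 60s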
>
>
> We're still planning to dig deeper in this direction with other efforts,
> but this is already good enough and should allow us to move the FLIP
> forward.
>
> WDYT? Unless there are any objections against the above, I'd like to
> proceed to a vote.
>
> Best,
> D.
>
> On Thu, Feb 23, 2023 at 5:39 PM David Morávek <dm...@apache.org> wrote:
>
> > Hi Everyone,
> >
> > @John
> >
> > This is a problem that we've spent some time trying to crack; in the end,
> > we've decided to go against doing any updates to the JobGraphStore from
> > JobMaster to avoid having multiple writers that are guarded by different
> > leader election locks (Dispatcher and JobMaster might live in a different
> > process). The contract we've decided to choose instead is leveraging the
> > idempotency of the endpoint and having the user of the API retry in case
> > we're unable to persist new requirements in the JobGraphStore [1]. We
> > eventually need to move JobGraphStore out of the dispatcher, but that's way
> > out of the scope of this FLIP. The solution is a deliberate trade-off. The
> > worst scenario is that the Dispatcher fails over in between retries, which
> > would simply rescale the job to meet the previous resource requirements
> > (more extended unavailability of underlying HA storage would have worse
> > consequences than this). Does that answer your question?
> >
> > @Matthias
> >
> > Good catch! I'm fixing it now, thanks!
> >
> > [1]
> > https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9#diff-a4b690fb2c4975d25b05eb4161617af0d704a85ff7b1cad19d3c817c12f1e29cR1151
> >
> > Best,
> > D.
> >
> > On Tue, Feb 21, 2023 at 12:24 AM John Roesler <vv...@apache.org> wrote:
> >
> >> Thanks for the FLIP, David!
> >>
> >> I just had one small question. IIUC, the REST API PUT request will go
> >> through the new DispatcherGateway method to be handled. Then, after
> >> validation, the dispatcher would call the new JobMasterGateway method to
> >> actually update the job.
> >>
> >> Which component will write the updated JobGraph? I just wanted to make
> >> sure it’s the JobMaster because if it were the dispatcher, there could be a
> >> race condition with the async JobMaster method.
> >>
> >> Thanks!
> >> -John
> >>
> >> On Mon, Feb 20, 2023, at 07:34, Matthias Pohl wrote:
> >> > Thanks for your clarifications, David. I don't have any additional major
> >> > points to add. One thing about the FLIP: The RPC layer API for updating
> >> the
> >> > JRR returns a future with a JRR? I don't see value in returning a JRR
> >> here
> >> > since it's an idempotent operation? Wouldn't it be enough to return
> >> > CompletableFuture<Void> here? Or am I missing something?
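> >>
> >> A sketch of the signature under discussion, assuming the operation only needs to acknowledge completion (the interface name and payload shape are assumptions, not the final Flink API):

```java
// Illustrative sketch of the RPC-layer method in question: an idempotent
// update that acknowledges completion via CompletableFuture<Void> instead of
// echoing the requirements back. Names and payload shape are assumptions.
import java.util.Map;
import java.util.concurrent.CompletableFuture;

public class GatewaySketch {

    interface JobMasterGatewayLike {
        // Void result: the operation is idempotent and the caller already
        // knows the payload it sent, so nothing needs to be returned.
        CompletableFuture<Void> updateJobResourceRequirements(
                Map<String, int[]> perVertexParallelismBounds);
    }

    public static void main(String[] args) {
        JobMasterGatewayLike gateway =
                bounds -> CompletableFuture.completedFuture(null);
        gateway.updateJobResourceRequirements(
                Map.of("vertex-1", new int[] {1, 4})).join();
        System.out.println("acknowledged");
    }
}
```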
> >> >
> >> > Matthias
> >> >
> >> > On Mon, Feb 20, 2023 at 1:48 PM Maximilian Michels <mx...@apache.org>
> >> wrote:
> >> >
> >> >> Thanks David! If we could get the pre-allocation working as part of
> >> >> the FLIP, that would be great.
> >> >>
> >> >> Concerning the downscale case, I agree this is a special case for the
> >> >> (single-job) application mode where we could re-allocate slots in a
> >> >> way that could leave entire task managers unoccupied which we would
> >> >> then be able to release. The goal essentially is to reduce slot
> >> >> fragmentation on scale down by packing the slots efficiently. The
> >> >> easiest way to add this optimization when running in application mode
> >> would be to drop as many task managers as possible during the restart
> >> while NUM_AVAILABLE_SLOTS >= NUM_REQUIRED_SLOTS stays true. We can look into
> >> >> this independently of the FLIP.
> >> >>
> >> >> Feel free to start the vote.
> >> >>
> >> >> -Max
> >> >>
> >> >> On Mon, Feb 20, 2023 at 9:10 AM David Morávek <dm...@apache.org> wrote:
> >> >> >
> >> >> > Hi everyone,
> >> >> >
> >> >> > Thanks for the feedback! I've updated the FLIP to use idempotent PUT
> >> API
> >> >> instead of PATCH and to properly handle lower bound settings, to
> >> support
> >> >> the "pre-allocation" of the resources.
> >> >> >
> >> >> > @Max
> >> >> >
> >> >> > > How hard would it be to address this issue in the FLIP?
> >> >> >
> >> >> > I've included this in the FLIP. It might not be too hard to implement
> >> >> this in the end.
> >> >> >
> >> >> > > B) drop as many superfluous task managers as needed
> >> >> >
> >> >> > I've intentionally left this part out for now because this ultimately
> >> >> needs to be the responsibility of the Resource Manager. After all, in
> >> the
> >> >> Session Cluster scenario, the Scheduler doesn't have the bigger
> >> picture of
> >> >> other tasks of other jobs running on those TMs. This will most likely
> >> be a
> >> >> topic for another FLIP.
> >> >> >
> >> >> > WDYT? If there are no other questions or concerns, I'd like to start
> >> the
> >> >> vote on Wednesday.
> >> >> >
> >> >> > Best,
> >> >> > D.
> >> >> >
> >> >> > On Wed, Feb 15, 2023 at 3:34 PM Maximilian Michels <mx...@apache.org>
> >> >> wrote:
> >> >> >>
> >> >> >> I missed that the FLIP states:
> >> >> >>
> >> >> >> > Currently, even though we’d expose the lower bound for clarity and
> >> >> API completeness, we won’t allow setting it to any other value than one
> >> >> until we have full support throughout the stack.
> >> >> >>
> >> >> >> How hard would it be to address this issue in the FLIP?
> >> >> >>
> >> >> >> There is not much value to offer setting a lower bound which won't
> >> be
> >> >> >> respected / throw an error when it is set. If we had support for a
> >> >> >> lower bound, we could enforce a resource contract externally via
> >> >> >> setting lowerBound == upperBound. That ties back to the Rescale API
> >> >> >> discussion we had. I want to better understand what the major
> >> concerns
> >> >> >> would be around allowing this.
> >> >> >>
> >> >> >> Just to outline how I imagine the logic to work:
> >> >> >>
> >> >> >> A) The resource constraints are already met => Nothing changes
> >> >> >> B) More resources available than required => Cancel the job, drop as
> >> >> >> many superfluous task managers as needed, restart the job
> >> >> >> C) Less resources available than required => Acquire new task
> >> >> >> managers, wait for them to register, cancel and restart the job
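> >> >> >>
> >> >> >> The three cases above can be sketched as a small decision function. This is a simplification that assumes a fixed slot requirement (lowerBound == upperBound); the enum and method names are illustrative:

```java
// Minimal sketch of the three-way rescale decision outlined above, assuming a
// fixed slot requirement (lowerBound == upperBound). Names are illustrative.
public class RescaleDecision {

    enum Action { NOTHING, RELEASE_AND_RESTART, ACQUIRE_AND_RESTART }

    static Action decide(int availableSlots, int requiredSlots) {
        if (availableSlots == requiredSlots) {
            return Action.NOTHING;             // case A: constraints already met
        }
        if (availableSlots > requiredSlots) {
            return Action.RELEASE_AND_RESTART; // case B: drop superfluous TMs
        }
        return Action.ACQUIRE_AND_RESTART;     // case C: wait for new TMs first
    }

    public static void main(String[] args) {
        System.out.println(decide(4, 4)); // NOTHING
        System.out.println(decide(6, 4)); // RELEASE_AND_RESTART
        System.out.println(decide(2, 4)); // ACQUIRE_AND_RESTART
    }
}
```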
> >> >> >>
> >> >> >> I'm open to helping out with the implementation.
> >> >> >>
> >> >> >> -Max
> >> >> >>
> >> >> >> On Mon, Feb 13, 2023 at 7:45 PM Maximilian Michels <mx...@apache.org>
> >> >> wrote:
> >> >> >> >
> >> >> >> > Based on further discussion I had with Chesnay on this PR [1], I
> >> think
> >> >> >> > jobs would currently go into a restarting state after the resource
> >> >> >> > requirements have changed. This wouldn't achieve what we had in
> >> mind,
> >> >> >> > i.e. sticking to the old resource requirements until enough slots
> >> are
> >> >> >> > available to fulfil the new resource requirements. So this may
> >> not be
> >> >> >> > 100% what we need but it could be extended to do what we want.
> >> >> >> >
> >> >> >> > -Max
> >> >> >> >
> >> >> >> > [1]
> >> https://github.com/apache/flink/pull/21908#discussion_r1104792362
> >> >> >> >
> >> >> >> > On Mon, Feb 13, 2023 at 7:16 PM Maximilian Michels <
> >> mxm@apache.org>
> >> >> wrote:
> >> >> >> > >
> >> >> >> > > Hi David,
> >> >> >> > >
> >> >> >> > > This is awesome! Great writeup and demo. This is pretty much
> >> what we
> >> >> >> > > need for the autoscaler as part of the Flink Kubernetes operator
> >> >> [1].
> >> >> >> > > Scaling Flink jobs effectively is hard but fortunately we have
> >> >> solved
> >> >> >> > > the issue as part of the Flink Kubernetes operator. The only
> >> >> critical
> >> >> >> > > piece we are missing is a better way to execute scaling
> >> decisions,
> >> >> as
> >> >> >> > > discussed in [2].
> >> >> >> > >
> >> >> >> > > Looking at your proposal, we would set lowerBound == upperBound
> >> for
> >> >> >> > > the parallelism because we want to fully determine the
> >> parallelism
> >> >> >> > > externally based on the scaling metrics. Does that sound right?
> >> >> >> > >
> >> >> >> > > What is the timeline for these changes? Is there a JIRA?
> >> >> >> > >
> >> >> >> > > Cheers,
> >> >> >> > > Max
> >> >> >> > >
> >> >> >> > > [1]
> >> >>
> >> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
> >> >> >> > > [2]
> >> >> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> >> >> >> > >
> >> >> >> > > On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu <
> >> xiangyu0xf@gmail.com>
> >> >> wrote:
> >> >> >> > > >
> >> >> >> > > > Hi David,
> >> >> >> > > >
> >> >> >> > > > Thanks for your reply. I think your response totally makes
> >> >> sense. This
> >> >> >> > > > FLIP targets declaring required resources to the ResourceManager
> >> >> instead of
> >> >> >> > > > using the ResourceManager to add/remove TMs directly.
> >> >> >> > > >
> >> >> >> > > > Best,
> >> >> >> > > > Xiangyu
> >> >> >> > > >
> >> >> >> > > >
> >> >> >> > > >
> >> >> >> > > > David Morávek <da...@gmail.com> wrote on Mon, Feb 13,
> >> 2023, 15:46:
> >> >> >> > > >
> >> >> >> > > > > Hi everyone,
> >> >> >> > > > >
> >> >> >> > > > > @Shammon
> >> >> >> > > > >
> >> >> >> > > > > I'm not entirely sure what "config file" you're referring
> >> to.
> >> >> You can, of
> >> >> >> > > > > course, override the default parallelism in
> >> "flink-conf.yaml",
> >> >> but for
> >> >> >> > > > > sinks and sources, the parallelism needs to be tweaked on
> >> the
> >> >> connector
> >> >> >> > > > > level ("WITH" statement).
> >> >> >> > > > >
> >> >> >> > > > > This is something that should be achieved with tooling
> >> around
> >> >> Flink. We
> >> >> >> > > > > want to provide an API on the lowest level that generalizes
> >> >> well. Achieving
> >> >> >> > > > > what you're describing should be straightforward with this
> >> API.
> >> >> >> > > > >
> >> >> >> > > > > @Xiangyu
> >> >> >> > > > >
> >> >> >> > > > > Is it possible for this REST API to declare TM resources in
> >> the
> >> >> future?
> >> >> >> > > > >
> >> >> >> > > > >
> >> >> >> > > > > Would you like to add/remove TMs if you use an active
> >> Resource
> >> >> Manager?
> >> >> >> > > > > This would be out of the scope of this effort since it
> >> targets
> >> >> the
> >> >> >> > > > > scheduler component only (we make no assumptions about the
> >> used
> >> >> Resource
> >> >> >> > > > > Manager). Also, the AdaptiveScheduler is only intended to be
> >> >> used for
> >> >> >> > > > > Streaming.
> >> >> >> > > > >
> >> >> >> > > > >  And for streaming jobs, I'm wondering if there is any
> >> >> situation we need to
> >> >> >> > > > > > rescale the TM resources of a flink cluster at first and
> >> then
> >> >> the
> >> >> >> > > > > adaptive
> >> >> >> > > > > > scheduler will rescale the per-vertex ResourceProfiles
> >> >> accordingly.
> >> >> >> > > > > >
> >> >> >> > > > >
> >> >> >> > > > > We plan on adding support for the ResourceProfiles (dynamic
> >> slot
> >> >> >> > > > > allocation) as the next step. Again we won't make any
> >> >> assumptions about the
> >> >> >> > > > > used Resource Manager. In other words, this effort ends by
> >> >> declaring
> >> >> >> > > > > desired resources to the Resource Manager.
> >> >> >> > > > >
> >> >> >> > > > > Does that make sense?
> >> >> >> > > > >
> >> >> >> > > > > @Matthias
> >> >> >> > > > >
> >> >> >> > > > > We've done another pass on the proposed API and currently
> >> lean
> >> >> towards
> >> >> >> > > > > having an idempotent PUT API.
> >> >> >> > > > > - We don't care too much about multiple writers' scenarios
> >> in
> >> >> terms of who
> >> >> >> > > > > can write an authoritative payload; this is up to the user
> >> of
> >> >> the API to
> >> >> >> > > > > figure out
> >> >> >> > > > > - It's indeed tricky to achieve atomicity with PATCH API;
> >> >> switching to PUT
> >> >> >> > > > > API seems to do the trick
> >> >> >> > > > > - We won't allow partial "payloads" anymore, meaning you
> >> need
> >> >> to define
> >> >> >> > > > > requirements for all vertices in the JobGraph; This is
> >> >> completely fine for
> >> >> >> > > > > the programmatic workflows. For DEBUG / DEMO purposes, you
> >> can
> >> >> use the GET
> >> >> >> > > > > endpoint and tweak the response to avoid writing the whole
> >> >> payload by hand.
> >> >> >> > > > >
> >> >> >> > > > > WDYT?
> >> >> >> > > > >
> >> >> >> > > > >
> >> >> >> > > > > Best,
> >> >> >> > > > > D.
> >> >> >> > > > >
> >> >> >> > > > > On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <
> >> >> xiangyu0xf@gmail.com>
> >> >> >> > > > > wrote:
> >> >> >> > > > >
> >> >> >> > > > > > Hi David,
> >> >> >> > > > > >
> >> >> >> > > > > > Thanks for creating this flip. I think this work it is
> >> very
> >> >> useful,
> >> >> >> > > > > > especially in autoscaling scenario.  I would like to share
> >> >> some questions
> >> >> >> > > > > > from my view.
> >> >> >> > > > > >
> >> >> >> > > > > > 1, Is it possible for this REST API to declare TM
> >> resources
> >> >> in the
> >> >> >> > > > > future?
> >> >> >> > > > > > I'm asking because we are building the autoscaling feature
> >> >> for Flink OLAP
> >> >> >> > > > > > Session Cluster in ByteDance. We need to rescale the
> >> >> cluster's resource
> >> >> >> > > > > on
> >> >> >> > > > > > TM level instead of Job level. It would be very helpful
> >> if we
> >> >> have a REST
> >> >> >> > > > > > API for our external Autoscaling service to use.
> >> >> >> > > > > >
> >> >> >> > > > > > 2, And for streaming jobs, I'm wondering if there is any
> >> >> situation we
> >> >> >> > > > > need
> >> >> >> > > > > > to rescale the TM resources of a flink cluster at first
> >> and
> >> >> then the
> >> >> >> > > > > > adaptive scheduler will rescale the per-vertex
> >> >> ResourceProfiles
> >> >> >> > > > > > accordingly.
> >> >> >> > > > > >
> >> >> >> > > > > > best.
> >> >> >> > > > > > Xiangyu
> >> >> >> > > > > >
> >> >> >> > > > > > Shammon FY <zj...@gmail.com> wrote on Thu, Feb 9, 2023, 11:31:
> >> >> >> > > > > >
> >> >> >> > > > > > > Hi David
> >> >> >> > > > > > >
> >> >> >> > > > > > > Thanks for your answer.
> >> >> >> > > > > > >
> >> >> >> > > > > > > > Can you elaborate more about how you'd intend to use
> >> the
> >> >> endpoint? I
> >> >> >> > > > > > > think we can ultimately introduce a way of re-declaring
> >> >> "per-vertex
> >> >> >> > > > > > > defaults," but I'd like to understand the use case a bit
> >> more
> >> >> first.
> >> >> >> > > > > > >
> >> >> >> > > > > > > For this issue, I mainly consider the consistency of
> >> user
> >> >> configuration
> >> >> >> > > > > > and
> >> >> >> > > > > > > job runtime. For sql jobs, users usually set specific
> >> >> parallelism for
> >> >> >> > > > > > > source and sink, and set a global parallelism for other
> >> >> operators.
> >> >> >> > > > > These
> >> >> >> > > > > > > config items are stored in a config file. For some
> >> >> high-priority jobs,
> >> >> >> > > > > > > users may want to manage them manually.
> >> >> >> > > > > > > 1. When users need to scale the parallelism, they should
> >> >> update the
> >> >> >> > > > > > config
> >> >> >> > > > > > > file and restart flink job, which may take a long time.
> >> >> >> > > > > > > 2. After providing the REST API, users can just send a
> >> >> request to the
> >> >> >> > > > > job
> >> >> >> > > > > > > via REST API quickly after updating the config file.
> >> >> >> > > > > > > The configuration in the running job and config file
> >> should
> >> >> be the
> >> >> >> > > > > same.
> >> >> >> > > > > > > What do you think of this?
> >> >> >> > > > > > >
> >> >> >> > > > > > > best.
> >> >> >> > > > > > > Shammon
> >> >> >> > > > > > >
> >> >> >> > > > > > >
> >> >> >> > > > > > >
> >> >> >> > > > > > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <
> >> >> david.moravek@gmail.com>
> >> >> >> > > > > > > wrote:
> >> >> >> > > > > > >
> >> >> >> > > > > > > > Hi everyone,
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Let's try to answer the questions one by one.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > *@ConradJam*
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > when the number of "slots" is insufficient, can we stop
> >> >> >> > > > > > > > users from rescaling, or throw something to tell the user
> >> >> >> > > > > > > > "not enough available slots to upgrade, please check your
> >> >> >> > > > > > > > available slots"?
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > The main property of AdaptiveScheduler is that it can
> >> >> adapt to
> >> >> >> > > > > > "available
> >> >> >> > > > > > > > resources," which means you're still able to make
> >> >> progress even
> >> >> >> > > > > though
> >> >> >> > > > > > > you
> >> >> >> > > > > > > > didn't get all the slots you've asked for. Let's break
> >> >> down the pros
> >> >> >> > > > > > and
> >> >> >> > > > > > > > cons of this property.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > - (plus) If you lose a TM for some reason, you can
> >> still
> >> >> recover even
> >> >> >> > > > > > if
> >> >> >> > > > > > > it
> >> >> >> > > > > > > > doesn't come back. We still need to give it some time
> >> to
> >> >> eliminate
> >> >> >> > > > > > > > unnecessary rescaling, which can be controlled by
> >> setting
> >> >> >> > > > > > > > "resource-stabilization-timeout."
> >> >> >> > > > > > > > - (plus) The resources can arrive with a significant
> >> >> delay. For
> >> >> >> > > > > > example,
> >> >> >> > > > > > > > you're unable to spawn enough TMs on time because
> >> you've
> >> >> run out of
> >> >> >> > > > > > > > resources in your k8s cluster, and you need to wait
> >> for
> >> >> the cluster
> >> >> >> > > > > > auto
> >> >> >> > > > > > > > scaler to kick in and add new nodes to the cluster. In
> >> >> this scenario,
> >> >> >> > > > > > > > you'll be able to start making progress faster, at the
> >> >> cost of
> >> >> >> > > > > multiple
> >> >> >> > > > > > > > rescalings (once the remaining resources arrive).
> >> >> >> > > > > > > > - (plus) This plays well with the declarative manner
> >> of
> >> >> today's
> >> >> >> > > > > > > > infrastructure. For example, you tell k8s that you
> >> need
> >> >> 10 TMs, and
> >> >> >> > > > > > > you'll
> >> >> >> > > > > > > > eventually get them.
> >> >> >> > > > > > > > - (minus) In the case of large state jobs, the cost of
> >> >> multiple
> >> >> >> > > > > > > rescalings
> >> >> >> > > > > > > > might outweigh the above.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > We've already touched on the solution to this problem
> >> on
> >> >> the FLIP.
> >> >> >> > > > > > Please
> >> >> >> > > > > > > > notice the parallelism knob being a range with a lower
> >> >> and upper
> >> >> >> > > > > bound.
> >> >> >> > > > > > > > Setting both the lower and upper bound to the same
> >> value
> >> >> could give
> >> >> >> > > > > the
> >> >> >> > > > > > > > behavior you're describing at the cost of giving up
> >> some
> >> >> properties
> >> >> >> > > > > > that
> >> >> >> > > > > > > AS
> >> >> >> > > > > > > > gives you (you'd be falling back to the
> >> >> DefaultScheduler's behavior).
> >> >> >> > > > > > > >
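> >> >> >> > > > > > > >
> >> >> >> > > > > > > > The range semantics can be illustrated with a tiny sketch.
> >> >> >> > > > > > > > The record and field names mirror the FLIP's wording but
> >> >> >> > > > > > > > are assumptions, not the final API: setting lowerBound ==
> >> >> >> > > > > > > > upperBound pins the parallelism, a wider range leaves the
> >> >> >> > > > > > > > scheduler room to adapt.

```java
// Sketch of per-vertex parallelism ranges: lower == upper pins the vertex,
// a wider range lets the AdaptiveScheduler adapt. Names are illustrative.
import java.util.Map;

public class RequirementsSketch {

    record ParallelismRange(int lowerBound, int upperBound) {
        boolean isFixed() {
            return lowerBound == upperBound;
        }
    }

    public static void main(String[] args) {
        Map<String, ParallelismRange> perVertex = Map.of(
                "source-vertex", new ParallelismRange(4, 4), // pinned
                "map-vertex", new ParallelismRange(1, 8));   // adaptive
        System.out.println(perVertex.get("source-vertex").isFixed()); // true
        System.out.println(perVertex.get("map-vertex").isFixed());    // false
    }
}
```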
> >> >> >> > > > > > > > when user upgrade job-vertx-parallelism . I want to
> >> have
> >> >> an interface
> >> >> >> > > > > > to
> >> >> >> > > > > > > > > query the current update parallel execution status,
> >> so
> >> >> that the
> >> >> >> > > > > user
> >> >> >> > > > > > or
> >> >> >> > > > > > > > > program can understand the current status
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > This is a misunderstanding. We're not introducing the
> >> >> RESCALE
> >> >> >> > > > > endpoint.
> >> >> >> > > > > > > > This endpoint allows you to re-declare the resources
> >> >> needed to run
> >> >> >> > > > > the
> >> >> >> > > > > > > job.
> >> >> >> > > > > > > > Once you reach the desired resources (you get more
> >> >> resources than the
> >> >> >> > > > > > > lower
> >> >> >> > > > > > > > bound defines), your job will run.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > We can expose a similar endpoint to "resource
> >> >> requirements" to give
> >> >> >> > > > > you
> >> >> >> > > > > > > an
> >> >> >> > > > > > > > overview of the resources the vertices already have.
> >> You
> >> >> can already
> >> >> >> > > > > > get
> >> >> >> > > > > > > > this from the REST API, so exposing this in yet
> >> another
> >> >> way should be
> >> >> >> > > > > > > > considered carefully.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > *@Matthias*
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > I'm wondering whether it makes sense to add some kind
> >> of
> >> >> resource ID
> >> >> >> > > > > to
> >> >> >> > > > > > > the
> >> >> >> > > > > > > > > REST API.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > That's a good question. I want to think about that and
> >> >> get back to
> >> >> >> > > > > the
> >> >> >> > > > > > > > question later. My main struggle when thinking about
> >> >> >> > > > > > > > this is: if this were an idempotent POST endpoint,
> >> >> >> > > > > > > > would it be any different?
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > How often do we allow resource requirements to be
> >> changed?
> >> >> >> > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > There shall be no rate limiting on the FLINK side. If
> >> >> this is
> >> >> >> > > > > something
> >> >> >> > > > > > > > your environment needs, you can achieve it on a
> >> different
> >> >> layer ("we
> >> >> >> > > > > > > can't
> >> >> >> > > > > > > > have Flink do everything").
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Versioning the JobGraph in the JobGraphStore rather
> >> than
> >> >> overwriting
> >> >> >> > > > > it
> >> >> >> > > > > > > > > might be an idea.
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > This sounds interesting since it would be closer to
> >> the
> >> >> JobGraph
> >> >> >> > > > > being
> >> >> >> > > > > > > > immutable. The main problem I see here is that this
> >> would
> >> >> introduce a
> >> >> >> > > > > > > > BW-incompatible change so it might be a topic for
> >> >> follow-up FLIP.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > I'm just wondering whether we bundle two things
> >> together
> >> >> that are
> >> >> >> > > > > > > actually
> >> >> >> > > > > > > > > separate
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Yup, this is how we think about it as well. The main
> >> >> question is,
> >> >> >> > > > > "who
> >> >> >> > > > > > > > should be responsible for bookkeeping 1) the JobGraph
> >> and
> >> >> 2) the
> >> >> >> > > > > > > > JobResourceRequirements". The JobMaster would be the
> >> >> right place for
> >> >> >> > > > > > > both,
> >> >> >> > > > > > > > but it's currently not the case, and we're tightly
> >> >> coupling the
> >> >> >> > > > > > > dispatcher
> >> >> >> > > > > > > > with the JobMaster.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Initially, we tried to introduce a separate HA
> >> component
> >> >> in JobMaster
> >> >> >> > > > > > for
> >> >> >> > > > > > > > bookkeeping the JobResourceRequirements, but that
> >> proved
> >> >> to be a more
> >> >> >> > > > > > > > significant effort adding additional mess to the
> >> already
> >> >> messy HA
> >> >> >> > > > > > > > ecosystem. Another approach we've discussed was
> >> mutating
> >> >> the JobGraph
> >> >> >> > > > > > and
> >> >> >> > > > > > > > setting JRR into the JobGraph structure itself.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > The middle ground for keeping this effort reasonably
> >> >> sized and not
> >> >> >> > > > > > > > violating "we want to keep JG immutable" too much is
> >> >> keeping the
> >> >> >> > > > > > > > JobResourceRequirements separate as an internal config
> >> >> option in
> >> >> >> > > > > > > JobGraph's
> >> >> >> > > > > > > > configuration.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > We ultimately need to rethink the tight coupling of
> >> >> Dispatcher and
> >> >> >> > > > > > > > JobMaster, but it needs to be a separate effort.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > ...also considering the amount of data that can be
> >> stored
> >> >> in a
> >> >> >> > > > > > > > > ConfigMap/ZooKeeper node if versioning the resource
> >> >> requirement
> >> >> >> > > > > > change
> >> >> >> > > > > > > as
> >> >> >> > > > > > > > > proposed in my previous item is an option for us.
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > AFAIK we're only storing pointers to the S3 objects
> >> in HA
> >> >> metadata,
> >> >> >> > > > > so
> >> >> >> > > > > > we
> >> >> >> > > > > > > > should be okay with having larger structures for now.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Updating the JobGraphStore means adding more requests
> >> to
> >> >> the HA
> >> >> >> > > > > backend
> >> >> >> > > > > > > > API.
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > It's fine unless you intend to override the resource
> >> >> requirements a
> >> >> >> > > > > few
> >> >> >> > > > > > > > times per second.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > *@Shammon*
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > How about adding some more information such as vertex
> >> type
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Since it was intended as a "debug" endpoint, it makes
> >> >> complete sense!
> >> >> >> > > > > > > >
> >> >> >> > > > > > > >  For sql jobs, we always use a unified parallelism for
> >> >> most vertices.
> >> >> >> > > > > > Can
> >> >> >> > > > > > > > > we provide them with a more convenient setting
> >> method
> >> >> instead of
> >> >> >> > > > > each
> >> >> >> > > > > > > > one?
> >> >> >> > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > I completely feel with this. The main thoughts when
> >> >> designing the API
> >> >> >> > > > > > > were:
> >> >> >> > > > > > > > - We want to keep it clean and easy to understand.
> >> >> >> > > > > > > > - Global parallelism can be modeled using per-vertex
> >> >> parallelism but
> >> >> >> > > > > > not
> >> >> >> > > > > > > > the other way around.
> >> >> >> > > > > > > > - The API will be used by external tooling (operator,
> >> >> auto scaler).
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Can you elaborate more about how you'd intend to use
> >> the
> >> >> endpoint? I
> >> >> >> > > > > > > think
> >> >> >> > > > > > > > we can ultimately introduce a way of re-declaring
> >> >> "per-vertex
> >> >> >> > > > > > defaults,"
> >> >> >> > > > > > > > but I'd like to understand the use case bit more
> >> first.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > *@Weijie*
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > What is the default value here (based on what
> >> >> configuration), or just
> >> >> >> > > > > > > > > infinite?
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Currently, for the lower bound, it's always one, and
> >> for
> >> >> the upper
> >> >> >> > > > > > bound,
> >> >> >> > > > > > > > it's either parallelism (if defined) or the
> >> >> maxParallelism of the
> >> >> >> > > > > > vertex
> >> >> >> > > > > > > in
> >> >> >> > > > > > > > JobGraph. This question might be another signal for
> >> >> making the
> >> >> >> > > > > defaults
> >> >> >> > > > > > > > explicit (see the answer to Shammon's question above).
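> >> >> >> > > > > > > >
> >> >> >> > > > > > > > The default-derivation rule just described can be
> >> >> >> > > > > > > > sketched as follows (the -1 sentinel for "parallelism
> >> >> >> > > > > > > > not set" and the method name are assumptions):

```java
// Sketch of the default bounds described above: lower bound is always 1; the
// upper bound is the vertex parallelism if defined, otherwise maxParallelism.
// The -1 sentinel for "not set" and the method name are assumptions.
public class DefaultBounds {

    static int[] defaultBounds(int parallelism, int maxParallelism) {
        int upper = parallelism > 0 ? parallelism : maxParallelism;
        return new int[] {1, upper};
    }

    public static void main(String[] args) {
        int[] withParallelism = defaultBounds(4, 128);
        int[] withoutParallelism = defaultBounds(-1, 128);
        System.out.println(withParallelism[0] + ".." + withParallelism[1]);       // 1..4
        System.out.println(withoutParallelism[0] + ".." + withoutParallelism[1]); // 1..128
    }
}
```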
> >> >> >> > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Thanks, everyone, for your initial thoughts!
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > Best,
> >> >> >> > > > > > > > D.
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <
> >> >> guoweijiereswqa@gmail.com
> >> >> >> > > > > >
> >> >> >> > > > > > > > wrote:
> >> >> >> > > > > > > >
> >> >> >> > > > > > > > > Thanks David for driving this. This is a very
> >> valuable
> >> >> work,
> >> >> >> > > > > > especially
> >> >> >> > > > > > > > for
> >> >> >> > > > > > > > > cloud native environment.
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > >> How about adding some more information such as
> >> >> vertex type
> >> >> >> > > > > > > > > (SOURCE/MAP/JOIN and .etc) in the response of `get
> >> jobs
> >> >> >> > > > > > > > > resource-requirements`? For users, only vertex-id
> >> may
> >> >> be difficult
> >> >> >> > > > > to
> >> >> >> > > > > > > > > understand.
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > +1 for this suggestion, including jobvertex's name
> >> in
> >> >> the response
> >> >> >> > > > > > body
> >> >> >> > > > > > > > is
> >> >> >> > > > > > > > > more
> >> >> >> > > > > > > > > user-friendly.
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > I saw this sentence in FLIP: "Setting the upper
> >> bound
> >> >> to -1 will
> >> >> >> > > > > > reset
> >> >> >> > > > > > > > the
> >> >> >> > > > > > > > > value to the default setting."  What is the default
> >> >> value here
> >> >> >> > > > > (based
> >> >> >> > > > > > > on
> >> >> >> > > > > > > > > what configuration), or just infinite?
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > Best regards,
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > Weijie
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > Shammon FY <zj...@gmail.com> wrote on Mon, Feb 6,
> >> 2023, 18:06:
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > > > > Hi David
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > > Thanks for initiating this discussion. I think
> >> >> declaring job
> >> >> >> > > > > > resource
> >> >> >> > > > > > > > > > requirements by REST API is very valuable. I just
> >> >> left some
> >> >> >> > > > > > comments
> >> >> >> > > > > > > as
> >> >> >> > > > > > > > > > followed
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > > 1) How about adding some more information such as
> >> >> vertex type
> >> >> >> > > > > > > > > > (SOURCE/MAP/JOIN and .etc) in the response of `get
> >> >> jobs
> >> >> >> > > > > > > > > > resource-requirements`? For users, only vertex-id
> >> may
> >> >> be
> >> >> >> > > > > difficult
> >> >> >> > > > > > to
> >> >> >> > > > > > > > > > understand.
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > > 2) For sql jobs, we always use a unified
> >> parallelism
> >> >> for most
> >> >> >> > > > > > > vertices.
> >> >> >> > > > > > > > > Can
> >> >> >> > > > > > > > > > we provide them with a more convenient setting
> >> method
> >> >> instead of
> >> >> >> > > > > > each
> >> >> >> > > > > > > > > one?
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > > Best,
> >> >> >> > > > > > > > > > Shammon
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <
> >> >> >> > > > > > matthias.pohl@aiven.io
> >> >> >> > > > > > > > > > .invalid>
> >> >> >> > > > > > > > > > wrote:
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > > > > Thanks David for creating this FLIP. It sounds
> >> >> promising and
> >> >> >> > > > > > useful
> >> >> >> > > > > > > > to
> >> >> >> > > > > > > > > > > have. Here are some thoughts from my side (some
> >> of
> >> >> them might
> >> >> >> > > > > be
> >> >> >> > > > > > > > > rather a
> >> >> >> > > > > > > > > > > follow-up and not necessarily part of this
> >> FLIP):
> >> >> >> > > > > > > > > > > - I'm wondering whether it makes sense to add
> >> some
> >> >> kind of
> >> >> >> > > > > > resource
> >> >> >> > > > > > > > ID
> >> >> >> > > > > > > > > to
> >> >> >> > > > > > > > > > > the REST API. This would give Flink a tool to
> >> >> verify the PATCH
> >> >> >> > > > > > > > request
> >> >> >> > > > > > > > > of
> >> >> >> > > > > > > > > > > the external system in a compare-and-set kind of
> >> >> manner. AFAIU,
> >> >> >> > > > > > the
> >> >> >> > > > > > > > > > process
> >> >> >> > > > > > > > > > > requires the external system to retrieve the
> >> >> resource
> >> >> >> > > > > > requirements
> >> >> >> > > > > > > > > first
> >> >> >> > > > > > > > > > > (to retrieve the vertex IDs). A resource ID
> >> <ABC>
> >> >> would be sent
> >> >> >> > > > > > > along
> >> >> >> > > > > > > > > as
> >> >> >> > > > > > > > > > a
> >> >> >> > > > > > > > > > > unique identifier for the provided setup. It's
> >> >> essentially the
> >> >> >> > > > > > > > version
> >> >> >> > > > > > > > > ID
> >> >> >> > > > > > > > > > > of the currently deployed resource requirement
> >> >> configuration.
> >> >> >> > > > > > Flink
> >> >> >> > > > > > > > > > doesn't
> >> >> >> > > > > > > > > > > know whether the external system would use the
> >> >> provided
> >> >> >> > > > > > information
> >> >> >> > > > > > > > in
> >> >> >> > > > > > > > > > some
> >> >> >> > > > > > > > > > > way to derive a new set of resource requirements
> >> >> for this job.
> >> >> >> > > > > > The
> >> >> >> > > > > > > > > > > subsequent PATCH request with updated resource
> >> >> requirements
> >> >> >> > > > > would
> >> >> >> > > > > > > > > include
> >> >> >> > > > > > > > > > > the previously retrieved resource ID <ABC>. The
> >> >> PATCH call
> >> >> >> > > > > would
> >> >> >> > > > > > > fail
> >> >> >> > > > > > > > > if
> >> >> >> > > > > > > > > > > there was a concurrent PATCH call in between
> >> >> indicating to the
> >> >> >> > > > > > > > external
> >> >> >> > > > > > > > > > > system that the resource requirements were
> >> >> concurrently
> >> >> >> > > > > updated.
> >> >> >> > > > > > > > > > > - How often do we allow resource requirements
> >> to be
> >> >> changed?
> >> >> >> > > > > That
> >> >> >> > > > > > > > > > question
> >> >> >> > > > > > > > > > > might make my previous comment on the resource
> >> ID
> >> >> obsolete
> >> >> >> > > > > > because
> >> >> >> > > > > > > we
> >> >> >> > > > > > > > > > could
> >> >> >> > > > > > > > > > > just make any PATCH call fail if there was a
> >> >> resource
> >> >> >> > > > > requirement
> >> >> >> > > > > > > > > update
> >> >> >> > > > > > > > > > > within a certain time frame before the request.
> >> But
> >> >> such a time
> >> >> >> > > > > > > > period
> >> >> >> > > > > > > > > is
> >> >> >> > > > > > > > > > > something we might want to make configurable
> >> then,
> >> >> I guess.
> >> >> >> > > > > > > > > > > - Versioning the JobGraph in the JobGraphStore
> >> >> rather than
> >> >> >> > > > > > > > overwriting
> >> >> >> > > > > > > > > it
> >> >> >> > > > > > > > > > > might be an idea. This would enable us to
> >> provide
> >> >> resource
> >> >> >> > > > > > > > requirement
> >> >> >> > > > > > > > > > > changes in the UI or through the REST API. It is
> >> >> related to a
> >> >> >> > > > > > > problem
> >> >> >> > > > > > > > > > > around keeping track of the exception history
> >> >> within the
> >> >> >> > > > > > > > > > AdaptiveScheduler
> >> >> >> > > > > > > > > > > and also having to consider multiple versions
> >> of a
> >> >> JobGraph.
> >> >> >> > > > > But
> >> >> >> > > > > > > for
> >> >> >> > > > > > > > > that
> >> >> >> > > > > > > > > > > one, we use the ExecutionGraphInfoStore right
> >> now.
> >> >> >> > > > > > > > > > > - Updating the JobGraph in the JobGraphStore
> >> makes
> >> >> sense. I'm
> >> >> >> > > > > > just
> >> >> >> > > > > > > > > > > wondering whether we bundle two things together
> >> >> that are
> >> >> >> > > > > actually
> >> >> >> > > > > > > > > > separate:
> >> >> >> > > > > > > > > > > The business logic and the execution
> >> configuration
> >> >> (the
> >> >> >> > > > > resource
> >> >> >> > > > > > > > > > > requirements). I'm aware that this is not a
> >> flaw of
> >> >> the current
> >> >> >> > > > > > > FLIP
> >> >> >> > > > > > > > > but
> >> >> >> > > > > > > > > > > rather something that was not necessary to
> >> address
> >> >> in the past
> >> >> >> > > > > > > > because
> >> >> >> > > > > > > > > > the
> >> >> >> > > > > > > > > > > JobGraph was kind of static. I don't remember
> >> >> whether that was
> >> >> >> > > > > > > > already
> >> >> >> > > > > > > > > > > discussed while working on the AdaptiveScheduler
> >> >> for FLIP-160
> >> >> >> > > > > > [1].
> >> >> >> > > > > > > > > Maybe,
> >> >> >> > > > > > > > > > > I'm missing some functionality here that
> >> requires
> >> >> us to have
> >> >> >> > > > > > > > everything
> >> >> >> > > > > > > > > > in
> >> >> >> > > > > > > > > > > one place. But it feels like updating the entire
> >> >> JobGraph which
> >> >> >> > > > > > > could
> >> >> >> > > > > > > > > be
> >> >> >> > > > > > > > > > > actually a "config change" is not reasonable.
> >> >> ...also
> >> >> >> > > > > considering
> >> >> >> > > > > > > the
> >> >> >> > > > > > > > > > > amount of data that can be stored in a
> >> >> ConfigMap/ZooKeeper node
> >> >> >> > > > > > if
> >> >> >> > > > > > > > > > > versioning the resource requirement change as
> >> >> proposed in my
> >> >> >> > > > > > > previous
> >> >> >> > > > > > > > > > item
> >> >> >> > > > > > > > > > > is an option for us.
> >> >> >> > > > > > > > > > > - Updating the JobGraphStore means adding more
> >> >> requests to the
> >> >> >> > > > > HA
> >> >> >> > > > > > > > > backend
> >> >> >> > > > > > > > > > > API. There were some concerns shared in the
> >> >> discussion thread
> >> >> >> > > > > [2]
> >> >> >> > > > > > > for
> >> >> >> > > > > > > > > > > FLIP-270 [3] on pressuring the k8s API server in
> >> >> the past with
> >> >> >> > > > > > too
> >> >> >> > > > > > > > many
> >> >> >> > > > > > > > > > > calls. Eventhough, it's more likely to be
> >> caused by
> >> >> >> > > > > > checkpointing,
> >> >> >> > > > > > > I
> >> >> >> > > > > > > > > > still
> >> >> >> > > > > > > > > > > wanted to bring it up. We're working on a
> >> >> standardized
> >> >> >> > > > > > performance
> >> >> >> > > > > > > > test
> >> >> >> > > > > > > > > > to
> >> >> >> > > > > > > > > > > prepare going forward with FLIP-270 [3] right
> >> now.
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > > Best,
> >> >> >> > > > > > > > > > > Matthias
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > > [1]
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > >
> >> >> >> > > > > >
> >> >> >> > > > >
> >> >>
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> >> >> >> > > > > > > > > > > [2]
> >> >> >> > > > > > >
> >> >> https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> >> >> >> > > > > > > > > > > [3]
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > >
> >> >> >> > > > > > > > >
> >> >> >> > > > > > > >
> >> >> >> > > > > > >
> >> >> >> > > > > >
> >> >> >> > > > >
> >> >>
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <
> >> >> jam.gzczy@gmail.com
> >> >> >> > > > > >
> >> >> >> > > > > > > > wrote:
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > > > > > Hi David:
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > Thank you for drive this flip, which helps
> >> less
> >> >> flink
> >> >> >> > > > > shutdown
> >> >> >> > > > > > > time
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > for this flip, I would like to make a few
> >> idea on
> >> >> share
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > >    - when the number of "slots" is
> >> insufficient,
> >> >> can we can
> >> >> >> > > > > > stop
> >> >> >> > > > > > > > > users
> >> >> >> > > > > > > > > > > >    rescaling or throw something to tell user
> >> >> "less avaliable
> >> >> >> > > > > > > slots
> >> >> >> > > > > > > > to
> >> >> >> > > > > > > > > > > > upgrade,
> >> >> >> > > > > > > > > > > >    please checkout your alivalbe slots" ? Or
> >> we
> >> >> could have a
> >> >> >> > > > > > > > request
> >> >> >> > > > > > > > > > > >    switch(true/false) to allow this behavior
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > >    - when user upgrade job-vertx-parallelism
> >> . I
> >> >> want to have
> >> >> >> > > > > > an
> >> >> >> > > > > > > > > > > interface
> >> >> >> > > > > > > > > > > >    to query the current update parallel
> >> execution
> >> >> status, so
> >> >> >> > > > > > that
> >> >> >> > > > > > > > the
> >> >> >> > > > > > > > > > > user
> >> >> >> > > > > > > > > > > > or
> >> >> >> > > > > > > > > > > >    program can understand the current status
> >> >> >> > > > > > > > > > > >    - I want to have an interface to query the
> >> >> current update
> >> >> >> > > > > > > > > > parallelism
> >> >> >> > > > > > > > > > > >    execution status. This also helps similar
> >> to
> >> >> *[1] Flink
> >> >> >> > > > > K8S
> >> >> >> > > > > > > > > > Operator*
> >> >> >> > > > > > > > > > > >    management
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > {
> >> >> >> > > > > > > > > > > >   status: Failed
> >> >> >> > > > > > > > > > > >   reason: "less avaliable slots to upgrade,
> >> >> please checkout
> >> >> >> > > > > > your
> >> >> >> > > > > > > > > > alivalbe
> >> >> >> > > > > > > > > > > > slots"
> >> >> >> > > > > > > > > > > > }
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > >    - *Pending*: this job now is join the
> >> upgrade
> >> >> queue,it
> >> >> >> > > > > will
> >> >> >> > > > > > be
> >> >> >> > > > > > > > > > update
> >> >> >> > > > > > > > > > > >    later
> >> >> >> > > > > > > > > > > >    - *Rescaling*: job now is rescaling,wait it
> >> >> finish
> >> >> >> > > > > > > > > > > >    - *Finished*: finish do it
> >> >> >> > > > > > > > > > > >    - *Failed* : something have wrong,so this
> >> job
> >> >> is not
> >> >> >> > > > > > alivable
> >> >> >> > > > > > > > > > upgrade
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > I want to supplement my above content in flip,
> >> >> what do you
> >> >> >> > > > > > think
> >> >> >> > > > > > > ?
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > >    1.
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > >
> >> >> >> > > > > > > > >
> >> >> >> > > > > > >
> >> >> >> > > > >
> >> >>
> >> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > --
> >> >> >> > > > > > > > > > > > Best
> >> >> >> > > > > > > > > > > >
> >> >> >> > > > > > > > > > > > ConradJam
> >> >> >> > > > > > > > > > > >

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by David Morávek <dm...@apache.org>.
Hi Everyone,

We had some more talks about the pre-allocation of resources with @Max, and
here is the final state that we've converged to for now:

The vital thing to note about the new API is that it's declarative: we declare the desired state to which we want the job to converge. If, after a requirements update, the job no longer holds the desired resources (i.e., it has fewer resources than the lower bound), it will be canceled and transition back into the waiting-for-resources state.
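As a rough sketch, an external system could declare new requirements through the proposed REST API along these lines; the endpoint path and JSON shape follow the FLIP draft and should be treated as assumptions rather than the final contract:

```python
import json
import urllib.request


def build_requirements_payload(vertices):
    """Build a declarative resource-requirements payload.

    ``vertices`` maps a JobVertexID to a (lowerBound, upperBound)
    parallelism range. The JSON shape is illustrative; consult the
    FLIP / generated REST docs for the authoritative schema.
    """
    return {
        vertex_id: {"parallelism": {"lowerBound": lower, "upperBound": upper}}
        for vertex_id, (lower, upper) in vertices.items()
    }


def put_requirements(rest_url, job_id, vertices):
    """Declare the desired requirements for a job via an idempotent PUT.

    The endpoint path is an assumption based on the FLIP draft.
    """
    payload = json.dumps(build_requirements_payload(vertices)).encode("utf-8")
    request = urllib.request.Request(
        url=f"{rest_url}/jobs/{job_id}/resource-requirements",
        data=payload,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(request)
```

Because the declaration is idempotent, a client can safely retry the PUT until it succeeds; repeating the same desired state is a no-op.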

In some use cases, you might always want to rescale to the upper bound (this goes along the lines of "pre-allocating resources" and minimizing the number of rescales, which is especially useful for jobs with large state). This can be controlled by two knobs that already exist:

1) "jobmanager.adaptive-scheduler.min-parallelism-increase" - this defines
the minimal parallelism-increase step for a running job; we'll slightly
change the semantics and trigger a rescale either once this condition is
met or once you hit the ceiling. Setting this to a high number ensures
that you always rescale to the upper bound

2) "jobmanager.adaptive-scheduler.resource-stabilization-timeout" - for new
and restarting jobs, we'll always respect this timeout, which allows you
to wait for more resources even though you already have more than the
lower bound requires; again, once we reach the ceiling (the upper bound),
we'll transition into the executing state.
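Putting the two knobs together, a configuration that always rescales to the upper bound (the "pre-allocation" behavior described above) could look roughly like this; the values are illustrative, not recommendations:

```yaml
# flink-conf.yaml (illustrative values)
jobmanager.scheduler: adaptive

# Only trigger a rescale once this many additional slots are available,
# or once the upper bound of the declared parallelism range is reachable.
# A very high value effectively means "always rescale to the upper bound".
jobmanager.adaptive-scheduler.min-parallelism-increase: 1000000

# How long a new or restarting job waits for more resources to arrive,
# even if it already has more than the lower bound requires.
jobmanager.adaptive-scheduler.resource-stabilization-timeout: 60s
```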


We're still planning to dig deeper in this direction with other efforts,
but this is already good enough and should allow us to move the FLIP
forward.

WDYT? Unless there are any objections to the above, I'd like to
proceed to a vote.

Best,
D.

On Thu, Feb 23, 2023 at 5:39 PM David Morávek <dm...@apache.org> wrote:

> Hi Everyone,
>
> @John
>
> This is a problem that we've spent some time trying to crack; in the end,
> we've decided to go against doing any upgrades to JobGraphStore from
> JobMaster to avoid having multiple writers that are guarded by different
> leader election lock (Dispatcher and JobMaster might live in a different
> process). The contract we've decided to choose instead is leveraging the
> idempotency of the endpoint and having the user of the API retry in case
> we're unable to persist new requirements in the JobGraphStore [1]. We
> eventually need to move JobGraphStore out of the dispatcher, but that's way
> out of the scope of this FLIP. The solution is a deliberate trade-off. The
> worst scenario is that the Dispatcher fails over in between retries, which
> would simply rescale the job to meet the previous resource requirements
> (more extended unavailability of underlying HA storage would have worse
> consequences than this). Does that answer your question?
>
> @Matthias
>
> Good catch! I'm fixing it now, thanks!
>
> [1]
> https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9#diff-a4b690fb2c4975d25b05eb4161617af0d704a85ff7b1cad19d3c817c12f1e29cR1151
>
> Best,
> D.
>
> On Tue, Feb 21, 2023 at 12:24 AM John Roesler <vv...@apache.org> wrote:
>
>> Thanks for the FLIP, David!
>>
>> I just had one small question. IIUC, the REST API PUT request will go
>> through the new DispatcherGateway method to be handled. Then, after
>> validation, the dispatcher would call the new JobMasterGateway method to
>> actually update the job.
>>
>> Which component will write the updated JobGraph? I just wanted to make
> >> sure it’s the JobMaster because if it were the dispatcher, there could be a
>> race condition with the async JobMaster method.
>>
>> Thanks!
>> -John
>>
>> On Mon, Feb 20, 2023, at 07:34, Matthias Pohl wrote:
>> > Thanks for your clarifications, David. I don't have any additional major
>> > points to add. One thing about the FLIP: The RPC layer API for updating
>> the
>> > JRR returns a future with a JRR? I don't see value in returning a JRR
>> here
>> > since it's an idempotent operation? Wouldn't it be enough to return
>> > CompletableFuture<Void> here? Or am I missing something?
>> >
>> > Matthias
>> >
>> > On Mon, Feb 20, 2023 at 1:48 PM Maximilian Michels <mx...@apache.org>
>> wrote:
>> >
>> >> Thanks David! If we could get the pre-allocation working as part of
>> >> the FLIP, that would be great.
>> >>
>> >> Concerning the downscale case, I agree this is a special case for the
>> >> (single-job) application mode where we could re-allocate slots in a
>> >> way that could leave entire task managers unoccupied which we would
>> >> then be able to release. The goal essentially is to reduce slot
>> >> fragmentation on scale down by packing the slots efficiently. The
>> >> easiest way to add this optimization when running in application mode
>> >> would be to drop as many task managers during the restart such that
>> >> NUM_REQUIRED_SLOTS >= NUM_AVAILABLE_SLOTS stays true. We can look into
>> >> this independently of the FLIP.
>> >>
>> >> Feel free to start the vote.
>> >>
>> >> -Max
>> >>
>> >> On Mon, Feb 20, 2023 at 9:10 AM David Morávek <dm...@apache.org> wrote:
>> >> >
>> >> > Hi everyone,
>> >> >
> >> >> > Thanks for the feedback! I've updated the FLIP to use an idempotent PUT
>> API
>> >> instead of PATCH and to properly handle lower bound settings, to
>> support
>> >> the "pre-allocation" of the resources.
>> >> >
>> >> > @Max
>> >> >
>> >> > > How hard would it be to address this issue in the FLIP?
>> >> >
>> >> > I've included this in the FLIP. It might not be too hard to implement
>> >> this in the end.
>> >> >
>> >> > > B) drop as many superfluous task managers as needed
>> >> >
>> >> > I've intentionally left this part out for now because this ultimately
>> >> needs to be the responsibility of the Resource Manager. After all, in
>> the
>> >> Session Cluster scenario, the Scheduler doesn't have the bigger
>> picture of
>> >> other tasks of other jobs running on those TMs. This will most likely
>> be a
>> >> topic for another FLIP.
>> >> >
>> >> > WDYT? If there are no other questions or concerns, I'd like to start
>> the
>> >> vote on Wednesday.
>> >> >
>> >> > Best,
>> >> > D.
>> >> >
>> >> > On Wed, Feb 15, 2023 at 3:34 PM Maximilian Michels <mx...@apache.org>
>> >> wrote:
>> >> >>
>> >> >> I missed that the FLIP states:
>> >> >>
>> >> >> > Currently, even though we’d expose the lower bound for clarity and
>> >> API completeness, we won’t allow setting it to any other value than one
>> >> until we have full support throughout the stack.
>> >> >>
>> >> >> How hard would it be to address this issue in the FLIP?
>> >> >>
> >> >> >> There is not much value in offering a lower-bound setting that won't be
> >> >> >> respected / throws an error when it is set. If we had support for a
>> >> >> lower bound, we could enforce a resource contract externally via
>> >> >> setting lowerBound == upperBound. That ties back to the Rescale API
>> >> >> discussion we had. I want to better understand what the major
>> concerns
>> >> >> would be around allowing this.
>> >> >>
>> >> >> Just to outline how I imagine the logic to work:
>> >> >>
>> >> >> A) The resource constraints are already met => Nothing changes
>> >> >> B) More resources available than required => Cancel the job, drop as
>> >> >> many superfluous task managers as needed, restart the job
>> >> >> C) Less resources available than required => Acquire new task
>> >> >> managers, wait for them to register, cancel and restart the job
>> >> >>
>> >> >> I'm open to helping out with the implementation.
>> >> >>
>> >> >> -Max
>> >> >>
>> >> >> On Mon, Feb 13, 2023 at 7:45 PM Maximilian Michels <mx...@apache.org>
>> >> wrote:
>> >> >> >
>> >> >> > Based on further discussion I had with Chesnay on this PR [1], I
>> think
>> >> >> > jobs would currently go into a restarting state after the resource
>> >> >> > requirements have changed. This wouldn't achieve what we had in
>> mind,
>> >> >> > i.e. sticking to the old resource requirements until enough slots
>> are
>> >> >> > available to fulfil the new resource requirements. So this may
>> not be
>> >> >> > 100% what we need but it could be extended to do what we want.
>> >> >> >
>> >> >> > -Max
>> >> >> >
>> >> >> > [1]
>> https://github.com/apache/flink/pull/21908#discussion_r1104792362
>> >> >> >
>> >> >> > On Mon, Feb 13, 2023 at 7:16 PM Maximilian Michels <
>> mxm@apache.org>
>> >> wrote:
>> >> >> > >
>> >> >> > > Hi David,
>> >> >> > >
>> >> >> > > This is awesome! Great writeup and demo. This is pretty much
>> what we
>> >> >> > > need for the autoscaler as part of the Flink Kubernetes operator
>> >> [1].
>> >> >> > > Scaling Flink jobs effectively is hard but fortunately we have
>> >> solved
>> >> >> > > the issue as part of the Flink Kubernetes operator. The only
>> >> critical
>> >> >> > > piece we are missing is a better way to execute scaling
>> decisions,
>> >> as
>> >> >> > > discussed in [2].
>> >> >> > >
>> >> >> > > Looking at your proposal, we would set lowerBound == upperBound
>> for
>> >> >> > > the parallelism because we want to fully determine the
>> parallelism
>> >> >> > > externally based on the scaling metrics. Does that sound right?
>> >> >> > >
>> >> >> > > What is the timeline for these changes? Is there a JIRA?
>> >> >> > >
>> >> >> > > Cheers,
>> >> >> > > Max
>> >> >> > >
>> >> >> > > [1]
>> >>
>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
>> >> >> > > [2]
>> >> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
>> >> >> > >
>> >> >> > > On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu <
>> xiangyu0xf@gmail.com>
>> >> wrote:
>> >> >> > > >
>> >> >> > > > Hi David,
>> >> >> > > >
> >> >> >> > > > Thanks for your reply.  I think your response totally makes
>> >> sense.  This
>> >> >> > > > flip targets on declaring required resource to ResourceManager
>> >> instead of
>> >> >> > > > using  ResourceManager to add/remove TMs directly.
>> >> >> > > >
>> >> >> > > > Best,
>> >> >> > > > Xiangyu
>> >> >> > > >
>> >> >> > > >
>> >> >> > > >
> >> >> >> > > > On Mon, Feb 13, 2023 at 15:46, David Morávek <da...@gmail.com>
> >> >> >> > > > wrote:
>> >> >> > > >
>> >> >> > > > > Hi everyone,
>> >> >> > > > >
>> >> >> > > > > @Shammon
>> >> >> > > > >
>> >> >> > > > > I'm not entirely sure what "config file" you're referring
>> to.
>> >> You can, of
>> >> >> > > > > course, override the default parallelism in
>> "flink-conf.yaml",
>> >> but for
>> >> >> > > > > sinks and sources, the parallelism needs to be tweaked on
>> the
>> >> connector
>> >> >> > > > > level ("WITH" statement).
>> >> >> > > > >
>> >> >> > > > > This is something that should be achieved with tooling
>> around
>> >> Flink. We
>> >> >> > > > > want to provide an API on the lowest level that generalizes
>> >> well. Achieving
>> >> >> > > > > what you're describing should be straightforward with this
>> API.
>> >> >> > > > >
>> >> >> > > > > @Xiangyu
>> >> >> > > > >
>> >> >> > > > > Is it possible for this REST API to declare TM resources in
>> the
>> >> future?
>> >> >> > > > >
>> >> >> > > > >
>> >> >> > > > > Would you like to add/remove TMs if you use an active
>> Resource
>> >> Manager?
>> >> >> > > > > This would be out of the scope of this effort since it
>> targets
>> >> the
>> >> >> > > > > scheduler component only (we make no assumptions about the
>> used
>> >> Resource
>> >> >> > > > > Manager). Also, the AdaptiveScheduler is only intended to be
>> >> used for
>> >> >> > > > > Streaming.
>> >> >> > > > >
>> >> >> > > > >  And for streaming jobs, I'm wondering if there is any
>> >> situation we need to
>> >> >> > > > > > rescale the TM resources of a flink cluster at first and
>> then
>> >> the
>> >> >> > > > > adaptive
>> >> >> > > > > > scheduler will rescale the per-vertex ResourceProfiles
>> >> accordingly.
>> >> >> > > > > >
>> >> >> > > > >
>> >> >> > > > > We plan on adding support for the ResourceProfiles (dynamic
>> slot
>> >> >> > > > > allocation) as the next step. Again we won't make any
>> >> assumptions about the
>> >> >> > > > > used Resource Manager. In other words, this effort ends by
>> >> declaring
>> >> >> > > > > desired resources to the Resource Manager.
>> >> >> > > > >
>> >> >> > > > > Does that make sense?
>> >> >> > > > >
>> >> >> > > > > @Matthias
>> >> >> > > > >
>> >> >> > > > > We've done another pass on the proposed API and currently
>> lean
>> >> towards
>> >> >> > > > > having an idempotent PUT API.
>> >> >> > > > > - We don't care too much about multiple writers' scenarios
>> in
>> >> terms of who
>> >> >> > > > > can write an authoritative payload; this is up to the user
>> of
>> >> the API to
>> >> >> > > > > figure out
>> >> >> > > > > - It's indeed tricky to achieve atomicity with PATCH API;
>> >> switching to PUT
>> >> >> > > > > API seems to do the trick
>> >> >> > > > > - We won't allow partial "payloads" anymore, meaning you
>> need
>> >> to define
>> >> >> > > > > requirements for all vertices in the JobGraph; This is
>> >> completely fine for
>> >> >> > > > > the programmatic workflows. For DEBUG / DEMO purposes, you
>> can
>> >> use the GET
>> >> >> > > > > endpoint and tweak the response to avoid writing the whole
>> >> payload by hand.
>> >> >> > > > >
>> >> >> > > > > WDYT?
>> >> >> > > > >
>> >> >> > > > >
>> >> >> > > > > Best,
>> >> >> > > > > D.
>> >> >> > > > >
>> >> >> > > > > On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <
>> >> xiangyu0xf@gmail.com>
>> >> >> > > > > wrote:
>> >> >> > > > >
>> >> >> > > > > > Hi David,
>> >> >> > > > > >
>> >> >> > > > > > Thanks for creating this flip. I think this work it is
>> very
>> >> useful,
>> >> >> > > > > > especially in autoscaling scenario.  I would like to share
>> >> some questions
>> >> >> > > > > > from my view.
>> >> >> > > > > >
>> >> >> > > > > > 1, Is it possible for this REST API to declare TM
>> resources
>> >> in the
>> >> >> > > > > future?
>> >> >> > > > > > I'm asking because we are building the autoscaling feature
>> >> for Flink OLAP
>> >> >> > > > > > Session Cluster in ByteDance. We need to rescale the
>> >> cluster's resource
>> >> >> > > > > on
>> >> >> > > > > > TM level instead of Job level. It would be very helpful
>> if we
>> >> have a REST
> >> >> >> > > > > > API for our external Autoscaling service to use.
>> >> >> > > > > >
>> >> >> > > > > > 2, And for streaming jobs, I'm wondering if there is any
>> >> situation we
>> >> >> > > > > need
>> >> >> > > > > > to rescale the TM resources of a flink cluster at first
>> and
>> >> then the
>> >> >> > > > > > adaptive scheduler will rescale the per-vertex
>> >> ResourceProfiles
>> >> >> > > > > > accordingly.
>> >> >> > > > > >
>> >> >> > > > > > best.
>> >> >> > > > > > Xiangyu
>> >> >> > > > > >
> >> >> >> > > > > > On Thu, Feb 9, 2023 at 11:31, Shammon FY <zj...@gmail.com> wrote:
>> >> >> > > > > >
>> >> >> > > > > > > Hi David
>> >> >> > > > > > >
>> >> >> > > > > > > Thanks for your answer.
>> >> >> > > > > > >
>> >> >> > > > > > > > Can you elaborate more about how you'd intend to use
>> the
>> >> endpoint? I
>> >> >> > > > > > > think we can ultimately introduce a way of re-declaring
>> >> "per-vertex
>> >> >> > > > > > > defaults," but I'd like to understand the use case bit
>> more
>> >> first.
>> >> >> > > > > > >
>> >> >> > > > > > > For this issue, I mainly consider the consistency of
>> user
>> >> configuration
>> >> >> > > > > > and
>> >> >> > > > > > > job runtime. For sql jobs, users usually set specific
>> >> parallelism for
>> >> >> > > > > > > source and sink, and set a global parallelism for other
>> >> operators.
>> >> >> > > > > These
>> >> >> > > > > > > config items are stored in a config file. For some
>> >> high-priority jobs,
>> >> >> > > > > > > users may want to manage them manually.
>> >> >> > > > > > > 1. When users need to scale the parallelism, they should
>> >> update the
>> >> >> > > > > > config
>> >> >> > > > > > > file and restart flink job, which may take a long time.
>> >> >> > > > > > > 2. After providing the REST API, users can just send a
>> >> request to the
>> >> >> > > > > job
>> >> >> > > > > > > via REST API quickly after updating the config file.
>> >> >> > > > > > > The configuration in the running job and config file
>> should
>> >> be the
>> >> >> > > > > same.
>> >> >> > > > > > > What do you think of this?
>> >> >> > > > > > >
>> >> >> > > > > > > best.
>> >> >> > > > > > > Shammon
>> >> >> > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <
>> >> david.moravek@gmail.com>
>> >> >> > > > > > > wrote:
>> >> >> > > > > > >
>> >> >> > > > > > > > Hi everyone,
>> >> >> > > > > > > >
>> >> >> > > > > > > > Let's try to answer the questions one by one.
>> >> >> > > > > > > >
>> >> >> > > > > > > > *@ConradJam*
>> >> >> > > > > > > >
>> >> >> > > > > > > > when the number of "slots" is insufficient, can we can
>> >> stop users
>> >> >> > > > > > > rescaling
>> >> >> > > > > > > > > or throw something to tell user "less avaliable
>> slots
>> >> to upgrade,
>> >> >> > > > > > > please
>> >> >> > > > > > > > > checkout your alivalbe slots" ?
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > The main property of AdaptiveScheduler is that it can
>> >> adapt to
>> >> >> > > > > > "available
>> >> >> > > > > > > > resources," which means you're still able to make
>> >> progress even
>> >> >> > > > > though
>> >> >> > > > > > > you
>> >> >> > > > > > > > didn't get all the slots you've asked for. Let's break
>> >> down the pros
>> >> >> > > > > > and
>> >> >> > > > > > > > cons of this property.
>> >> >> > > > > > > >
>> >> >> > > > > > > > - (plus) If you lose a TM for some reason, you can
>> still
>> >> recover even
>> >> >> > > > > > if
>> >> >> > > > > > > it
>> >> >> > > > > > > > doesn't come back. We still need to give it some time
>> to
>> >> eliminate
>> >> >> > > > > > > > unnecessary rescaling, which can be controlled by
>> setting
>> >> >> > > > > > > > "resource-stabilization-timeout."
>> >> >> > > > > > > > - (plus) The resources can arrive with a significant
>> >> delay. For
>> >> >> > > > > > example,
>> >> >> > > > > > > > you're unable to spawn enough TMs on time because
>> you've
>> >> run out of
>> >> >> > > > > > > > resources in your k8s cluster, and you need to wait
>> for
>> >> the cluster
>> >> >> > > > > > auto
>> >> >> > > > > > > > scaler to kick in and add new nodes to the cluster. In
>> >> this scenario,
>> >> >> > > > > > > > you'll be able to start making progress faster, at the
>> >> cost of
>> >> >> > > > > multiple
>> >> >> > > > > > > > rescalings (once the remaining resources arrive).
>> >> >> > > > > > > > - (plus) This plays well with the declarative manner
>> of
>> >> today's
>> >> >> > > > > > > > infrastructure. For example, you tell k8s that you
>> need
>> >> 10 TMs, and
>> >> >> > > > > > > you'll
>> >> >> > > > > > > > eventually get them.
>> >> >> > > > > > > > - (minus) In the case of large state jobs, the cost of
>> >> multiple
>> >> >> > > > > > > rescalings
>> >> >> > > > > > > > might outweigh the above.
>> >> >> > > > > > > >
>> >> >> > > > > > > > We've already touched on the solution to this problem
>> on
>> >> the FLIP.
>> >> >> > > > > > Please
>> >> >> > > > > > > > notice the parallelism knob being a range with a lower
>> >> and upper
>> >> >> > > > > bound.
>> >> >> > > > > > > > Setting both the lower and upper bound to the same
>> value
>> >> could give
>> >> >> > > > > the
>> >> >> > > > > > > > behavior you're describing at the cost of giving up
>> some
>> >> properties
>> >> >> > > > > > that
>> >> >> > > > > > > AS
>> >> >> > > > > > > > gives you (you'd be falling back to the
>> >> DefaultScheduler's behavior).
>> >> >> > > > > > > >
>> >> >> > > > > > > > when user upgrade job-vertx-parallelism . I want to
>> have
>> >> an interface
>> >> >> > > > > > to
>> >> >> > > > > > > > > query the current update parallel execution status,
>> so
>> >> that the
>> >> >> > > > > user
>> >> >> > > > > > or
>> >> >> > > > > > > > > program can understand the current status
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > This is a misunderstanding. We're not introducing the
>> >> RESCALE
>> >> >> > > > > endpoint.
>> >> >> > > > > > > > This endpoint allows you to re-declare the resources
>> >> needed to run
>> >> >> > > > > the
>> >> >> > > > > > > job.
>> >> >> > > > > > > > Once you reach the desired resources (you get more
>> >> resources than the
>> >> >> > > > > > > lower
>> >> >> > > > > > > > bound defines), your job will run.
>> >> >> > > > > > > >
>> >> >> > > > > > > > We can expose a similar endpoint to "resource
>> >> requirements" to give
>> >> >> > > > > you
>> >> >> > > > > > > an
>> >> >> > > > > > > > overview of the resources the vertices already have.
>> You
>> >> can already
>> >> >> > > > > > get
>> >> >> > > > > > > > this from the REST API, so exposing this in yet
>> another
>> >> way should be
>> >> >> > > > > > > > considered carefully.
>> >> >> > > > > > > >
>> >> >> > > > > > > > *@Matthias*
>> >> >> > > > > > > >
>> >> >> > > > > > > > I'm wondering whether it makes sense to add some kind
>> of
>> >> resource ID
>> >> >> > > > > to
>> >> >> > > > > > > the
>> >> >> > > > > > > > > REST API.
>> >> >> > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > That's a good question. I want to think about that and
>> >> get back to
>> >> >> > > > > the
>> >> >> > > > > > > > question later. My main struggle when thinking about
>> this
>> >> is, "if
>> >> >> > > > > this
>> >> >> > > > > > > > would be an idempotent POST endpoint," would it be any
>> >> different?
>> >> >> > > > > > > >
>> >> >> > > > > > > > How often do we allow resource requirements to be
>> changed?
>> >> >> > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > There shall be no rate limiting on the FLINK side. If
>> >> this is
>> >> >> > > > > something
>> >> >> > > > > > > > your environment needs, you can achieve it on a
>> different
>> >> layer ("we
>> >> >> > > > > > > can't
>> >> >> > > > > > > > have FLINK to do everything").
>> >> >> > > > > > > >
>> >> >> > > > > > > > Versioning the JobGraph in the JobGraphStore rather
>> than
>> >> overwriting
>> >> >> > > > > it
>> >> >> > > > > > > > > might be an idea.
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > This sounds interesting since it would be closer to
>> the
>> >> JobGraph
>> >> >> > > > > being
>> >> >> > > > > > > > immutable. The main problem I see here is that this
>> would
>> >> introduce a
>> >> >> > > > > > > > BW-incompatible change so it might be a topic for
>> >> follow-up FLIP.
>> >> >> > > > > > > >
>> >> >> > > > > > > > I'm just wondering whether we bundle two things
>> together
>> >> that are
>> >> >> > > > > > > actually
>> >> >> > > > > > > > > separate
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > Yup, this is how we think about it as well. The main
>> >> question is,
>> >> >> > > > > "who
>> >> >> > > > > > > > should be responsible for bookkeeping 1) the JobGraph
>> and
>> >> 2) the
>> >> >> > > > > > > > JobResourceRequirements". The JobMaster would be the
>> >> right place for
>> >> >> > > > > > > both,
>> >> >> > > > > > > > but it's currently not the case, and we're tightly
>> >> coupling the
>> >> >> > > > > > > dispatcher
>> >> >> > > > > > > > with the JobMaster.
>> >> >> > > > > > > >
>> >> >> > > > > > > > Initially, we tried to introduce a separate HA
>> component
>> >> in JobMaster
>> >> >> > > > > > for
>> >> >> > > > > > > > bookkeeping the JobResourceRequirements, but that
>> proved
>> >> to be a more
>> >> >> > > > > > > > significant effort adding additional mess to the
>> already
>> >> messy HA
>> >> >> > > > > > > > ecosystem. Another approach we've discussed was
>> mutating
>> >> the JobGraph
>> >> >> > > > > > and
>> >> >> > > > > > > > setting JRR into the JobGraph structure itself.
>> >> >> > > > > > > >
>> >> >> > > > > > > > The middle ground for keeping this effort reasonably
>> >> sized and not
>> >> >> > > > > > > > violating "we want to keep JG immutable" too much is
>> >> keeping the
>> >> >> > > > > > > > JobResourceRequirements separate as an internal config
>> >> option in
>> >> >> > > > > > > JobGraph's
>> >> >> > > > > > > > configuration.
>> >> >> > > > > > > >
>> >> >> > > > > > > > We ultimately need to rethink the tight coupling of
>> >> Dispatcher and
>> >> >> > > > > > > > JobMaster, but it needs to be a separate effort.
>> >> >> > > > > > > >
>> >> >> > > > > > > > ...also considering the amount of data that can be
>> stored
>> >> in a
>> >> >> > > > > > > > > ConfigMap/ZooKeeper node if versioning the resource
>> >> requirement
>> >> >> > > > > > change
>> >> >> > > > > > > as
>> >> >> > > > > > > > > proposed in my previous item is an option for us.
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > AFAIK we're only storing pointers to the S3 objects
>> in HA
>> >> metadata,
>> >> >> > > > > so
>> >> >> > > > > > we
>> >> >> > > > > > > > should be okay with having larger structures for now.
>> >> >> > > > > > > >
>> >> >> > > > > > > > Updating the JobGraphStore means adding more requests
>> to
>> >> the HA
>> >> >> > > > > backend
>> >> >> > > > > > > > API.
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > It's fine unless you intend to override the resource
>> >> requirements a
>> >> >> > > > > few
>> >> >> > > > > > > > times per second.
>> >> >> > > > > > > >
>> >> >> > > > > > > > *@Shammon*
>> >> >> > > > > > > >
>> >> >> > > > > > > > How about adding some more information such as vertex
>> type
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > Since it was intended as a "debug" endpoint, it makes
>> >> complete sense!
>> >> >> > > > > > > >
>> >> >> > > > > > > >  For sql jobs, we always use a unified parallelism for
>> >> most vertices.
>> >> >> > > > > > Can
>> >> >> > > > > > > > > we provide them with a more convenient setting
>> method
>> >> instead of
>> >> >> > > > > each
>> >> >> > > > > > > > one?
>> >> >> > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > I completely feel with this. The main thoughts when
>> >> designing the API
>> >> >> > > > > > > were:
>> >> >> > > > > > > > - We want to keep it clean and easy to understand.
>> >> >> > > > > > > > - Global parallelism can be modeled using per-vertex
>> >> parallelism but
>> >> >> > > > > > not
>> >> >> > > > > > > > the other way around.
>> >> >> > > > > > > > - The API will be used by external tooling (operator,
>> >> auto scaler).
>> >> >> > > > > > > >
>> >> >> > > > > > > > Can you elaborate more about how you'd intend to use
>> the
>> >> endpoint? I
>> >> >> > > > > > > think
>> >> >> > > > > > > > we can ultimately introduce a way of re-declaring
>> >> "per-vertex
>> >> >> > > > > > defaults,"
>> >> >> > > > > > > > but I'd like to understand the use case bit more
>> first.
>> >> >> > > > > > > >
>> >> >> > > > > > > > *@Weijie*
>> >> >> > > > > > > >
>> >> >> > > > > > > > What is the default value here (based on what
>> >> configuration), or just
>> >> >> > > > > > > > > infinite?
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > Currently, for the lower bound, it's always one, and
>> for
>> >> the upper
>> >> >> > > > > > bound,
>> >> >> > > > > > > > it's either parallelism (if defined) or the
>> >> maxParallelism of the
>> >> >> > > > > > vertex
>> >> >> > > > > > > in
>> >> >> > > > > > > > JobGraph. This question might be another signal for
>> >> making the
>> >> >> > > > > defaults
>> >> >> > > > > > > > explicit (see the answer to Shammon's question above).
>> >> >> > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > > > Thanks, everyone, for your initial thoughts!
>> >> >> > > > > > > >
>> >> >> > > > > > > > Best,
>> >> >> > > > > > > > D.
>> >> >> > > > > > > >
>> >> >> > > > > > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <
>> >> guoweijiereswqa@gmail.com
>> >> >> > > > > >
>> >> >> > > > > > > > wrote:
>> >> >> > > > > > > >
>> >> >> > > > > > > > > Thanks David for driving this. This is a very
>> valuable
>> >> work,
>> >> >> > > > > > especially
>> >> >> > > > > > > > for
>> >> >> > > > > > > > > cloud native environment.
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > >> How about adding some more information such as
>> >> vertex type
>> >> >> > > > > > > > > (SOURCE/MAP/JOIN and .etc) in the response of `get
>> jobs
>> >> >> > > > > > > > > resource-requirements`? For users, only vertex-id
>> may
>> >> be difficult
>> >> >> > > > > to
>> >> >> > > > > > > > > understand.
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > +1 for this suggestion, including jobvertex's name
>> in
>> >> the response
>> >> >> > > > > > body
>> >> >> > > > > > > > is
>> >> >> > > > > > > > > more
>> >> >> > > > > > > > > user-friendly.
>> >> >> > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > I saw this sentence in FLIP: "Setting the upper
>> bound
>> >> to -1 will
>> >> >> > > > > > reset
>> >> >> > > > > > > > the
>> >> >> > > > > > > > > value to the default setting."  What is the default
>> >> value here
>> >> >> > > > > (based
>> >> >> > > > > > > on
>> >> >> > > > > > > > > what configuration), or just infinite?
>> >> >> > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > Best regards,
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > Weijie
>> >> >> > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > Shammon FY <zj...@gmail.com> 于2023年2月6日周一
>> 18:06写道:
>> >> >> > > > > > > > >
>> >> >> > > > > > > > > > Hi David
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > Thanks for initiating this discussion. I think
>> >> declaring job
>> >> >> > > > > > resource
>> >> >> > > > > > > > > > requirements by REST API is very valuable. I just
>> >> left some
>> >> >> > > > > > comments
>> >> >> > > > > > > as
>> >> >> > > > > > > > > > followed
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > 1) How about adding some more information such as
>> >> vertex type
>> >> >> > > > > > > > > > (SOURCE/MAP/JOIN and .etc) in the response of `get
>> >> jobs
>> >> >> > > > > > > > > > resource-requirements`? For users, only vertex-id
>> may
>> >> be
>> >> >> > > > > difficult
>> >> >> > > > > > to
>> >> >> > > > > > > > > > understand.
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > 2) For sql jobs, we always use a unified
>> parallelism
>> >> for most
>> >> >> > > > > > > vertices.
>> >> >> > > > > > > > > Can
>> >> >> > > > > > > > > > we provide them with a more convenient setting
>> method
>> >> instead of
>> >> >> > > > > > each
>> >> >> > > > > > > > > one?
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > Best,
>> >> >> > > > > > > > > > Shammon
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <
>> >> >> > > > > > matthias.pohl@aiven.io
>> >> >> > > > > > > > > > .invalid>
>> >> >> > > > > > > > > > wrote:
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > > > > Thanks David for creating this FLIP. It sounds
>> >> promising and
>> >> >> > > > > > useful
>> >> >> > > > > > > > to
>> >> >> > > > > > > > > > > have. Here are some thoughts from my side (some
>> of
>> >> them might
>> >> >> > > > > be
>> >> >> > > > > > > > > rather a
>> >> >> > > > > > > > > > > follow-up and not necessarily part of this
>> FLIP):
>> >> >> > > > > > > > > > > - I'm wondering whether it makes sense to add
>> some
>> >> kind of
>> >> >> > > > > > resource
>> >> >> > > > > > > > ID
>> >> >> > > > > > > > > to
>> >> >> > > > > > > > > > > the REST API. This would give Flink a tool to
>> >> verify the PATCH
>> >> >> > > > > > > > request
>> >> >> > > > > > > > > of
>> >> >> > > > > > > > > > > the external system in a compare-and-set kind of
>> >> manner. AFAIU,
>> >> >> > > > > > the
>> >> >> > > > > > > > > > process
>> >> >> > > > > > > > > > > requires the external system to retrieve the
>> >> resource
>> >> >> > > > > > requirements
>> >> >> > > > > > > > > first
>> >> >> > > > > > > > > > > (to retrieve the vertex IDs). A resource ID
>> <ABC>
>> >> would be sent
>> >> >> > > > > > > along
>> >> >> > > > > > > > > as
>> >> >> > > > > > > > > > a
>> >> >> > > > > > > > > > > unique identifier for the provided setup. It's
>> >> essentially the
>> >> >> > > > > > > > version
>> >> >> > > > > > > > > ID
>> >> >> > > > > > > > > > > of the currently deployed resource requirement
>> >> configuration.
>> >> >> > > > > > Flink
>> >> >> > > > > > > > > > doesn't
>> >> >> > > > > > > > > > > know whether the external system would use the
>> >> provided
>> >> >> > > > > > information
>> >> >> > > > > > > > in
>> >> >> > > > > > > > > > some
>> >> >> > > > > > > > > > > way to derive a new set of resource requirements
>> >> for this job.
>> >> >> > > > > > The
>> >> >> > > > > > > > > > > subsequent PATCH request with updated resource
>> >> requirements
>> >> >> > > > > would
>> >> >> > > > > > > > > include
>> >> >> > > > > > > > > > > the previously retrieved resource ID <ABC>. The
>> >> PATCH call
>> >> >> > > > > would
>> >> >> > > > > > > fail
>> >> >> > > > > > > > > if
>> >> >> > > > > > > > > > > there was a concurrent PATCH call in between
>> >> indicating to the
>> >> >> > > > > > > > external
>> >> >> > > > > > > > > > > system that the resource requirements were
>> >> concurrently
>> >> >> > > > > updated.
>> >> >> > > > > > > > > > > - How often do we allow resource requirements
>> to be
>> >> changed?
>> >> >> > > > > That
>> >> >> > > > > > > > > > question
>> >> >> > > > > > > > > > > might make my previous comment on the resource
>> ID
>> >> obsolete
>> >> >> > > > > > because
>> >> >> > > > > > > we
>> >> >> > > > > > > > > > could
>> >> >> > > > > > > > > > > just make any PATCH call fail if there was a
>> >> resource
>> >> >> > > > > requirement
>> >> >> > > > > > > > > update
>> >> >> > > > > > > > > > > within a certain time frame before the request.
>> But
>> >> such a time
>> >> >> > > > > > > > period
>> >> >> > > > > > > > > is
>> >> >> > > > > > > > > > > something we might want to make configurable
>> then,
>> >> I guess.
>> >> >> > > > > > > > > > > - Versioning the JobGraph in the JobGraphStore
>> >> rather than
>> >> >> > > > > > > > overwriting
>> >> >> > > > > > > > > it
>> >> >> > > > > > > > > > > might be an idea. This would enable us to
>> provide
>> >> resource
>> >> >> > > > > > > > requirement
>> >> >> > > > > > > > > > > changes in the UI or through the REST API. It is
>> >> related to a
>> >> >> > > > > > > problem
>> >> >> > > > > > > > > > > around keeping track of the exception history
>> >> within the
>> >> >> > > > > > > > > > AdaptiveScheduler
>> >> >> > > > > > > > > > > and also having to consider multiple versions
>> of a
>> >> JobGraph.
>> >> >> > > > > But
>> >> >> > > > > > > for
>> >> >> > > > > > > > > that
>> >> >> > > > > > > > > > > one, we use the ExecutionGraphInfoStore right
>> now.
>> >> >> > > > > > > > > > > - Updating the JobGraph in the JobGraphStore
>> makes
>> >> sense. I'm
>> >> >> > > > > > just
>> >> >> > > > > > > > > > > wondering whether we bundle two things together
>> >> that are
>> >> >> > > > > actually
>> >> >> > > > > > > > > > separate:
>> >> >> > > > > > > > > > > The business logic and the execution
>> configuration
>> >> (the
>> >> >> > > > > resource
>> >> >> > > > > > > > > > > requirements). I'm aware that this is not a
>> flaw of
>> >> the current
>> >> >> > > > > > > FLIP
>> >> >> > > > > > > > > but
>> >> >> > > > > > > > > > > rather something that was not necessary to
>> address
>> >> in the past
>> >> >> > > > > > > > because
>> >> >> > > > > > > > > > the
>> >> >> > > > > > > > > > > JobGraph was kind of static. I don't remember
>> >> whether that was
>> >> >> > > > > > > > already
>> >> >> > > > > > > > > > > discussed while working on the AdaptiveScheduler
>> >> for FLIP-160
>> >> >> > > > > > [1].
>> >> >> > > > > > > > > Maybe,
>> >> >> > > > > > > > > > > I'm missing some functionality here that
>> requires
>> >> us to have
>> >> >> > > > > > > > everything
>> >> >> > > > > > > > > > in
>> >> >> > > > > > > > > > > one place. But it feels like updating the entire
>> >> JobGraph which
>> >> >> > > > > > > could
>> >> >> > > > > > > > > be
>> >> >> > > > > > > > > > > actually a "config change" is not reasonable.
>> >> ...also
>> >> >> > > > > considering
>> >> >> > > > > > > the
>> >> >> > > > > > > > > > > amount of data that can be stored in a
>> >> ConfigMap/ZooKeeper node
>> >> >> > > > > > if
>> >> >> > > > > > > > > > > versioning the resource requirement change as
>> >> proposed in my
>> >> >> > > > > > > previous
>> >> >> > > > > > > > > > item
>> >> >> > > > > > > > > > > is an option for us.
>> >> >> > > > > > > > > > > - Updating the JobGraphStore means adding more
>> >> requests to the
>> >> >> > > > > HA
>> >> >> > > > > > > > > backend
>> >> >> > > > > > > > > > > API. There were some concerns shared in the
>> >> discussion thread
>> >> >> > > > > [2]
>> >> >> > > > > > > for
>> >> >> > > > > > > > > > > FLIP-270 [3] on pressuring the k8s API server in
>> >> the past with
>> >> >> > > > > > too
>> >> >> > > > > > > > many
>> >> >> > > > > > > > > > > calls. Eventhough, it's more likely to be
>> caused by
>> >> >> > > > > > checkpointing,
>> >> >> > > > > > > I
>> >> >> > > > > > > > > > still
>> >> >> > > > > > > > > > > wanted to bring it up. We're working on a
>> >> standardized
>> >> >> > > > > > performance
>> >> >> > > > > > > > test
>> >> >> > > > > > > > > > to
>> >> >> > > > > > > > > > > prepare going forward with FLIP-270 [3] right
>> now.
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > Best,
>> >> >> > > > > > > > > > > Matthias
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > [1]
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > >
>> >> >> > > > >
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
>> >> >> > > > > > > > > > > [2]
>> >> >> > > > > > >
>> >> https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
>> >> >> > > > > > > > > > > [3]
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > >
>> >> >> > > > >
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <
>> >> jam.gzczy@gmail.com
>> >> >> > > > > >
>> >> >> > > > > > > > wrote:
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > > > > Hi David:
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > > Thank you for drive this flip, which helps
>> less
>> >> flink
>> >> >> > > > > shutdown
>> >> >> > > > > > > time
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > > for this flip, I would like to make a few
>> idea on
>> >> share
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >    - when the number of "slots" is
>> insufficient,
>> >> can we can
>> >> >> > > > > > stop
>> >> >> > > > > > > > > users
>> >> >> > > > > > > > > > > >    rescaling or throw something to tell user
>> >> "less avaliable
>> >> >> > > > > > > slots
>> >> >> > > > > > > > to
>> >> >> > > > > > > > > > > > upgrade,
>> >> >> > > > > > > > > > > >    please checkout your alivalbe slots" ? Or
>> we
>> >> could have a
>> >> >> > > > > > > > request
>> >> >> > > > > > > > > > > >    switch(true/false) to allow this behavior
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >    - when user upgrade job-vertx-parallelism
>> . I
>> >> want to have
>> >> >> > > > > > an
>> >> >> > > > > > > > > > > interface
>> >> >> > > > > > > > > > > >    to query the current update parallel
>> execution
>> >> status, so
>> >> >> > > > > > that
>> >> >> > > > > > > > the
>> >> >> > > > > > > > > > > user
>> >> >> > > > > > > > > > > > or
>> >> >> > > > > > > > > > > >    program can understand the current status
>> >> >> > > > > > > > > > > >    - I want to have an interface to query the
>> >> current update
>> >> >> > > > > > > > > > parallelism
>> >> >> > > > > > > > > > > >    execution status. This also helps similar
>> to
>> >> *[1] Flink
>> >> >> > > > > K8S
>> >> >> > > > > > > > > > Operator*
>> >> >> > > > > > > > > > > >    management
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > > {
>> >> >> > > > > > > > > > > >   status: Failed
>> >> >> > > > > > > > > > > >   reason: "less avaliable slots to upgrade,
>> >> please checkout
>> >> >> > > > > > your
>> >> >> > > > > > > > > > alivalbe
>> >> >> > > > > > > > > > > > slots"
>> >> >> > > > > > > > > > > > }
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >    - *Pending*: this job now is join the
>> upgrade
>> >> queue,it
>> >> >> > > > > will
>> >> >> > > > > > be
>> >> >> > > > > > > > > > update
>> >> >> > > > > > > > > > > >    later
>> >> >> > > > > > > > > > > >    - *Rescaling*: job now is rescaling,wait it
>> >> finish
>> >> >> > > > > > > > > > > >    - *Finished*: finish do it
>> >> >> > > > > > > > > > > >    - *Failed* : something have wrong,so this
>> job
>> >> is not
>> >> >> > > > > > alivable
>> >> >> > > > > > > > > > upgrade
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > > I want to supplement my above content in flip,
>> >> what do you
>> >> >> > > > > > think
>> >> >> > > > > > > ?
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >    1.
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > >
>> >> >> > > > >
>> >>
>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > > David Morávek <dm...@apache.org> 于2023年2月3日周五
>> >> 16:42写道:
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > Hi everyone,
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > This FLIP [1] introduces a new REST API for
>> >> declaring
>> >> >> > > > > > resource
>> >> >> > > > > > > > > > > > requirements
>> >> >> > > > > > > > > > > > > for the Adaptive Scheduler. There seems to
>> be a
>> >> clear need
>> >> >> > > > > > for
>> >> >> > > > > > > > this
>> >> >> > > > > > > > > > API
>> >> >> > > > > > > > > > > > > based on the discussion on the "Reworking
>> the
>> >> Rescale API"
>> >> >> > > > > > [2]
>> >> >> > > > > > > > > > thread.
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > Before we get started, this work is heavily
>> >> based on the
>> >> >> > > > > > > > prototype
>> >> >> > > > > > > > > > [3]
>> >> >> > > > > > > > > > > > > created by Till Rohrmann, and the FLIP is
>> being
>> >> published
>> >> >> > > > > > with
>> >> >> > > > > > > > his
>> >> >> > > > > > > > > > > > consent.
>> >> >> > > > > > > > > > > > > Big shoutout to him!
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > Last and not least, thanks to Chesnay and
>> Roman
>> >> for the
>> >> >> > > > > > initial
>> >> >> > > > > > > > > > reviews
>> >> >> > > > > > > > > > > > and
>> >> >> > > > > > > > > > > > > discussions.
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > The best start would be watching a short
>> demo
>> >> [4] that I've
>> >> >> > > > > > > > > recorded,
>> >> >> > > > > > > > > > > > which
>> >> >> > > > > > > > > > > > > illustrates newly added capabilities
>> (rescaling
>> >> the running
>> >> >> > > > > > > job,
>> >> >> > > > > > > > > > > handing
>> >> >> > > > > > > > > > > > > back resources to the RM, and session
>> cluster
>> >> support).
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > The intuition behind the FLIP is being able
>> to
>> >> define
>> >> >> > > > > > resource
>> >> >> > > > > > > > > > > > requirements
>> >> >> > > > > > > > > > > > > ("resource boundaries") externally that the
>> >> >> > > > > AdaptiveScheduler
>> >> >> > > > > > > can
>> >> >> > > > > > > > > > > > navigate
>> >> >> > > > > > > > > > > > > within. This is a building block for
>> >> higher-level efforts
>> >> >> > > > > > such
>> >> >> > > > > > > as
>> >> >> > > > > > > > > an
>> >> >> > > > > > > > > > > > > external Autoscaler. The natural extension
>> of
>> >> this work
>> >> >> > > > > would
>> >> >> > > > > > > be
>> >> >> > > > > > > > to
>> >> >> > > > > > > > > > > allow
>> >> >> > > > > > > > > > > > > to specify per-vertex ResourceProfiles.
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > Looking forward to your thoughts; any
>> feedback
>> >> is
>> >> >> > > > > > appreciated!
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > [1]
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > >
>> >> >> > > > >
>> >>
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
>> >> >> > > > > > > > > > > > > [2]
>> >> >> > > > > > > > >
>> >> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
>> >> >> > > > > > > > > > > > > [3]
>> >> https://github.com/tillrohrmann/flink/tree/autoscaling
>> >> >> > > > > > > > > > > > > [4]
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > >
>> >> https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > > > Best,
>> >> >> > > > > > > > > > > > > D.
>> >> >> > > > > > > > > > > > >
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > > --
>> >> >> > > > > > > > > > > > Best
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > > > ConradJam
>> >> >> > > > > > > > > > > >
>> >> >> > > > > > > > > > >
>> >> >> > > > > > > > > >
>> >> >> > > > > > > > >
>> >> >> > > > > > > >
>> >> >> > > > > > >
>> >> >> > > > > >
>> >> >> > > > >
>> >>
>>
>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by David Morávek <dm...@apache.org>.
Hi Everyone,

@John

This is a problem that we've spent some time trying to crack; in the end,
we've decided against doing any updates to the JobGraphStore from the
JobMaster, to avoid having multiple writers guarded by different
leader election locks (the Dispatcher and the JobMaster might live in
different processes). The contract we've chosen instead is leveraging the
idempotency of the endpoint and having the user of the API retry in case
we're unable to persist new requirements in the JobGraphStore [1]. We
eventually need to move the JobGraphStore out of the Dispatcher, but that's way
out of the scope of this FLIP. The solution is a deliberate trade-off. The
worst scenario is that the Dispatcher fails over in between retries, which
would simply rescale the job to meet the previous resource requirements
(more extended unavailability of underlying HA storage would have worse
consequences than this). Does that answer your question?
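To make the retry contract described above concrete, the caller's side can be sketched as follows. This is an illustrative helper, not actual Flink client code; the helper name and backoff policy are assumptions. The key point is that because the endpoint is idempotent, blindly re-issuing the same request is safe:

```java
import java.util.function.Supplier;

// Sketch of the client-side contract: since re-declaring resource requirements
// is idempotent, the caller can simply retry the same PUT until the new
// requirements are persisted (e.g. across a brief HA-storage outage).
final class IdempotentRetry {

    static <T> T retry(Supplier<T> idempotentAction, int maxAttempts, long backoffMillis) {
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return idempotentAction.get();
            } catch (RuntimeException e) {
                lastFailure = e; // e.g. JobGraphStore briefly unavailable
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("retry interrupted", ie);
                    }
                }
            }
        }
        throw lastFailure;
    }
}
```

In the worst case sketched above (a Dispatcher failover between retries), the caller's next retry simply re-declares the same requirements against the recovered job.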

@Matthias

Good catch! I'm fixing it now, thanks!

[1]
https://github.com/dmvk/flink/commit/5e7edcb77d8522c367bc6977f80173b14dc03ce9#diff-a4b690fb2c4975d25b05eb4161617af0d704a85ff7b1cad19d3c817c12f1e29cR1151

Best,
D.

On Tue, Feb 21, 2023 at 12:24 AM John Roesler <vv...@apache.org> wrote:

> Thanks for the FLIP, David!
>
> I just had one small question. IIUC, the REST API PUT request will go
> through the new DispatcherGateway method to be handled. Then, after
> validation, the dispatcher would call the new JobMasterGateway method to
> actually update the job.
>
> Which component will write the updated JobGraph? I just wanted to make
> sure it’s the JobMaster because if it were the dispatcher, there could be a
> race condition with the async JobMaster method.
>
> Thanks!
> -John
>
> On Mon, Feb 20, 2023, at 07:34, Matthias Pohl wrote:
> > Thanks for your clarifications, David. I don't have any additional major
> > points to add. One thing about the FLIP: The RPC layer API for updating
> the
> > JRR returns a future with a JRR? I don't see value in returning a JRR
> here
> > since it's an idempotent operation? Wouldn't it be enough to return
> > CompletableFuture<Void> here? Or am I missing something?
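For illustration, an acknowledgement-only RPC along the lines Matthias suggests could look like this. The type and method names are hypothetical, not the actual Flink gateway interfaces; the point is that an idempotent update only needs to signal completion, not return a body:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical per-vertex parallelism bounds, as discussed in the FLIP.
record ParallelismRange(int lowerBound, int upperBound) {}

// Hypothetical gateway shape: the future carries no payload because the
// caller already knows the requirements it sent (the update is idempotent).
interface JobMasterGatewaySketch {
    CompletableFuture<Void> updateJobResourceRequirements(Map<String, ParallelismRange> requirements);
}

final class AckOnlyGateway implements JobMasterGatewaySketch {
    @Override
    public CompletableFuture<Void> updateJobResourceRequirements(Map<String, ParallelismRange> requirements) {
        // A real implementation would persist the requirements before completing;
        // here we only validate the bounds and acknowledge.
        for (ParallelismRange r : requirements.values()) {
            if (r.lowerBound() < 1 || r.lowerBound() > r.upperBound()) {
                throw new IllegalArgumentException("invalid range: " + r);
            }
        }
        return CompletableFuture.completedFuture(null);
    }
}
```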
> >
> > Matthias
> >
> > On Mon, Feb 20, 2023 at 1:48 PM Maximilian Michels <mx...@apache.org>
> wrote:
> >
> >> Thanks David! If we could get the pre-allocation working as part of
> >> the FLIP, that would be great.
> >>
> >> Concerning the downscale case, I agree this is a special case for the
> >> (single-job) application mode where we could re-allocate slots in a
> >> way that could leave entire task managers unoccupied which we would
> >> then be able to release. The goal essentially is to reduce slot
> >> fragmentation on scale down by packing the slots efficiently. The
> >> easiest way to add this optimization when running in application mode
> >> would be to drop as many task managers during the restart such that
> >> NUM_AVAILABLE_SLOTS >= NUM_REQUIRED_SLOTS stays true. We can look into
> >> this independently of the FLIP.
> >>
> >> Feel free to start the vote.
> >>
> >> -Max
> >>
> >> On Mon, Feb 20, 2023 at 9:10 AM David Morávek <dm...@apache.org> wrote:
> >> >
> >> > Hi everyone,
> >> >
> >> > Thanks for the feedback! I've updated the FLIP to use idempotent PUT
> API
> >> instead of PATCH and to properly handle lower bound settings, to support
> >> the "pre-allocation" of the resources.
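For concreteness, an idempotent PUT body re-declaring per-vertex bounds could take roughly this shape (the field names are an illustration of the idea, not necessarily the FLIP's final schema; `<vertex-id>` stands in for a real JobVertexID):

```json
{
  "<vertex-id>": { "parallelism": { "lowerBound": 1, "upperBound": 4 } },
  "<other-vertex-id>": { "parallelism": { "lowerBound": 2, "upperBound": 8 } }
}
```

Because the whole document is replaced on each PUT, retrying the identical request is safe, which is what makes the endpoint idempotent.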
> >> >
> >> > @Max
> >> >
> >> > > How hard would it be to address this issue in the FLIP?
> >> >
> >> > I've included this in the FLIP. It might not be too hard to implement
> >> this in the end.
> >> >
> >> > > B) drop as many superfluous task managers as needed
> >> >
> >> > I've intentionally left this part out for now because this ultimately
> >> needs to be the responsibility of the Resource Manager. After all, in
> the
> >> Session Cluster scenario, the Scheduler doesn't have the bigger picture
> of
> >> other tasks of other jobs running on those TMs. This will most likely
> be a
> >> topic for another FLIP.
> >> >
> >> > WDYT? If there are no other questions or concerns, I'd like to start
> the
> >> vote on Wednesday.
> >> >
> >> > Best,
> >> > D.
> >> >
> >> > On Wed, Feb 15, 2023 at 3:34 PM Maximilian Michels <mx...@apache.org>
> >> wrote:
> >> >>
> >> >> I missed that the FLIP states:
> >> >>
> >> >> > Currently, even though we’d expose the lower bound for clarity and
> >> API completeness, we won’t allow setting it to any other value than one
> >> until we have full support throughout the stack.
> >> >>
> >> >> How hard would it be to address this issue in the FLIP?
> >> >>
> >> >> There is not much value in offering a lower bound which won't be
> >> >> respected / throw an error when it is set. If we had support for a
> >> >> lower bound, we could enforce a resource contract externally via
> >> >> setting lowerBound == upperBound. That ties back to the Rescale API
> >> >> discussion we had. I want to better understand what the major
> concerns
> >> >> would be around allowing this.
> >> >>
> >> >> Just to outline how I imagine the logic to work:
> >> >>
> >> >> A) The resource constraints are already met => Nothing changes
> >> >> B) More resources available than required => Cancel the job, drop as
> >> >> many superfluous task managers as needed, restart the job
> >> >> C) Less resources available than required => Acquire new task
> >> >> managers, wait for them to register, cancel and restart the job
> >> >>
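The three cases outlined above can be sketched as a small decision function; the function name and the returned action labels are illustrative only, not Flink APIs:

```python
def plan_rescale(required_slots: int, available_slots: int) -> str:
    """Toy sketch of the three cases; names and return values are
    illustrative only and not part of any Flink API."""
    if available_slots == required_slots:
        # A) The resource constraints are already met => nothing changes.
        return "NOTHING_TO_DO"
    if available_slots > required_slots:
        # B) More resources than required => cancel the job, drop
        #    superfluous task managers, restart the job.
        return "RELEASE_AND_RESTART"
    # C) Fewer resources than required => acquire new task managers,
    #    wait for them to register, then cancel and restart the job.
    return "ACQUIRE_AND_RESTART"

print(plan_rescale(required_slots=8, available_slots=10))  # RELEASE_AND_RESTART
```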
> >> >> I'm open to helping out with the implementation.
> >> >>
> >> >> -Max
> >> >>
> >> >> On Mon, Feb 13, 2023 at 7:45 PM Maximilian Michels <mx...@apache.org>
> >> wrote:
> >> >> >
> >> >> > Based on further discussion I had with Chesnay on this PR [1], I
> think
> >> >> > jobs would currently go into a restarting state after the resource
> >> >> > requirements have changed. This wouldn't achieve what we had in
> mind,
> >> >> > i.e. sticking to the old resource requirements until enough slots
> are
> >> >> > available to fulfil the new resource requirements. So this may not
> be
> >> >> > 100% what we need but it could be extended to do what we want.
> >> >> >
> >> >> > -Max
> >> >> >
> >> >> > [1]
> https://github.com/apache/flink/pull/21908#discussion_r1104792362
> >> >> >
> >> >> > On Mon, Feb 13, 2023 at 7:16 PM Maximilian Michels <mxm@apache.org
> >
> >> wrote:
> >> >> > >
> >> >> > > Hi David,
> >> >> > >
> >> >> > > This is awesome! Great writeup and demo. This is pretty much
> what we
> >> >> > > need for the autoscaler as part of the Flink Kubernetes operator
> >> [1].
> >> >> > > Scaling Flink jobs effectively is hard but fortunately we have
> >> solved
> >> >> > > the issue as part of the Flink Kubernetes operator. The only
> >> critical
> >> >> > > piece we are missing is a better way to execute scaling
> decisions,
> >> as
> >> >> > > discussed in [2].
> >> >> > >
> >> >> > > Looking at your proposal, we would set lowerBound == upperBound
> for
> >> >> > > the parallelism because we want to fully determine the
> parallelism
> >> >> > > externally based on the scaling metrics. Does that sound right?
> >> >> > >
> >> >> > > What is the timeline for these changes? Is there a JIRA?
> >> >> > >
> >> >> > > Cheers,
> >> >> > > Max
> >> >> > >
> >> >> > > [1]
> >>
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
> >> >> > > [2]
> >> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> >> >> > >
> >> >> > > On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu <
> xiangyu0xf@gmail.com>
> >> wrote:
> >> >> > > >
> >> >> > > > Hi David,
> >> >> > > >
> >> >> > > > Thanks for your reply. I think your response totally makes
> >> >> > > > sense. This FLIP targets declaring required resources to the
> >> >> > > > ResourceManager instead of using the ResourceManager to add/remove
> >> >> > > > TMs directly.
> >> >> > > >
> >> >> > > > Best,
> >> >> > > > Xiangyu
> >> >> > > >
> >> >> > > >
> >> >> > > >
> >> >> > > > David Morávek <da...@gmail.com> 于2023年2月13日周一 15:46写道:
> >> >> > > >
> >> >> > > > > Hi everyone,
> >> >> > > > >
> >> >> > > > > @Shammon
> >> >> > > > >
> >> >> > > > > I'm not entirely sure what "config file" you're referring to.
> >> You can, of
> >> >> > > > > course, override the default parallelism in
> "flink-conf.yaml",
> >> but for
> >> >> > > > > sinks and sources, the parallelism needs to be tweaked on the
> >> connector
> >> >> > > > > level ("WITH" statement).
> >> >> > > > >
> >> >> > > > > This is something that should be achieved with tooling around
> >> Flink. We
> >> >> > > > > want to provide an API on the lowest level that generalizes
> >> well. Achieving
> >> >> > > > > what you're describing should be straightforward with this
> API.
> >> >> > > > >
> >> >> > > > > @Xiangyu
> >> >> > > > >
> >> >> > > > > Is it possible for this REST API to declare TM resources in
> the
> >> future?
> >> >> > > > >
> >> >> > > > >
> >> >> > > > > Would you like to add/remove TMs if you use an active
> Resource
> >> Manager?
> >> >> > > > > This would be out of the scope of this effort since it
> targets
> >> the
> >> >> > > > > scheduler component only (we make no assumptions about the
> used
> >> Resource
> >> >> > > > > Manager). Also, the AdaptiveScheduler is only intended to be
> >> used for
> >> >> > > > > Streaming.
> >> >> > > > >
> >> >> > > > >  And for streaming jobs, I'm wondering if there is any
> >> situation we need to
> >> >> > > > > > rescale the TM resources of a flink cluster at first and
> then
> >> the
> >> >> > > > > adaptive
> >> >> > > > > > scheduler will rescale the per-vertex ResourceProfiles
> >> accordingly.
> >> >> > > > > >
> >> >> > > > >
> >> >> > > > > We plan on adding support for the ResourceProfiles (dynamic
> slot
> >> >> > > > > allocation) as the next step. Again we won't make any
> >> assumptions about the
> >> >> > > > > used Resource Manager. In other words, this effort ends by
> >> declaring
> >> >> > > > > desired resources to the Resource Manager.
> >> >> > > > >
> >> >> > > > > Does that make sense?
> >> >> > > > >
> >> >> > > > > @Matthias
> >> >> > > > >
> >> >> > > > > We've done another pass on the proposed API and currently
> lean
> >> towards
> >> >> > > > > having an idempotent PUT API.
> >> >> > > > > - We don't care too much about multiple writers' scenarios in
> >> terms of who
> >> >> > > > > can write an authoritative payload; this is up to the user of
> >> the API to
> >> >> > > > > figure out
> >> >> > > > > - It's indeed tricky to achieve atomicity with PATCH API;
> >> switching to PUT
> >> >> > > > > API seems to do the trick
> >> >> > > > > - We won't allow partial "payloads" anymore, meaning you need
> >> to define
> >> >> > > > > requirements for all vertices in the JobGraph; This is
> >> completely fine for
> >> >> > > > > the programmatic workflows. For DEBUG / DEMO purposes, you
> can
> >> use the GET
> >> >> > > > > endpoint and tweak the response to avoid writing the whole
> >> payload by hand.
> >> >> > > > >
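To make the PUT contract concrete, a complete request body could look roughly like the sketch below; the vertex IDs and the exact field names are assumptions for illustration, not the final FLIP schema:

```python
import json

# Hypothetical body for: PUT /jobs/<job-id>/resource-requirements
# A PUT must cover *every* vertex in the JobGraph, since partial
# payloads are no longer allowed. The vertex IDs and field names
# below are made up for illustration.
requirements = {
    "bc764cd8ddf7a0cff126f51c16239658": {"parallelism": {"lowerBound": 1, "upperBound": 4}},
    "0a448493b4782967b150582570326227": {"parallelism": {"lowerBound": 1, "upperBound": 8}},
}

body = json.dumps(requirements, indent=2)
print(body)
```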
> >> >> > > > > WDYT?
> >> >> > > > >
> >> >> > > > >
> >> >> > > > > Best,
> >> >> > > > > D.
> >> >> > > > >
> >> >> > > > > On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <
> >> xiangyu0xf@gmail.com>
> >> >> > > > > wrote:
> >> >> > > > >
> >> >> > > > > > Hi David,
> >> >> > > > > >
> >> >> > > > > > Thanks for creating this FLIP. I think this work is very
> >> >> > > > > > useful, especially in autoscaling scenarios. I would like to
> >> >> > > > > > share some questions from my view.
> >> >> > > > > >
> >> >> > > > > > 1, Is it possible for this REST API to declare TM resources
> >> in the
> >> >> > > > > future?
> >> >> > > > > > I'm asking because we are building the autoscaling feature
> >> for Flink OLAP
> >> >> > > > > > Session Cluster in ByteDance. We need to rescale the
> >> cluster's resource
> >> >> > > > > on
> >> >> > > > > > TM level instead of Job level. It would be very helpful if
> we
> >> have a REST
> >> >> > > > > > API for our external Autoscaling service to use.
> >> >> > > > > >
> >> >> > > > > > 2, And for streaming jobs, I'm wondering if there is any
> >> situation we
> >> >> > > > > need
> >> >> > > > > > to rescale the TM resources of a flink cluster at first and
> >> then the
> >> >> > > > > > adaptive scheduler will rescale the per-vertex
> >> ResourceProfiles
> >> >> > > > > > accordingly.
> >> >> > > > > >
> >> >> > > > > > best.
> >> >> > > > > > Xiangyu
> >> >> > > > > >
> >> >> > > > > > Shammon FY <zj...@gmail.com> 于2023年2月9日周四 11:31写道:
> >> >> > > > > >
> >> >> > > > > > > Hi David
> >> >> > > > > > >
> >> >> > > > > > > Thanks for your answer.
> >> >> > > > > > >
> >> >> > > > > > > > Can you elaborate more about how you'd intend to use the
> >> >> > > > > > > > endpoint? I think we can ultimately introduce a way of
> >> >> > > > > > > > re-declaring "per-vertex defaults," but I'd like to
> >> >> > > > > > > > understand the use case a bit more first.
> >> >> > > > > > >
> >> >> > > > > > > For this issue, I mainly consider the consistency of user
> >> configuration
> >> >> > > > > > and
> >> >> > > > > > > job runtime. For sql jobs, users usually set specific
> >> parallelism for
> >> >> > > > > > > source and sink, and set a global parallelism for other
> >> operators.
> >> >> > > > > These
> >> >> > > > > > > config items are stored in a config file. For some
> >> high-priority jobs,
> >> >> > > > > > > users may want to manage them manually.
> >> >> > > > > > > 1. When users need to scale the parallelism, they should
> >> >> > > > > > > update the config file and restart the Flink job, which may
> >> >> > > > > > > take a long time.
> >> >> > > > > > > 2. After providing the REST API, users can just send a
> >> request to the
> >> >> > > > > job
> >> >> > > > > > > via REST API quickly after updating the config file.
> >> >> > > > > > > The configuration in the running job and config file
> should
> >> be the
> >> >> > > > > same.
> >> >> > > > > > > What do you think of this?
> >> >> > > > > > >
> >> >> > > > > > > best.
> >> >> > > > > > > Shammon
> >> >> > > > > > >
> >> >> > > > > > >
> >> >> > > > > > >
> >> >> > > > > > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <
> >> david.moravek@gmail.com>
> >> >> > > > > > > wrote:
> >> >> > > > > > >
> >> >> > > > > > > > Hi everyone,
> >> >> > > > > > > >
> >> >> > > > > > > > Let's try to answer the questions one by one.
> >> >> > > > > > > >
> >> >> > > > > > > > *@ConradJam*
> >> >> > > > > > > >
> >> >> > > > > > > > when the number of "slots" is insufficient, can we stop
> >> >> > > > > > > > users rescaling or throw something to tell the user "not
> >> >> > > > > > > > enough available slots to upgrade, please check your
> >> >> > > > > > > > available slots"?
> >> >> > > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > The main property of AdaptiveScheduler is that it can
> >> adapt to
> >> >> > > > > > "available
> >> >> > > > > > > > resources," which means you're still able to make
> >> progress even
> >> >> > > > > though
> >> >> > > > > > > you
> >> >> > > > > > > > didn't get all the slots you've asked for. Let's break
> >> down the pros
> >> >> > > > > > and
> >> >> > > > > > > > cons of this property.
> >> >> > > > > > > >
> >> >> > > > > > > > - (plus) If you lose a TM for some reason, you can
> still
> >> recover even
> >> >> > > > > > if
> >> >> > > > > > > it
> >> >> > > > > > > > doesn't come back. We still need to give it some time
> to
> >> eliminate
> >> >> > > > > > > > unnecessary rescaling, which can be controlled by
> setting
> >> >> > > > > > > > "resource-stabilization-timeout."
> >> >> > > > > > > > - (plus) The resources can arrive with a significant
> >> delay. For
> >> >> > > > > > example,
> >> >> > > > > > > > you're unable to spawn enough TMs on time because
> you've
> >> run out of
> >> >> > > > > > > > resources in your k8s cluster, and you need to wait for
> >> the cluster
> >> >> > > > > > auto
> >> >> > > > > > > > scaler to kick in and add new nodes to the cluster. In
> >> this scenario,
> >> >> > > > > > > > you'll be able to start making progress faster, at the
> >> cost of
> >> >> > > > > multiple
> >> >> > > > > > > > rescalings (once the remaining resources arrive).
> >> >> > > > > > > > - (plus) This plays well with the declarative manner of
> >> today's
> >> >> > > > > > > > infrastructure. For example, you tell k8s that you need
> >> 10 TMs, and
> >> >> > > > > > > you'll
> >> >> > > > > > > > eventually get them.
> >> >> > > > > > > > - (minus) In the case of large state jobs, the cost of
> >> multiple
> >> >> > > > > > > rescalings
> >> >> > > > > > > > might outweigh the above.
> >> >> > > > > > > >
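For reference, this trade-off is steered by scheduler configuration. A flink-conf.yaml fragment along these lines (key names as I recall them; please double-check against the docs for your Flink version) enables the AdaptiveScheduler and tunes the stabilization window:

```yaml
# Enable the adaptive scheduler (streaming jobs only).
jobmanager.scheduler: adaptive
# How long to wait for additional resources to arrive before (re)scaling;
# a larger value trades slower start-up for fewer rescalings.
jobmanager.adaptive-scheduler.resource-stabilization-timeout: 60s
```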
> >> >> > > > > > > > We've already touched on the solution to this problem
> on
> >> the FLIP.
> >> >> > > > > > Please
> >> >> > > > > > > > notice the parallelism knob being a range with a lower
> >> and upper
> >> >> > > > > bound.
> >> >> > > > > > > > Setting both the lower and upper bound to the same
> value
> >> could give
> >> >> > > > > the
> >> >> > > > > > > > behavior you're describing at the cost of giving up
> some
> >> properties
> >> >> > > > > > that
> >> >> > > > > > > AS
> >> >> > > > > > > > gives you (you'd be falling back to the
> >> DefaultScheduler's behavior).
> >> >> > > > > > > >
> >> >> > > > > > > > when user upgrades job-vertex-parallelism, I want to have
> >> >> > > > > > > > an interface to query the current update parallel
> >> >> > > > > > > > execution status, so that the user or program can
> >> >> > > > > > > > understand the current status
> >> >> > > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > This is a misunderstanding. We're not introducing the
> >> RESCALE
> >> >> > > > > endpoint.
> >> >> > > > > > > > This endpoint allows you to re-declare the resources
> >> needed to run
> >> >> > > > > the
> >> >> > > > > > > job.
> >> >> > > > > > > > Once you reach the desired resources (you get more
> >> resources than the
> >> >> > > > > > > lower
> >> >> > > > > > > > bound defines), your job will run.
> >> >> > > > > > > >
> >> >> > > > > > > > We can expose a similar endpoint to "resource
> >> requirements" to give
> >> >> > > > > you
> >> >> > > > > > > an
> >> >> > > > > > > > overview of the resources the vertices already have.
> You
> >> can already
> >> >> > > > > > get
> >> >> > > > > > > > this from the REST API, so exposing this in yet another
> >> way should be
> >> >> > > > > > > > considered carefully.
> >> >> > > > > > > >
> >> >> > > > > > > > *@Matthias*
> >> >> > > > > > > >
> >> >> > > > > > > > I'm wondering whether it makes sense to add some kind
> of
> >> resource ID
> >> >> > > > > to
> >> >> > > > > > > the
> >> >> > > > > > > > > REST API.
> >> >> > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > That's a good question. I want to think about that and
> >> get back to
> >> >> > > > > the
> >> >> > > > > > > > question later. My main struggle when thinking about
> this
> >> is, "if
> >> >> > > > > this
> >> >> > > > > > > > would be an idempotent POST endpoint," would it be any
> >> different?
> >> >> > > > > > > >
> >> >> > > > > > > > How often do we allow resource requirements to be
> changed?
> >> >> > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > There shall be no rate limiting on the FLINK side. If
> >> this is
> >> >> > > > > something
> >> >> > > > > > > > your environment needs, you can achieve it on a
> different
> >> layer ("we
> >> >> > > > > > > can't
> >> >> > > > > > > > have FLINK to do everything").
> >> >> > > > > > > >
> >> >> > > > > > > > Versioning the JobGraph in the JobGraphStore rather
> than
> >> overwriting
> >> >> > > > > it
> >> >> > > > > > > > > might be an idea.
> >> >> > > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > This sounds interesting since it would be closer to the
> >> JobGraph
> >> >> > > > > being
> >> >> > > > > > > > immutable. The main problem I see here is that this
> would
> >> introduce a
> >> >> > > > > > > > BW-incompatible change so it might be a topic for
> >> follow-up FLIP.
> >> >> > > > > > > >
> >> >> > > > > > > > I'm just wondering whether we bundle two things
> together
> >> that are
> >> >> > > > > > > actually
> >> >> > > > > > > > > separate
> >> >> > > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > Yup, this is how we think about it as well. The main
> >> question is,
> >> >> > > > > "who
> >> >> > > > > > > > should be responsible for bookkeeping 1) the JobGraph
> and
> >> 2) the
> >> >> > > > > > > > JobResourceRequirements". The JobMaster would be the
> >> right place for
> >> >> > > > > > > both,
> >> >> > > > > > > > but it's currently not the case, and we're tightly
> >> coupling the
> >> >> > > > > > > dispatcher
> >> >> > > > > > > > with the JobMaster.
> >> >> > > > > > > >
> >> >> > > > > > > > Initially, we tried to introduce a separate HA
> component
> >> in JobMaster
> >> >> > > > > > for
> >> >> > > > > > > > bookkeeping the JobResourceRequirements, but that
> proved
> >> to be a more
> >> >> > > > > > > > significant effort adding additional mess to the
> already
> >> messy HA
> >> >> > > > > > > > ecosystem. Another approach we've discussed was
> mutating
> >> the JobGraph
> >> >> > > > > > and
> >> >> > > > > > > > setting JRR into the JobGraph structure itself.
> >> >> > > > > > > >
> >> >> > > > > > > > The middle ground for keeping this effort reasonably
> >> sized and not
> >> >> > > > > > > > violating "we want to keep JG immutable" too much is
> >> keeping the
> >> >> > > > > > > > JobResourceRequirements separate as an internal config
> >> option in
> >> >> > > > > > > JobGraph's
> >> >> > > > > > > > configuration.
> >> >> > > > > > > >
> >> >> > > > > > > > We ultimately need to rethink the tight coupling of
> >> Dispatcher and
> >> >> > > > > > > > JobMaster, but it needs to be a separate effort.
> >> >> > > > > > > >
> >> >> > > > > > > > ...also considering the amount of data that can be
> stored
> >> in a
> >> >> > > > > > > > > ConfigMap/ZooKeeper node if versioning the resource
> >> requirement
> >> >> > > > > > change
> >> >> > > > > > > as
> >> >> > > > > > > > > proposed in my previous item is an option for us.
> >> >> > > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > AFAIK we're only storing pointers to the S3 objects in
> HA
> >> metadata,
> >> >> > > > > so
> >> >> > > > > > we
> >> >> > > > > > > > should be okay with having larger structures for now.
> >> >> > > > > > > >
> >> >> > > > > > > > Updating the JobGraphStore means adding more requests
> to
> >> the HA
> >> >> > > > > backend
> >> >> > > > > > > > API.
> >> >> > > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > It's fine unless you intend to override the resource
> >> requirements a
> >> >> > > > > few
> >> >> > > > > > > > times per second.
> >> >> > > > > > > >
> >> >> > > > > > > > *@Shammon*
> >> >> > > > > > > >
> >> >> > > > > > > > How about adding some more information such as vertex
> type
> >> >> > > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > Since it was intended as a "debug" endpoint, it makes
> >> complete sense!
> >> >> > > > > > > >
> >> >> > > > > > > >  For sql jobs, we always use a unified parallelism for
> >> most vertices.
> >> >> > > > > > Can
> >> >> > > > > > > > > we provide them with a more convenient setting method
> >> instead of
> >> >> > > > > each
> >> >> > > > > > > > one?
> >> >> > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > I completely feel with this. The main thoughts when
> >> designing the API
> >> >> > > > > > > were:
> >> >> > > > > > > > - We want to keep it clean and easy to understand.
> >> >> > > > > > > > - Global parallelism can be modeled using per-vertex
> >> parallelism but
> >> >> > > > > > not
> >> >> > > > > > > > the other way around.
> >> >> > > > > > > > - The API will be used by external tooling (operator,
> >> auto scaler).
> >> >> > > > > > > >
> >> >> > > > > > > > Can you elaborate more about how you'd intend to use the
> >> >> > > > > > > > endpoint? I think we can ultimately introduce a way of
> >> >> > > > > > > > re-declaring "per-vertex defaults," but I'd like to
> >> >> > > > > > > > understand the use case a bit more first.
> >> >> > > > > > > >
> >> >> > > > > > > > *@Weijie*
> >> >> > > > > > > >
> >> >> > > > > > > > What is the default value here (based on what
> >> configuration), or just
> >> >> > > > > > > > > infinite?
> >> >> > > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > Currently, for the lower bound, it's always one, and
> for
> >> the upper
> >> >> > > > > > bound,
> >> >> > > > > > > > it's either parallelism (if defined) or the
> >> maxParallelism of the
> >> >> > > > > > vertex
> >> >> > > > > > > in
> >> >> > > > > > > > JobGraph. This question might be another signal for
> >> making the
> >> >> > > > > defaults
> >> >> > > > > > > > explicit (see the answer to Shammon's question above).
> >> >> > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > Thanks, everyone, for your initial thoughts!
> >> >> > > > > > > >
> >> >> > > > > > > > Best,
> >> >> > > > > > > > D.
> >> >> > > > > > > >
> >> >> > > > > > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <
> >> guoweijiereswqa@gmail.com
> >> >> > > > > >
> >> >> > > > > > > > wrote:
> >> >> > > > > > > >
> >> >> > > > > > > > > Thanks David for driving this. This is a very
> valuable
> >> work,
> >> >> > > > > > especially
> >> >> > > > > > > > for
> >> >> > > > > > > > > cloud native environment.
> >> >> > > > > > > > >
> >> >> > > > > > > > > >> How about adding some more information such as
> >> vertex type
> >> >> > > > > > > > > (SOURCE/MAP/JOIN and .etc) in the response of `get
> jobs
> >> >> > > > > > > > > resource-requirements`? For users, only vertex-id may
> >> be difficult
> >> >> > > > > to
> >> >> > > > > > > > > understand.
> >> >> > > > > > > > >
> >> >> > > > > > > > > +1 for this suggestion, including jobvertex's name in
> >> the response
> >> >> > > > > > body
> >> >> > > > > > > > is
> >> >> > > > > > > > > more
> >> >> > > > > > > > > user-friendly.
> >> >> > > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > > > > I saw this sentence in FLIP: "Setting the upper bound
> >> to -1 will
> >> >> > > > > > reset
> >> >> > > > > > > > the
> >> >> > > > > > > > > value to the default setting."  What is the default
> >> value here
> >> >> > > > > (based
> >> >> > > > > > > on
> >> >> > > > > > > > > what configuration), or just infinite?
> >> >> > > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > > > > Best regards,
> >> >> > > > > > > > >
> >> >> > > > > > > > > Weijie
> >> >> > > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > > > > Shammon FY <zj...@gmail.com> 于2023年2月6日周一 18:06写道:
> >> >> > > > > > > > >
> >> >> > > > > > > > > > Hi David
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > Thanks for initiating this discussion. I think
> >> >> > > > > > > > > > declaring job resource requirements by REST API is
> >> >> > > > > > > > > > very valuable. I just left some comments, as follows:
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > 1) How about adding some more information such as
> >> vertex type
> >> >> > > > > > > > > > (SOURCE/MAP/JOIN and .etc) in the response of `get
> >> jobs
> >> >> > > > > > > > > > resource-requirements`? For users, only vertex-id
> may
> >> be
> >> >> > > > > difficult
> >> >> > > > > > to
> >> >> > > > > > > > > > understand.
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > 2) For sql jobs, we always use a unified
> parallelism
> >> for most
> >> >> > > > > > > vertices.
> >> >> > > > > > > > > Can
> >> >> > > > > > > > > > we provide them with a more convenient setting
> method
> >> instead of
> >> >> > > > > > each
> >> >> > > > > > > > > one?
> >> >> > > > > > > > > >
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > Best,
> >> >> > > > > > > > > > Shammon
> >> >> > > > > > > > > >
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <
> >> >> > > > > > matthias.pohl@aiven.io
> >> >> > > > > > > > > > .invalid>
> >> >> > > > > > > > > > wrote:
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > > Thanks David for creating this FLIP. It sounds
> >> promising and
> >> >> > > > > > useful
> >> >> > > > > > > > to
> >> >> > > > > > > > > > > have. Here are some thoughts from my side (some
> of
> >> them might
> >> >> > > > > be
> >> >> > > > > > > > > rather a
> >> >> > > > > > > > > > > follow-up and not necessarily part of this FLIP):
> >> >> > > > > > > > > > > - I'm wondering whether it makes sense to add
> some
> >> kind of
> >> >> > > > > > resource
> >> >> > > > > > > > ID
> >> >> > > > > > > > > to
> >> >> > > > > > > > > > > the REST API. This would give Flink a tool to
> >> verify the PATCH
> >> >> > > > > > > > request
> >> >> > > > > > > > > of
> >> >> > > > > > > > > > > the external system in a compare-and-set kind of
> >> manner. AFAIU,
> >> >> > > > > > the
> >> >> > > > > > > > > > process
> >> >> > > > > > > > > > > requires the external system to retrieve the
> >> resource
> >> >> > > > > > requirements
> >> >> > > > > > > > > first
> >> >> > > > > > > > > > > (to retrieve the vertex IDs). A resource ID <ABC>
> >> would be sent
> >> >> > > > > > > along
> >> >> > > > > > > > > as
> >> >> > > > > > > > > > a
> >> >> > > > > > > > > > > unique identifier for the provided setup. It's
> >> essentially the
> >> >> > > > > > > > version
> >> >> > > > > > > > > ID
> >> >> > > > > > > > > > > of the currently deployed resource requirement
> >> configuration.
> >> >> > > > > > Flink
> >> >> > > > > > > > > > doesn't
> >> >> > > > > > > > > > > know whether the external system would use the
> >> provided
> >> >> > > > > > information
> >> >> > > > > > > > in
> >> >> > > > > > > > > > some
> >> >> > > > > > > > > > > way to derive a new set of resource requirements
> >> for this job.
> >> >> > > > > > The
> >> >> > > > > > > > > > > subsequent PATCH request with updated resource
> >> requirements
> >> >> > > > > would
> >> >> > > > > > > > > include
> >> >> > > > > > > > > > > the previously retrieved resource ID <ABC>. The
> >> PATCH call
> >> >> > > > > would
> >> >> > > > > > > fail
> >> >> > > > > > > > > if
> >> >> > > > > > > > > > > there was a concurrent PATCH call in between
> >> indicating to the
> >> >> > > > > > > > external
> >> >> > > > > > > > > > > system that the resource requirements were
> >> concurrently
> >> >> > > > > updated.
> >> >> > > > > > > > > > > - How often do we allow resource requirements to
> be
> >> changed?
> >> >> > > > > That
> >> >> > > > > > > > > > question
> >> >> > > > > > > > > > > might make my previous comment on the resource ID
> >> obsolete
> >> >> > > > > > because
> >> >> > > > > > > we
> >> >> > > > > > > > > > could
> >> >> > > > > > > > > > > just make any PATCH call fail if there was a
> >> resource
> >> >> > > > > requirement
> >> >> > > > > > > > > update
> >> >> > > > > > > > > > > within a certain time frame before the request.
> But
> >> such a time
> >> >> > > > > > > > period
> >> >> > > > > > > > > is
> >> >> > > > > > > > > > > something we might want to make configurable
> then,
> >> I guess.
> >> >> > > > > > > > > > > - Versioning the JobGraph in the JobGraphStore
> >> rather than
> >> >> > > > > > > > overwriting
> >> >> > > > > > > > > it
> >> >> > > > > > > > > > > might be an idea. This would enable us to provide
> >> resource
> >> >> > > > > > > > requirement
> >> >> > > > > > > > > > > changes in the UI or through the REST API. It is
> >> related to a
> >> >> > > > > > > problem
> >> >> > > > > > > > > > > around keeping track of the exception history
> >> within the
> >> >> > > > > > > > > > AdaptiveScheduler
> >> >> > > > > > > > > > > and also having to consider multiple versions of
> a
> >> JobGraph.
> >> >> > > > > But
> >> >> > > > > > > for
> >> >> > > > > > > > > that
> >> >> > > > > > > > > > > one, we use the ExecutionGraphInfoStore right
> now.
> >> >> > > > > > > > > > > - Updating the JobGraph in the JobGraphStore
> makes
> >> sense. I'm
> >> >> > > > > > just
> >> >> > > > > > > > > > > wondering whether we bundle two things together
> >> that are
> >> >> > > > > actually
> >> >> > > > > > > > > > separate:
> >> >> > > > > > > > > > > The business logic and the execution
> configuration
> >> (the
> >> >> > > > > resource
> >> >> > > > > > > > > > > requirements). I'm aware that this is not a flaw
> of
> >> the current
> >> >> > > > > > > FLIP
> >> >> > > > > > > > > but
> >> >> > > > > > > > > > > rather something that was not necessary to
> address
> >> in the past
> >> >> > > > > > > > because
> >> >> > > > > > > > > > the
> >> >> > > > > > > > > > > JobGraph was kind of static. I don't remember
> >> whether that was
> >> >> > > > > > > > already
> >> >> > > > > > > > > > > discussed while working on the AdaptiveScheduler
> >> for FLIP-160
> >> >> > > > > > [1].
> >> >> > > > > > > > > Maybe,
> >> >> > > > > > > > > > > I'm missing some functionality here that requires
> >> us to have
> >> >> > > > > > > > everything
> >> >> > > > > > > > > > in
> >> >> > > > > > > > > > > one place. But it feels like updating the entire
> >> JobGraph which
> >> >> > > > > > > could
> >> >> > > > > > > > > be
> >> >> > > > > > > > > > > actually a "config change" is not reasonable.
> >> ...also
> >> >> > > > > considering
> >> >> > > > > > > the
> >> >> > > > > > > > > > > amount of data that can be stored in a
> >> ConfigMap/ZooKeeper node
> >> >> > > > > > if
> >> >> > > > > > > > > > > versioning the resource requirement change as
> >> proposed in my
> >> >> > > > > > > previous
> >> >> > > > > > > > > > item
> >> >> > > > > > > > > > > is an option for us.
> >> >> > > > > > > > > > > - Updating the JobGraphStore means adding more
> >> requests to the
> >> >> > > > > HA
> >> >> > > > > > > > > backend
> >> >> > > > > > > > > > > API. There were some concerns shared in the
> >> discussion thread
> >> >> > > > > [2]
> >> >> > > > > > > for
> >> >> > > > > > > > > > > FLIP-270 [3] on pressuring the k8s API server in
> >> the past with
> >> >> > > > > > too
> >> >> > > > > > > > many
> >> >> > > > > > > > > > > calls. Eventhough, it's more likely to be caused
> by
> >> >> > > > > > checkpointing,
> >> >> > > > > > > I
> >> >> > > > > > > > > > still
> >> >> > > > > > > > > > > wanted to bring it up. We're working on a
> >> standardized
> >> >> > > > > > performance
> >> >> > > > > > > > test
> >> >> > > > > > > > > > to
> >> >> > > > > > > > > > > prepare going forward with FLIP-270 [3] right
> now.
> >> >> > > > > > > > > > >
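The compare-and-set / resource ID idea raised in the first item above can be illustrated with a toy store; the class and method names are invented for the sketch and are not Flink code:

```python
class RequirementsStore:
    """Toy optimistic-concurrency store: a GET returns the current
    resource requirements together with a version (the "resource ID"),
    and an update must echo that version back to succeed."""

    def __init__(self, requirements: dict):
        self.version = 0
        self.requirements = dict(requirements)

    def get(self):
        # Return the version alongside a snapshot of the requirements.
        return self.version, dict(self.requirements)

    def put(self, expected_version: int, new_requirements: dict) -> int:
        if expected_version != self.version:
            # A concurrent writer updated the requirements in between.
            raise RuntimeError("concurrent update detected")
        self.requirements = dict(new_requirements)
        self.version += 1
        return self.version


store = RequirementsStore({"v1": {"upperBound": 4}})
version, current = store.get()
store.put(version, {"v1": {"upperBound": 8}})  # succeeds, bumps the version
```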
> >> >> > > > > > > > > > > Best,
> >> >> > > > > > > > > > > Matthias
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > > [1]
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > >
> >> >> > > > > >
> >> >> > > > >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> >> >> > > > > > > > > > > [2]
> >> >> > > > > > >
> >> https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> >> >> > > > > > > > > > > [3]
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > >
> >> >> > > > > >
> >> >> > > > >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <jam.gzczy@gmail.com> wrote:
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > > > Hi David:
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > Thank you for driving this FLIP, which helps reduce Flink downtime
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > For this FLIP, I would like to share a few ideas:
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > >    - when the number of "slots" is insufficient, can we stop users from
> >> >> > > > > > > > > > > >    rescaling, or throw something to tell the user "not enough available
> >> >> > > > > > > > > > > >    slots to upgrade, please check your available slots"? Or we could
> >> >> > > > > > > > > > > >    have a request switch (true/false) to allow this behavior
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > >    - when the user upgrades job-vertex parallelism, I want to have an
> >> >> > > > > > > > > > > >    interface to query the current status of the parallelism update, so
> >> >> > > > > > > > > > > >    that the user or program can understand the current status
> >> >> > > > > > > > > > > >    - such an interface for querying the parallelism-update status would
> >> >> > > > > > > > > > > >    also help management tools like the *[1] Flink K8S Operator*
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > {
> >> >> > > > > > > > > > > >   "status": "Failed",
> >> >> > > > > > > > > > > >   "reason": "not enough available slots to upgrade, please check your available slots"
> >> >> > > > > > > > > > > > }
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > >    - *Pending*: the job has joined the upgrade queue and will be
> >> >> > > > > > > > > > > >    updated later
> >> >> > > > > > > > > > > >    - *Rescaling*: the job is rescaling; wait for it to finish
> >> >> > > > > > > > > > > >    - *Finished*: the upgrade is done
> >> >> > > > > > > > > > > >    - *Failed*: something went wrong, so the job cannot be upgraded
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > I would like to add the above content to the FLIP; what do you think?
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > >    1. https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > David Morávek <dm...@apache.org> wrote on Fri, Feb 3, 2023 at 16:42:
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > > Hi everyone,
> >> >> > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > This FLIP [1] introduces a new REST API for declaring resource
> >> >> > > > > > > > > > > > > requirements for the Adaptive Scheduler. There seems to be a clear
> >> >> > > > > > > > > > > > > need for this API based on the discussion on the "Reworking the
> >> >> > > > > > > > > > > > > Rescale API" [2] thread.
> >> >> > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > Before we get started, this work is heavily based on the prototype
> >> >> > > > > > > > > > > > > [3] created by Till Rohrmann, and the FLIP is being published with
> >> >> > > > > > > > > > > > > his consent. Big shoutout to him!
> >> >> > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > Last but not least, thanks to Chesnay and Roman for the initial
> >> >> > > > > > > > > > > > > reviews and discussions.
> >> >> > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > The best start would be watching a short demo [4] that I've
> >> >> > > > > > > > > > > > > recorded, which illustrates newly added capabilities (rescaling the
> >> >> > > > > > > > > > > > > running job, handing back resources to the RM, and session cluster
> >> >> > > > > > > > > > > > > support).
> >> >> > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > The intuition behind the FLIP is being able to define resource
> >> >> > > > > > > > > > > > > requirements ("resource boundaries") externally that the
> >> >> > > > > > > > > > > > > AdaptiveScheduler can navigate within. This is a building block for
> >> >> > > > > > > > > > > > > higher-level efforts such as an external Autoscaler. The natural
> >> >> > > > > > > > > > > > > extension of this work would be to allow specifying per-vertex
> >> >> > > > > > > > > > > > > ResourceProfiles.
> >> >> > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > Looking forward to your thoughts; any feedback is appreciated!
> >> >> > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> >> >> > > > > > > > > > > > > [2] https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> >> >> > > > > > > > > > > > > [3] https://github.com/tillrohrmann/flink/tree/autoscaling
> >> >> > > > > > > > > > > > > [4] https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
> >> >> > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > Best,
> >> >> > > > > > > > > > > > > D.
> >> >> > > > > > > > > > > > >
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > --
> >> >> > > > > > > > > > > > Best
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > ConradJam
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > >
> >> >> > > > > >
> >> >> > > > >
> >>
>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by John Roesler <vv...@apache.org>.
Thanks for the FLIP, David!

I just had one small question. IIUC, the REST API PUT request will go through the new DispatcherGateway method to be handled. Then, after validation, the dispatcher would call the new JobMasterGateway method to actually update the job. 

Which component will write the updated JobGraph? I just wanted to make sure it’s the JobMaster because if it were the dispatcher, there could be a race condition with the async JobMaster method.

Thanks!
-John 
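[Editor's note: a minimal sketch of the update path John describes, under the assumption that the JobMaster is the single writer of the requirements. The class and method names below are illustrative stand-ins, not the actual Flink interfaces.]

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: the Dispatcher only validates and forwards, while the
// JobMaster applies (and would persist) the update, avoiding a dispatcher-side
// race with the async JobMaster method.
class ResourceRequirementsFlow {

    static class JobMasterStub {
        // Per-vertex [lowerBound, upperBound] requirements.
        private final Map<String, int[]> requirements = new HashMap<>();

        // Single writer: applies the update; persisting it to the
        // JobGraph store would happen here in a real implementation.
        CompletableFuture<Void> updateResourceRequirements(Map<String, int[]> next) {
            requirements.clear();
            requirements.putAll(next);
            return CompletableFuture.completedFuture(null);
        }

        Map<String, int[]> current() {
            return requirements;
        }
    }

    static class DispatcherStub {
        private final JobMasterStub jobMaster;

        DispatcherStub(JobMasterStub jobMaster) {
            this.jobMaster = jobMaster;
        }

        // REST handler path: validate the bounds, then delegate.
        CompletableFuture<Void> handlePut(Map<String, int[]> body) {
            for (int[] bounds : body.values()) {
                if (bounds[0] < 1 || bounds[1] < bounds[0]) {
                    return CompletableFuture.failedFuture(
                            new IllegalArgumentException("invalid bounds"));
                }
            }
            return jobMaster.updateResourceRequirements(body);
        }
    }
}
```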

On Mon, Feb 20, 2023, at 07:34, Matthias Pohl wrote:
> Thanks for your clarifications, David. I don't have any additional major
> points to add. One thing about the FLIP: The RPC layer API for updating the
> JRR returns a future with a JRR? I don't see value in returning a JRR here
> since it's an idempotent operation? Wouldn't it be enough to return
> CompletableFuture<Void> here? Or am I missing something?
>
> Matthias
>
> On Mon, Feb 20, 2023 at 1:48 PM Maximilian Michels <mx...@apache.org> wrote:
>
>> Thanks David! If we could get the pre-allocation working as part of
>> the FLIP, that would be great.
>>
>> Concerning the downscale case, I agree this is a special case for the
>> (single-job) application mode where we could re-allocate slots in a
>> way that could leave entire task managers unoccupied which we would
>> then be able to release. The goal essentially is to reduce slot
>> fragmentation on scale down by packing the slots efficiently. The
>> easiest way to add this optimization when running in application mode
>> would be to drop as many task managers during the restart such that
>> NUM_REQUIRED_SLOTS >= NUM_AVAILABLE_SLOTS stays true. We can look into
>> this independently of the FLIP.
>>
>> Feel free to start the vote.
>>
>> -Max
>>
>> On Mon, Feb 20, 2023 at 9:10 AM David Morávek <dm...@apache.org> wrote:
>> >
>> > Hi everyone,
>> >
>> > Thanks for the feedback! I've updated the FLIP to use idempotent PUT API
>> instead of PATCH and to properly handle lower bound settings, to support
>> the "pre-allocation" of the resources.
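[Editor's note: a client-side sketch of calling an idempotent PUT endpoint like the one discussed here. The URL path and JSON field names are assumptions based on this thread, not a confirmed final API.]

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Builds (but does not send) a PUT request for re-declaring resource
// requirements. PUT (rather than PATCH) because re-sending the same full
// payload must be idempotent.
class PutRequirementsExample {

    static HttpRequest buildPut(String restAddress, String jobId, String jsonBody) {
        return HttpRequest.newBuilder()
                .uri(URI.create(restAddress + "/jobs/" + jobId + "/resource-requirements"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }
}
```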
>> >
>> > @Max
>> >
>> > > How hard would it be to address this issue in the FLIP?
>> >
>> > I've included this in the FLIP. It might not be too hard to implement
>> this in the end.
>> >
>> > > B) drop as many superfluous task managers as needed
>> >
>> > I've intentionally left this part out for now because this ultimately
>> needs to be the responsibility of the Resource Manager. After all, in the
>> Session Cluster scenario, the Scheduler doesn't have the bigger picture of
>> other tasks of other jobs running on those TMs. This will most likely be a
>> topic for another FLIP.
>> >
>> > WDYT? If there are no other questions or concerns, I'd like to start the
>> vote on Wednesday.
>> >
>> > Best,
>> > D.
>> >
>> > On Wed, Feb 15, 2023 at 3:34 PM Maximilian Michels <mx...@apache.org>
>> wrote:
>> >>
>> >> I missed that the FLIP states:
>> >>
>> >> > Currently, even though we’d expose the lower bound for clarity and
>> API completeness, we won’t allow setting it to any other value than one
>> until we have full support throughout the stack.
>> >>
>> >> How hard would it be to address this issue in the FLIP?
>> >>
>> >> There is not much value to offer setting a lower bound which won't be
>> >> respected / throw an error when it is set. If we had support for a
>> >> lower bound, we could enforce a resource contract externally via
>> >> setting lowerBound == upperBound. That ties back to the Rescale API
>> >> discussion we had. I want to better understand what the major concerns
>> >> would be around allowing this.
>> >>
>> >> Just to outline how I imagine the logic to work:
>> >>
>> >> A) The resource constraints are already met => Nothing changes
>> >> B) More resources available than required => Cancel the job, drop as
>> >> many superfluous task managers as needed, restart the job
>> >> C) Less resources available than required => Acquire new task
>> >> managers, wait for them to register, cancel and restart the job
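[Editor's note: the A/B/C outline above, sketched against a per-job [lower, upper] slot range. This is a simplification for illustration; the real scheduler reasons per vertex and handles the restart mechanics, which are not shown here.]

```java
// Decide what to do given declared slot bounds and currently available slots.
class RescaleDecision {

    enum Action { NOTHING, RELEASE_AND_RESTART, ACQUIRE_AND_RESTART }

    static Action decide(int lowerBound, int upperBound, int availableSlots) {
        if (availableSlots >= lowerBound && availableSlots <= upperBound) {
            return Action.NOTHING; // A) constraints already met
        }
        if (availableSlots > upperBound) {
            return Action.RELEASE_AND_RESTART; // B) drop superfluous TMs, restart
        }
        return Action.ACQUIRE_AND_RESTART; // C) acquire TMs, wait, restart
    }
}
```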
>> >>
>> >> I'm open to helping out with the implementation.
>> >>
>> >> -Max
>> >>
>> >> On Mon, Feb 13, 2023 at 7:45 PM Maximilian Michels <mx...@apache.org>
>> wrote:
>> >> >
>> >> > Based on further discussion I had with Chesnay on this PR [1], I think
>> >> > jobs would currently go into a restarting state after the resource
>> >> > requirements have changed. This wouldn't achieve what we had in mind,
>> >> > i.e. sticking to the old resource requirements until enough slots are
>> >> > available to fulfil the new resource requirements. So this may not be
>> >> > 100% what we need but it could be extended to do what we want.
>> >> >
>> >> > -Max
>> >> >
>> >> > [1] https://github.com/apache/flink/pull/21908#discussion_r1104792362
>> >> >
>> >> > On Mon, Feb 13, 2023 at 7:16 PM Maximilian Michels <mx...@apache.org>
>> wrote:
>> >> > >
>> >> > > Hi David,
>> >> > >
>> >> > > This is awesome! Great writeup and demo. This is pretty much what we
>> >> > > need for the autoscaler as part of the Flink Kubernetes operator
>> [1].
>> >> > > Scaling Flink jobs effectively is hard but fortunately we have
>> solved
>> >> > > the issue as part of the Flink Kubernetes operator. The only
>> critical
>> >> > > piece we are missing is a better way to execute scaling decisions,
>> as
>> >> > > discussed in [2].
>> >> > >
>> >> > > Looking at your proposal, we would set lowerBound == upperBound for
>> >> > > the parallelism because we want to fully determine the parallelism
>> >> > > externally based on the scaling metrics. Does that sound right?
>> >> > >
>> >> > > What is the timeline for these changes? Is there a JIRA?
>> >> > >
>> >> > > Cheers,
>> >> > > Max
>> >> > >
>> >> > > [1]
>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
>> >> > > [2]
>> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
>> >> > >
>> >> > > On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu <xi...@gmail.com>
>> wrote:
>> >> > > >
>> >> > > > Hi David,
>> >> > > >
> >> >> > > > Thanks for your reply. I think your response totally makes sense. This
> >> >> > > > FLIP targets declaring required resources to the ResourceManager instead
> >> >> > > > of using the ResourceManager to add/remove TMs directly.
>> >> > > >
>> >> > > > Best,
>> >> > > > Xiangyu
>> >> > > >
>> >> > > >
>> >> > > >
> >> >> > > > David Morávek <da...@gmail.com> wrote on Mon, Feb 13, 2023 at 15:46:
>> >> > > >
>> >> > > > > Hi everyone,
>> >> > > > >
>> >> > > > > @Shammon
>> >> > > > >
>> >> > > > > I'm not entirely sure what "config file" you're referring to.
>> You can, of
>> >> > > > > course, override the default parallelism in "flink-conf.yaml",
>> but for
>> >> > > > > sinks and sources, the parallelism needs to be tweaked on the
>> connector
>> >> > > > > level ("WITH" statement).
>> >> > > > >
>> >> > > > > This is something that should be achieved with tooling around
>> Flink. We
>> >> > > > > want to provide an API on the lowest level that generalizes
>> well. Achieving
>> >> > > > > what you're describing should be straightforward with this API.
>> >> > > > >
>> >> > > > > @Xiangyu
>> >> > > > >
>> >> > > > > Is it possible for this REST API to declare TM resources in the
>> future?
>> >> > > > >
>> >> > > > >
>> >> > > > > Would you like to add/remove TMs if you use an active Resource
>> Manager?
>> >> > > > > This would be out of the scope of this effort since it targets
>> the
>> >> > > > > scheduler component only (we make no assumptions about the used
>> Resource
>> >> > > > > Manager). Also, the AdaptiveScheduler is only intended to be
>> used for
>> >> > > > > Streaming.
>> >> > > > >
>> >> > > > >  And for streaming jobs, I'm wondering if there is any
>> situation we need to
>> >> > > > > > rescale the TM resources of a flink cluster at first and then
>> the
>> >> > > > > adaptive
>> >> > > > > > scheduler will rescale the per-vertex ResourceProfiles
>> accordingly.
>> >> > > > > >
>> >> > > > >
>> >> > > > > We plan on adding support for the ResourceProfiles (dynamic slot
>> >> > > > > allocation) as the next step. Again we won't make any
>> assumptions about the
>> >> > > > > used Resource Manager. In other words, this effort ends by
>> declaring
>> >> > > > > desired resources to the Resource Manager.
>> >> > > > >
>> >> > > > > Does that make sense?
>> >> > > > >
>> >> > > > > @Matthias
>> >> > > > >
>> >> > > > > We've done another pass on the proposed API and currently lean
>> towards
>> >> > > > > having an idempotent PUT API.
>> >> > > > > - We don't care too much about multiple writers' scenarios in
>> terms of who
>> >> > > > > can write an authoritative payload; this is up to the user of
>> the API to
>> >> > > > > figure out
>> >> > > > > - It's indeed tricky to achieve atomicity with PATCH API;
>> switching to PUT
>> >> > > > > API seems to do the trick
>> >> > > > > - We won't allow partial "payloads" anymore, meaning you need
>> to define
>> >> > > > > requirements for all vertices in the JobGraph; This is
>> completely fine for
>> >> > > > > the programmatic workflows. For DEBUG / DEMO purposes, you can
>> use the GET
>> >> > > > > endpoint and tweak the response to avoid writing the whole
>> payload by hand.
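[Editor's note: a sketch of the GET-then-tweak workflow mentioned above. Since partial payloads are not allowed, a caller takes the full requirements returned by GET, overrides the vertices it cares about, and PUTs the whole map back. The map shape (vertex id -> [lowerBound, upperBound]) is an assumption for illustration.]

```java
import java.util.HashMap;
import java.util.Map;

class RequirementsTweak {

    // Produce a full PUT payload from a GET response, overriding one vertex
    // and keeping all other vertices unchanged.
    static Map<String, int[]> withParallelism(
            Map<String, int[]> fromGet, String vertexId, int lower, int upper) {
        Map<String, int[]> full = new HashMap<>(fromGet);
        full.put(vertexId, new int[] {lower, upper});
        return full;
    }
}
```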
>> >> > > > >
>> >> > > > > WDYT?
>> >> > > > >
>> >> > > > >
>> >> > > > > Best,
>> >> > > > > D.
>> >> > > > >
>> >> > > > > On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <
>> xiangyu0xf@gmail.com>
>> >> > > > > wrote:
>> >> > > > >
>> >> > > > > > Hi David,
>> >> > > > > >
> >> >> > > > > > Thanks for creating this FLIP. I think this work is very useful,
>> >> > > > > > especially in autoscaling scenario.  I would like to share
>> some questions
>> >> > > > > > from my view.
>> >> > > > > >
>> >> > > > > > 1, Is it possible for this REST API to declare TM resources
>> in the
>> >> > > > > future?
>> >> > > > > > I'm asking because we are building the autoscaling feature
>> for Flink OLAP
>> >> > > > > > Session Cluster in ByteDance. We need to rescale the
>> cluster's resource
>> >> > > > > on
> >> >> > > > > > TM level instead of Job level. It would be very helpful if we had a
> >> >> > > > > > REST API for our external Autoscaling service to use.
>> >> > > > > >
>> >> > > > > > 2, And for streaming jobs, I'm wondering if there is any
>> situation we
>> >> > > > > need
>> >> > > > > > to rescale the TM resources of a flink cluster at first and
>> then the
>> >> > > > > > adaptive scheduler will rescale the per-vertex
>> ResourceProfiles
>> >> > > > > > accordingly.
>> >> > > > > >
>> >> > > > > > best.
>> >> > > > > > Xiangyu
>> >> > > > > >
> >> >> > > > > > Shammon FY <zj...@gmail.com> wrote on Thu, Feb 9, 2023 at 11:31:
>> >> > > > > >
>> >> > > > > > > Hi David
>> >> > > > > > >
>> >> > > > > > > Thanks for your answer.
>> >> > > > > > >
>> >> > > > > > > > Can you elaborate more about how you'd intend to use the
>> endpoint? I
>> >> > > > > > > think we can ultimately introduce a way of re-declaring
>> "per-vertex
>> >> > > > > > > defaults," but I'd like to understand the use case bit more
>> first.
>> >> > > > > > >
>> >> > > > > > > For this issue, I mainly consider the consistency of user
>> configuration
>> >> > > > > > and
>> >> > > > > > > job runtime. For sql jobs, users usually set specific
>> parallelism for
>> >> > > > > > > source and sink, and set a global parallelism for other
>> operators.
>> >> > > > > These
>> >> > > > > > > config items are stored in a config file. For some
>> high-priority jobs,
>> >> > > > > > > users may want to manage them manually.
>> >> > > > > > > 1. When users need to scale the parallelism, they should
>> update the
>> >> > > > > > config
>> >> > > > > > > file and restart flink job, which may take a long time.
>> >> > > > > > > 2. After providing the REST API, users can just send a
>> request to the
>> >> > > > > job
>> >> > > > > > > via REST API quickly after updating the config file.
>> >> > > > > > > The configuration in the running job and config file should
>> be the
>> >> > > > > same.
>> >> > > > > > > What do you think of this?
>> >> > > > > > >
>> >> > > > > > > best.
>> >> > > > > > > Shammon
>> >> > > > > > >
>> >> > > > > > >
>> >> > > > > > >
>> >> > > > > > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <
>> david.moravek@gmail.com>
>> >> > > > > > > wrote:
>> >> > > > > > >
>> >> > > > > > > > Hi everyone,
>> >> > > > > > > >
>> >> > > > > > > > Let's try to answer the questions one by one.
>> >> > > > > > > >
>> >> > > > > > > > *@ConradJam*
>> >> > > > > > > >
>> >> > > > > > > > when the number of "slots" is insufficient, can we can
>> stop users
>> >> > > > > > > rescaling
>> >> > > > > > > > > or throw something to tell user "less avaliable slots
>> to upgrade,
>> >> > > > > > > please
>> >> > > > > > > > > checkout your alivalbe slots" ?
>> >> > > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > The main property of AdaptiveScheduler is that it can
>> adapt to
>> >> > > > > > "available
>> >> > > > > > > > resources," which means you're still able to make
>> progress even
>> >> > > > > though
>> >> > > > > > > you
>> >> > > > > > > > didn't get all the slots you've asked for. Let's break
>> down the pros
>> >> > > > > > and
>> >> > > > > > > > cons of this property.
>> >> > > > > > > >
>> >> > > > > > > > - (plus) If you lose a TM for some reason, you can still
>> recover even
>> >> > > > > > if
>> >> > > > > > > it
>> >> > > > > > > > doesn't come back. We still need to give it some time to
>> eliminate
>> >> > > > > > > > unnecessary rescaling, which can be controlled by setting
>> >> > > > > > > > "resource-stabilization-timeout."
>> >> > > > > > > > - (plus) The resources can arrive with a significant
>> delay. For
>> >> > > > > > example,
>> >> > > > > > > > you're unable to spawn enough TMs on time because you've
>> run out of
>> >> > > > > > > > resources in your k8s cluster, and you need to wait for
>> the cluster
>> >> > > > > > auto
>> >> > > > > > > > scaler to kick in and add new nodes to the cluster. In
>> this scenario,
>> >> > > > > > > > you'll be able to start making progress faster, at the
>> cost of
>> >> > > > > multiple
>> >> > > > > > > > rescalings (once the remaining resources arrive).
>> >> > > > > > > > - (plus) This plays well with the declarative manner of
>> today's
>> >> > > > > > > > infrastructure. For example, you tell k8s that you need
>> 10 TMs, and
>> >> > > > > > > you'll
>> >> > > > > > > > eventually get them.
>> >> > > > > > > > - (minus) In the case of large state jobs, the cost of
>> multiple
>> >> > > > > > > rescalings
>> >> > > > > > > > might outweigh the above.
>> >> > > > > > > >
>> >> > > > > > > > We've already touched on the solution to this problem on
>> the FLIP.
>> >> > > > > > Please
>> >> > > > > > > > notice the parallelism knob being a range with a lower
>> and upper
>> >> > > > > bound.
>> >> > > > > > > > Setting both the lower and upper bound to the same value
>> could give
>> >> > > > > the
>> >> > > > > > > > behavior you're describing at the cost of giving up some
>> properties
>> >> > > > > > that
>> >> > > > > > > AS
>> >> > > > > > > > gives you (you'd be falling back to the
>> DefaultScheduler's behavior).
>> >> > > > > > > >
>> >> > > > > > > > when user upgrade job-vertx-parallelism . I want to have
>> an interface
>> >> > > > > > to
>> >> > > > > > > > > query the current update parallel execution status, so
>> that the
>> >> > > > > user
>> >> > > > > > or
>> >> > > > > > > > > program can understand the current status
>> >> > > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > This is a misunderstanding. We're not introducing the
>> RESCALE
>> >> > > > > endpoint.
>> >> > > > > > > > This endpoint allows you to re-declare the resources
>> needed to run
>> >> > > > > the
>> >> > > > > > > job.
>> >> > > > > > > > Once you reach the desired resources (you get more
>> resources than the
>> >> > > > > > > lower
>> >> > > > > > > > bound defines), your job will run.
>> >> > > > > > > >
>> >> > > > > > > > We can expose a similar endpoint to "resource
>> requirements" to give
>> >> > > > > you
>> >> > > > > > > an
>> >> > > > > > > > overview of the resources the vertices already have. You
>> can already
>> >> > > > > > get
>> >> > > > > > > > this from the REST API, so exposing this in yet another
>> way should be
>> >> > > > > > > > considered carefully.
>> >> > > > > > > >
>> >> > > > > > > > *@Matthias*
>> >> > > > > > > >
>> >> > > > > > > > I'm wondering whether it makes sense to add some kind of
>> resource ID
>> >> > > > > to
>> >> > > > > > > the
>> >> > > > > > > > > REST API.
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > That's a good question. I want to think about that and
>> get back to
>> >> > > > > the
>> >> > > > > > > > question later. My main struggle when thinking about this
>> is, "if
>> >> > > > > this
>> >> > > > > > > > would be an idempotent POST endpoint," would it be any
>> different?
>> >> > > > > > > >
>> >> > > > > > > > How often do we allow resource requirements to be changed?
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > There shall be no rate limiting on the FLINK side. If
>> this is
>> >> > > > > something
>> >> > > > > > > > your environment needs, you can achieve it on a different
>> layer ("we
>> >> > > > > > > can't
>> >> > > > > > > > have FLINK to do everything").
>> >> > > > > > > >
>> >> > > > > > > > Versioning the JobGraph in the JobGraphStore rather than
>> overwriting
>> >> > > > > it
>> >> > > > > > > > > might be an idea.
>> >> > > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > This sounds interesting since it would be closer to the
>> JobGraph
>> >> > > > > being
>> >> > > > > > > > immutable. The main problem I see here is that this would
>> introduce a
>> >> > > > > > > > BW-incompatible change so it might be a topic for
>> follow-up FLIP.
>> >> > > > > > > >
>> >> > > > > > > > I'm just wondering whether we bundle two things together
>> that are
>> >> > > > > > > actually
>> >> > > > > > > > > separate
>> >> > > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > Yup, this is how we think about it as well. The main
>> question is,
>> >> > > > > "who
>> >> > > > > > > > should be responsible for bookkeeping 1) the JobGraph and
>> 2) the
>> >> > > > > > > > JobResourceRequirements". The JobMaster would be the
>> right place for
>> >> > > > > > > both,
>> >> > > > > > > > but it's currently not the case, and we're tightly
>> coupling the
>> >> > > > > > > dispatcher
>> >> > > > > > > > with the JobMaster.
>> >> > > > > > > >
>> >> > > > > > > > Initially, we tried to introduce a separate HA component
>> in JobMaster
>> >> > > > > > for
>> >> > > > > > > > bookkeeping the JobResourceRequirements, but that proved
>> to be a more
>> >> > > > > > > > significant effort adding additional mess to the already
>> messy HA
>> >> > > > > > > > ecosystem. Another approach we've discussed was mutating
>> the JobGraph
>> >> > > > > > and
>> >> > > > > > > > setting JRR into the JobGraph structure itself.
>> >> > > > > > > >
>> >> > > > > > > > The middle ground for keeping this effort reasonably
>> sized and not
>> >> > > > > > > > violating "we want to keep JG immutable" too much is
>> keeping the
>> >> > > > > > > > JobResourceRequirements separate as an internal config
>> option in
>> >> > > > > > > JobGraph's
>> >> > > > > > > > configuration.
>> >> > > > > > > >
>> >> > > > > > > > We ultimately need to rethink the tight coupling of
>> Dispatcher and
>> >> > > > > > > > JobMaster, but it needs to be a separate effort.
>> >> > > > > > > >
>> >> > > > > > > > ...also considering the amount of data that can be stored
>> in a
>> >> > > > > > > > > ConfigMap/ZooKeeper node if versioning the resource
>> requirement
>> >> > > > > > change
>> >> > > > > > > as
>> >> > > > > > > > > proposed in my previous item is an option for us.
>> >> > > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > AFAIK we're only storing pointers to the S3 objects in HA
>> metadata,
>> >> > > > > so
>> >> > > > > > we
>> >> > > > > > > > should be okay with having larger structures for now.
>> >> > > > > > > >
>> >> > > > > > > > Updating the JobGraphStore means adding more requests to
>> the HA
>> >> > > > > backend
>> >> > > > > > > > API.
>> >> > > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > It's fine unless you intend to override the resource
>> requirements a
>> >> > > > > few
>> >> > > > > > > > times per second.
>> >> > > > > > > >
>> >> > > > > > > > *@Shammon*
>> >> > > > > > > >
>> >> > > > > > > > How about adding some more information such as vertex type
>> >> > > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > Since it was intended as a "debug" endpoint, it makes
>> complete sense!
>> >> > > > > > > >
>> >> > > > > > > >  For sql jobs, we always use a unified parallelism for
>> most vertices.
>> >> > > > > > Can
>> >> > > > > > > > > we provide them with a more convenient setting method
>> instead of
>> >> > > > > each
>> >> > > > > > > > one?
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > I completely feel with this. The main thoughts when
>> designing the API
>> >> > > > > > > were:
>> >> > > > > > > > - We want to keep it clean and easy to understand.
>> >> > > > > > > > - Global parallelism can be modeled using per-vertex
>> parallelism but
>> >> > > > > > not
>> >> > > > > > > > the other way around.
>> >> > > > > > > > - The API will be used by external tooling (operator,
>> auto scaler).
>> >> > > > > > > >
>> >> > > > > > > > Can you elaborate more about how you'd intend to use the
>> endpoint? I
>> >> > > > > > > think
>> >> > > > > > > > we can ultimately introduce a way of re-declaring
>> "per-vertex
>> >> > > > > > defaults,"
> >> >> > > > > > > > but I'd like to understand the use case a bit more first.
>> >> > > > > > > >
>> >> > > > > > > > *@Weijie*
>> >> > > > > > > >
>> >> > > > > > > > What is the default value here (based on what
>> configuration), or just
>> >> > > > > > > > > infinite?
>> >> > > > > > > > >
>> >> > > > > > > >
> >> >> > > > > > > > Currently, for the lower bound, it's always one, and for the upper
> >> >> > > > > > > > bound, it's either parallelism (if defined) or the maxParallelism of
> >> >> > > > > > > > the vertex in the JobGraph. This question might be another signal for
> >> >> > > > > > > > making the defaults explicit (see the answer to Shammon's question
> >> >> > > > > > > > above).
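[Editor's note: the defaulting rule described above, as a small sketch. Illustrative only; here -1 marks "parallelism not set", which is an assumption about how the unset case is encoded.]

```java
// Lower bound defaults to 1; upper bound defaults to the vertex parallelism
// if it is set, otherwise to the vertex maxParallelism.
class DefaultBounds {

    static int[] defaultBounds(int parallelism, int maxParallelism) {
        int upper = parallelism > 0 ? parallelism : maxParallelism;
        return new int[] {1, upper};
    }
}
```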
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > Thanks, everyone, for your initial thoughts!
>> >> > > > > > > >
>> >> > > > > > > > Best,
>> >> > > > > > > > D.
>> >> > > > > > > >
>> >> > > > > > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <
>> guoweijiereswqa@gmail.com
>> >> > > > > >
>> >> > > > > > > > wrote:
>> >> > > > > > > >
> >> >> > > > > > > > > Thanks David for driving this. This is very valuable work,
> >> >> > > > > > > > > especially for cloud-native environments.
>> >> > > > > > > > >
>> >> > > > > > > > > >> How about adding some more information such as
>> vertex type
>> >> > > > > > > > > (SOURCE/MAP/JOIN and .etc) in the response of `get jobs
>> >> > > > > > > > > resource-requirements`? For users, only vertex-id may
>> be difficult
>> >> > > > > to
>> >> > > > > > > > > understand.
>> >> > > > > > > > >
>> >> > > > > > > > > +1 for this suggestion, including jobvertex's name in
>> the response
>> >> > > > > > body
>> >> > > > > > > > is
>> >> > > > > > > > > more
>> >> > > > > > > > > user-friendly.
>> >> > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > > I saw this sentence in FLIP: "Setting the upper bound
>> to -1 will
>> >> > > > > > reset
>> >> > > > > > > > the
>> >> > > > > > > > > value to the default setting."  What is the default
>> value here
>> >> > > > > (based
>> >> > > > > > > on
>> >> > > > > > > > > what configuration), or just infinite?
>> >> > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > > Best regards,
>> >> > > > > > > > >
>> >> > > > > > > > > Weijie
>> >> > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > >
> >> >> > > > > > > > > Shammon FY <zj...@gmail.com> wrote on Mon, Feb 6, 2023 at 18:06:
>> >> > > > > > > > >
>> >> > > > > > > > > > Hi David
>> >> > > > > > > > > >
> >> >> > > > > > > > > > Thanks for initiating this discussion. I think declaring job resource
> >> >> > > > > > > > > > requirements via a REST API is very valuable. I just left some
> >> >> > > > > > > > > > comments as follows
>> >> > > > > > > > > >
>> >> > > > > > > > > > 1) How about adding some more information such as
>> vertex type
>> >> > > > > > > > > > (SOURCE/MAP/JOIN and .etc) in the response of `get
>> jobs
>> >> > > > > > > > > > resource-requirements`? For users, only vertex-id may
>> be
>> >> > > > > difficult
>> >> > > > > > to
>> >> > > > > > > > > > understand.
>> >> > > > > > > > > >
>> >> > > > > > > > > > 2) For sql jobs, we always use a unified parallelism
>> for most
>> >> > > > > > > vertices.
>> >> > > > > > > > > Can
>> >> > > > > > > > > > we provide them with a more convenient setting method
>> instead of
>> >> > > > > > each
>> >> > > > > > > > > one?
>> >> > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > > > Best,
>> >> > > > > > > > > > Shammon
>> >> > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <
>> >> > > > > > matthias.pohl@aiven.io
>> >> > > > > > > > > > .invalid>
>> >> > > > > > > > > > wrote:
>> >> > > > > > > > > >
>> >> > > > > > > > > > > Thanks David for creating this FLIP. It sounds
>> promising and
>> >> > > > > > useful
>> >> > > > > > > > to
>> >> > > > > > > > > > > have. Here are some thoughts from my side (some of
>> them might
>> >> > > > > be
>> >> > > > > > > > > rather a
>> >> > > > > > > > > > > follow-up and not necessarily part of this FLIP):
>> >> > > > > > > > > > > - I'm wondering whether it makes sense to add some
>> kind of
>> >> > > > > > resource
>> >> > > > > > > > ID
>> >> > > > > > > > > to
>> >> > > > > > > > > > > the REST API. This would give Flink a tool to
>> verify the PATCH
>> >> > > > > > > > request
>> >> > > > > > > > > of
>> >> > > > > > > > > > > the external system in a compare-and-set kind of
>> manner. AFAIU,
>> >> > > > > > the
>> >> > > > > > > > > > process
>> >> > > > > > > > > > > requires the external system to retrieve the
>> resource
>> >> > > > > > requirements
>> >> > > > > > > > > first
>> >> > > > > > > > > > > (to retrieve the vertex IDs). A resource ID <ABC>
>> would be sent
>> >> > > > > > > along
>> >> > > > > > > > > as
>> >> > > > > > > > > > a
>> >> > > > > > > > > > > unique identifier for the provided setup. It's
>> essentially the
>> >> > > > > > > > version
>> >> > > > > > > > > ID
>> >> > > > > > > > > > > of the currently deployed resource requirement
>> configuration.
>> >> > > > > > Flink
>> >> > > > > > > > > > doesn't
>> >> > > > > > > > > > > know whether the external system would use the
>> provided
>> >> > > > > > information
>> >> > > > > > > > in
>> >> > > > > > > > > > some
>> >> > > > > > > > > > > way to derive a new set of resource requirements
>> for this job.
>> >> > > > > > The
>> >> > > > > > > > > > > subsequent PATCH request with updated resource
>> requirements
>> >> > > > > would
>> >> > > > > > > > > include
>> >> > > > > > > > > > > the previously retrieved resource ID <ABC>. The
>> PATCH call
>> >> > > > > would
>> >> > > > > > > fail
>> >> > > > > > > > > if
>> >> > > > > > > > > > > there was a concurrent PATCH call in between
>> indicating to the
>> >> > > > > > > > external
>> >> > > > > > > > > > > system that the resource requirements were
>> concurrently
>> >> > > > > updated.
>> >> > > > > > > > > > > - How often do we allow resource requirements to be
>> changed?
>> >> > > > > That
>> >> > > > > > > > > > question
>> >> > > > > > > > > > > might make my previous comment on the resource ID
>> obsolete
>> >> > > > > > because
>> >> > > > > > > we
>> >> > > > > > > > > > could
>> >> > > > > > > > > > > just make any PATCH call fail if there was a
>> resource
>> >> > > > > requirement
>> >> > > > > > > > > update
>> >> > > > > > > > > > > within a certain time frame before the request. But
>> such a time
>> >> > > > > > > > period
>> >> > > > > > > > > is
>> >> > > > > > > > > > > something we might want to make configurable then,
>> I guess.
>> >> > > > > > > > > > > - Versioning the JobGraph in the JobGraphStore
>> rather than
>> >> > > > > > > > overwriting
>> >> > > > > > > > > it
>> >> > > > > > > > > > > might be an idea. This would enable us to provide
>> resource
>> >> > > > > > > > requirement
>> >> > > > > > > > > > > changes in the UI or through the REST API. It is
>> related to a
>> >> > > > > > > problem
>> >> > > > > > > > > > > around keeping track of the exception history
>> within the
>> >> > > > > > > > > > AdaptiveScheduler
>> >> > > > > > > > > > > and also having to consider multiple versions of a
>> JobGraph.
>> >> > > > > But
>> >> > > > > > > for
>> >> > > > > > > > > that
>> >> > > > > > > > > > > one, we use the ExecutionGraphInfoStore right now.
>> >> > > > > > > > > > > - Updating the JobGraph in the JobGraphStore makes
>> sense. I'm
>> >> > > > > > just
>> >> > > > > > > > > > > wondering whether we bundle two things together
>> that are
>> >> > > > > actually
>> >> > > > > > > > > > separate:
>> >> > > > > > > > > > > The business logic and the execution configuration
>> (the
>> >> > > > > resource
>> >> > > > > > > > > > > requirements). I'm aware that this is not a flaw of
>> the current
>> >> > > > > > > FLIP
>> >> > > > > > > > > but
>> >> > > > > > > > > > > rather something that was not necessary to address
>> in the past
>> >> > > > > > > > because
>> >> > > > > > > > > > the
>> >> > > > > > > > > > > JobGraph was kind of static. I don't remember
>> whether that was
>> >> > > > > > > > already
>> >> > > > > > > > > > > discussed while working on the AdaptiveScheduler
>> for FLIP-160
>> >> > > > > > [1].
>> >> > > > > > > > > Maybe,
>> >> > > > > > > > > > > I'm missing some functionality here that requires
>> us to have
>> >> > > > > > > > everything
>> >> > > > > > > > > > in
>> >> > > > > > > > > > > one place. But it feels like updating the entire
>> JobGraph which
>> >> > > > > > > could
>> >> > > > > > > > > be
>> >> > > > > > > > > > > actually a "config change" is not reasonable.
>> ...also
>> >> > > > > considering
>> >> > > > > > > the
>> >> > > > > > > > > > > amount of data that can be stored in a
>> ConfigMap/ZooKeeper node
>> >> > > > > > if
>> >> > > > > > > > > > > versioning the resource requirement change as
>> proposed in my
>> >> > > > > > > previous
>> >> > > > > > > > > > item
>> >> > > > > > > > > > > is an option for us.
>> >> > > > > > > > > > > - Updating the JobGraphStore means adding more
>> requests to the
>> >> > > > > HA
>> >> > > > > > > > > backend
>> >> > > > > > > > > > > API. There were some concerns shared in the
>> discussion thread
>> >> > > > > [2]
>> >> > > > > > > for
>> >> > > > > > > > > > > FLIP-270 [3] on pressuring the k8s API server in
>> the past with
>> >> > > > > > too
>> >> > > > > > > > many
>> >> > > > > > > calls. Even though it's more likely to be caused by
>> >> > > > > > checkpointing,
>> >> > > > > > > I
>> >> > > > > > > > > > still
>> >> > > > > > > > > > > wanted to bring it up. We're working on a
>> standardized
>> >> > > > > > performance
>> >> > > > > > > > test
>> >> > > > > > > > > > to
>> >> > > > > > > > > > > prepare going forward with FLIP-270 [3] right now.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > Best,
>> >> > > > > > > > > > > Matthias
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > [1]
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > >
>> >> > > > > > >
>> >> > > > > >
>> >> > > > >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
>> >> > > > > > > > > > > [2]
>> >> > > > > > >
>> https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
>> >> > > > > > > > > > > [3]
>> >> > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > >
>> >> > > > > > >
>> >> > > > > >
>> >> > > > >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <
>> jam.gzczy@gmail.com
>> >> > > > > >
>> >> > > > > > > > wrote:
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > > Hi David:
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > > Thank you for driving this FLIP, which helps reduce Flink shutdown time.
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > > For this FLIP, I would like to share a few ideas:
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > >    - When the number of slots is insufficient, can we
>> >> > > > > > > > > > > >    stop users from rescaling, or throw something to tell
>> >> > > > > > > > > > > >    the user "not enough available slots to upgrade,
>> >> > > > > > > > > > > >    please check your available slots"? Or we could have
>> >> > > > > > > > > > > >    a request switch (true/false) to allow this behavior.
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > >    - When a user upgrades job-vertex parallelism, I want
>> >> > > > > > > > > > > >    an interface to query the status of the current
>> >> > > > > > > > > > > >    parallelism update, so that the user or a program can
>> >> > > > > > > > > > > >    understand the current status.
>> >> > > > > > > > > > > >    - Such an interface would also help with management,
>> >> > > > > > > > > > > >    similar to the *[1] Flink K8S Operator*.
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > > {
>> >> > > > > > > > > > > >   status: Failed
>> >> > > > > > > > > > > >   reason: "not enough available slots to upgrade,
>> >> > > > > > > > > > > >   please check your available slots"
>> >> > > > > > > > > > > > }
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > >    - *Pending*: this job has joined the upgrade queue;
>> >> > > > > > > > > > > >    it will be updated later
>> >> > > > > > > > > > > >    - *Rescaling*: the job is rescaling; wait for it to
>> >> > > > > > > > > > > >    finish
>> >> > > > > > > > > > > >    - *Finished*: the rescale is done
>> >> > > > > > > > > > > >    - *Failed*: something went wrong, so this job cannot
>> >> > > > > > > > > > > >    be upgraded
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > > I would like to add the above content to the FLIP. What
>> >> > > > > > > > > > > > do you think?
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > >    1.
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > >
>> >> > > > >
>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > > David Morávek <dm...@apache.org> wrote on Fri, Feb 3, 2023 at 16:42:
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > > > Hi everyone,
>> >> > > > > > > > > > > > >
>> >> > > > > > > > > > > > > This FLIP [1] introduces a new REST API for
>> declaring
>> >> > > > > > resource
>> >> > > > > > > > > > > > requirements
>> >> > > > > > > > > > > > > for the Adaptive Scheduler. There seems to be a
>> clear need
>> >> > > > > > for
>> >> > > > > > > > this
>> >> > > > > > > > > > API
>> >> > > > > > > > > > > > > based on the discussion on the "Reworking the
>> Rescale API"
>> >> > > > > > [2]
>> >> > > > > > > > > > thread.
>> >> > > > > > > > > > > > >
>> >> > > > > > > > > > > > > Before we get started, this work is heavily
>> based on the
>> >> > > > > > > > prototype
>> >> > > > > > > > > > [3]
>> >> > > > > > > > > > > > > created by Till Rohrmann, and the FLIP is being
>> published
>> >> > > > > > with
>> >> > > > > > > > his
>> >> > > > > > > > > > > > consent.
>> >> > > > > > > > > > > > > Big shoutout to him!
>> >> > > > > > > > > > > > >
>> >> > > > > > > > > > > > > Last but not least, thanks to Chesnay and Roman
>> for the
>> >> > > > > > initial
>> >> > > > > > > > > > reviews
>> >> > > > > > > > > > > > and
>> >> > > > > > > > > > > > > discussions.
>> >> > > > > > > > > > > > >
>> >> > > > > > > > > > > > > The best start would be watching a short demo
>> [4] that I've
>> >> > > > > > > > > recorded,
>> >> > > > > > > > > > > > which
>> >> > > > > > > > > > > > > illustrates newly added capabilities (rescaling
>> the running
>> >> > > > > > > job,
>> >> > > > > > > > > > > handing
>> >> > > > > > > > > > > > > back resources to the RM, and session cluster
>> support).
>> >> > > > > > > > > > > > >
>> >> > > > > > > > > > > > > The intuition behind the FLIP is being able to
>> define
>> >> > > > > > resource
>> >> > > > > > > > > > > > requirements
>> >> > > > > > > > > > > > > ("resource boundaries") externally that the
>> >> > > > > AdaptiveScheduler
>> >> > > > > > > can
>> >> > > > > > > > > > > > navigate
>> >> > > > > > > > > > > > > within. This is a building block for
>> higher-level efforts
>> >> > > > > > such
>> >> > > > > > > as
>> >> > > > > > > > > an
>> >> > > > > > > > > > > > > external Autoscaler. The natural extension of
>> this work
>> >> > > > > would
>> >> > > > > > > be
>> >> > > > > > > > to
>> >> > > > > > > > > > > allow
>> >> > > > > > > > > > > > > to specify per-vertex ResourceProfiles.
>> >> > > > > > > > > > > > >
>> >> > > > > > > > > > > > > Looking forward to your thoughts; any feedback
>> is
>> >> > > > > > appreciated!
>> >> > > > > > > > > > > > >
>> >> > > > > > > > > > > > > [1]
>> >> > > > > > > > > > > > >
>> >> > > > > > > > > > > > >
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > >
>> >> > > > > > >
>> >> > > > > >
>> >> > > > >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
>> >> > > > > > > > > > > > > [2]
>> >> > > > > > > > >
>> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
>> >> > > > > > > > > > > > > [3]
>> https://github.com/tillrohrmann/flink/tree/autoscaling
>> >> > > > > > > > > > > > > [4]
>> >> > > > > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > >
>> https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
>> >> > > > > > > > > > > > >
>> >> > > > > > > > > > > > > Best,
>> >> > > > > > > > > > > > > D.
>> >> > > > > > > > > > > > >
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > > --
>> >> > > > > > > > > > > > Best
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > > ConradJam
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > >
>> >> > > > > > >
>> >> > > > > >
>> >> > > > >
>>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by Matthias Pohl <ma...@aiven.io.INVALID>.
Thanks for your clarifications, David. I don't have any additional major
points to add. One thing about the FLIP: The RPC layer API for updating the
JRR returns a future with a JRR? I don't see the value in returning a JRR here,
since it's an idempotent operation. Wouldn't it be enough to return
CompletableFuture<Void> here? Or am I missing something?

Matthias
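
For readers following along, the idempotent PUT semantics discussed in this
thread can be sketched roughly as below. This is an illustration only: the
vertex IDs, the "parallelism"/"lowerBound"/"upperBound" field names, and the
requirement that the payload cover every vertex are assumptions drawn from
this discussion, not the final FLIP-291 API.

```python
# Sketch of the idempotent PUT contract discussed in the thread: the caller
# declares bounds for every vertex, and replaying the same request leaves the
# stored state unchanged. All identifiers here are illustrative assumptions.
import json


def put_resource_requirements(store: dict, payload: dict) -> dict:
    """Idempotently replace the full set of per-vertex requirements."""
    for vertex_id, req in payload.items():
        lower = req["parallelism"]["lowerBound"]
        upper = req["parallelism"]["upperBound"]
        if not (1 <= lower <= upper):
            raise ValueError(f"invalid bounds for {vertex_id}: [{lower}, {upper}]")
    store.clear()
    store.update(json.loads(json.dumps(payload)))  # deep copy of the payload
    return store


# Hypothetical vertex IDs, in the hex format Flink uses for job vertices.
requirements = {
    "bc764cd8ddf7a0cff126f51c16239658": {"parallelism": {"lowerBound": 1, "upperBound": 4}},
    "0a448493b4782967b150582570326227": {"parallelism": {"lowerBound": 1, "upperBound": 2}},
}

store: dict = {}
first = dict(put_resource_requirements(store, requirements))
second = dict(put_resource_requirements(store, requirements))
assert first == second  # idempotent: replaying the same PUT changes nothing
print(len(store))  # -> 2
```

Under this contract, returning the stored requirements from the call adds no
information the caller does not already have, which is the motivation for the
CompletableFuture<Void>-style return value questioned above.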

On Mon, Feb 20, 2023 at 1:48 PM Maximilian Michels <mx...@apache.org> wrote:

> Thanks David! If we could get the pre-allocation working as part of
> the FLIP, that would be great.
>
> Concerning the downscale case, I agree this is a special case for the
> (single-job) application mode where we could re-allocate slots in a
> way that could leave entire task managers unoccupied which we would
> then be able to release. The goal essentially is to reduce slot
> fragmentation on scale down by packing the slots efficiently. The
> easiest way to add this optimization when running in application mode
> would be to drop as many task managers during the restart such that
> NUM_AVAILABLE_SLOTS >= NUM_REQUIRED_SLOTS stays true. We can look into
> this independently of the FLIP.
>
> Feel free to start the vote.
>
> -Max
>
> On Mon, Feb 20, 2023 at 9:10 AM David Morávek <dm...@apache.org> wrote:
> >
> > Hi everyone,
> >
> > Thanks for the feedback! I've updated the FLIP to use idempotent PUT API
> instead of PATCH and to properly handle lower bound settings, to support
> the "pre-allocation" of the resources.
> >
> > @Max
> >
> > > How hard would it be to address this issue in the FLIP?
> >
> > I've included this in the FLIP. It might not be too hard to implement
> this in the end.
> >
> > > B) drop as many superfluous task managers as needed
> >
> > I've intentionally left this part out for now because this ultimately
> needs to be the responsibility of the Resource Manager. After all, in the
> Session Cluster scenario, the Scheduler doesn't have the bigger picture of
> other tasks of other jobs running on those TMs. This will most likely be a
> topic for another FLIP.
> >
> > WDYT? If there are no other questions or concerns, I'd like to start the
> vote on Wednesday.
> >
> > Best,
> > D.
> >
> > On Wed, Feb 15, 2023 at 3:34 PM Maximilian Michels <mx...@apache.org>
> wrote:
> >>
> >> I missed that the FLIP states:
> >>
> >> > Currently, even though we’d expose the lower bound for clarity and
> API completeness, we won’t allow setting it to any other value than one
> until we have full support throughout the stack.
> >>
> >> How hard would it be to address this issue in the FLIP?
> >>
> >> There is not much value in offering a lower bound which won't be
> >> respected / throws an error when it is set. If we had support for a
> >> lower bound, we could enforce a resource contract externally via
> >> setting lowerBound == upperBound. That ties back to the Rescale API
> >> discussion we had. I want to better understand what the major concerns
> >> would be around allowing this.
> >>
> >> Just to outline how I imagine the logic to work:
> >>
> >> A) The resource constraints are already met => Nothing changes
> >> B) More resources available than required => Cancel the job, drop as
> >> many superfluous task managers as needed, restart the job
> >> C) Less resources available than required => Acquire new task
> >> managers, wait for them to register, cancel and restart the job
> >>
> >> I'm open to helping out with the implementation.
> >>
> >> -Max
> >>
> >> On Mon, Feb 13, 2023 at 7:45 PM Maximilian Michels <mx...@apache.org>
> wrote:
> >> >
> >> > Based on further discussion I had with Chesnay on this PR [1], I think
> >> > jobs would currently go into a restarting state after the resource
> >> > requirements have changed. This wouldn't achieve what we had in mind,
> >> > i.e. sticking to the old resource requirements until enough slots are
> >> > available to fulfil the new resource requirements. So this may not be
> >> > 100% what we need but it could be extended to do what we want.
> >> >
> >> > -Max
> >> >
> >> > [1] https://github.com/apache/flink/pull/21908#discussion_r1104792362
> >> >
> >> > On Mon, Feb 13, 2023 at 7:16 PM Maximilian Michels <mx...@apache.org>
> wrote:
> >> > >
> >> > > Hi David,
> >> > >
> >> > > This is awesome! Great writeup and demo. This is pretty much what we
> >> > > need for the autoscaler as part of the Flink Kubernetes operator
> [1].
> >> > > Scaling Flink jobs effectively is hard but fortunately we have
> solved
> >> > > the issue as part of the Flink Kubernetes operator. The only
> critical
> >> > > piece we are missing is a better way to execute scaling decisions,
> as
> >> > > discussed in [2].
> >> > >
> >> > > Looking at your proposal, we would set lowerBound == upperBound for
> >> > > the parallelism because we want to fully determine the parallelism
> >> > > externally based on the scaling metrics. Does that sound right?
> >> > >
> >> > > What is the timeline for these changes? Is there a JIRA?
> >> > >
> >> > > Cheers,
> >> > > Max
> >> > >
> >> > > [1]
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
> >> > > [2]
> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> >> > >
> >> > > On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu <xi...@gmail.com>
> wrote:
> >> > > >
> >> > > > Hi David,
> >> > > >
> >> > > > Thanks for your reply.  I think your response totally makes
> sense.  This
> >> > > > flip targets on declaring required resource to ResourceManager
> instead of
> >> > > > using  ResourceManager to add/remove TMs directly.
> >> > > >
> >> > > > Best,
> >> > > > Xiangyu
> >> > > >
> >> > > >
> >> > > >
> >> > > > David Morávek <da...@gmail.com> wrote on Mon, Feb 13, 2023 at 15:46:
> >> > > >
> >> > > > > Hi everyone,
> >> > > > >
> >> > > > > @Shammon
> >> > > > >
> >> > > > > I'm not entirely sure what "config file" you're referring to.
> You can, of
> >> > > > > course, override the default parallelism in "flink-conf.yaml",
> but for
> >> > > > > sinks and sources, the parallelism needs to be tweaked on the
> connector
> >> > > > > level ("WITH" statement).
> >> > > > >
> >> > > > > This is something that should be achieved with tooling around
> Flink. We
> >> > > > > want to provide an API on the lowest level that generalizes
> well. Achieving
> >> > > > > what you're describing should be straightforward with this API.
> >> > > > >
> >> > > > > @Xiangyu
> >> > > > >
> >> > > > > Is it possible for this REST API to declare TM resources in the
> future?
> >> > > > >
> >> > > > >
> >> > > > > Would you like to add/remove TMs if you use an active Resource
> Manager?
> >> > > > > This would be out of the scope of this effort since it targets
> the
> >> > > > > scheduler component only (we make no assumptions about the used
> Resource
> >> > > > > Manager). Also, the AdaptiveScheduler is only intended to be
> used for
> >> > > > > Streaming.
> >> > > > >
> >> > > > >  And for streaming jobs, I'm wondering if there is any
> situation we need to
> >> > > > > > rescale the TM resources of a flink cluster at first and then
> the
> >> > > > > adaptive
> >> > > > > > scheduler will rescale the per-vertex ResourceProfiles
> accordingly.
> >> > > > > >
> >> > > > >
> >> > > > > We plan on adding support for the ResourceProfiles (dynamic slot
> >> > > > > allocation) as the next step. Again we won't make any
> assumptions about the
> >> > > > > used Resource Manager. In other words, this effort ends by
> declaring
> >> > > > > desired resources to the Resource Manager.
> >> > > > >
> >> > > > > Does that make sense?
> >> > > > >
> >> > > > > @Matthias
> >> > > > >
> >> > > > > We've done another pass on the proposed API and currently lean
> towards
> >> > > > > having an idempotent PUT API.
> >> > > > > - We don't care too much about multiple writers' scenarios in
> terms of who
> >> > > > > can write an authoritative payload; this is up to the user of
> the API to
> >> > > > > figure out
> >> > > > > - It's indeed tricky to achieve atomicity with PATCH API;
> switching to PUT
> >> > > > > API seems to do the trick
> >> > > > > - We won't allow partial "payloads" anymore, meaning you need
> to define
> >> > > > > requirements for all vertices in the JobGraph; This is
> completely fine for
> >> > > > > the programmatic workflows. For DEBUG / DEMO purposes, you can
> use the GET
> >> > > > > endpoint and tweak the response to avoid writing the whole
> payload by hand.
> >> > > > >
> >> > > > > WDYT?
> >> > > > >
> >> > > > >
> >> > > > > Best,
> >> > > > > D.
> >> > > > >
> >> > > > > On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <
> xiangyu0xf@gmail.com>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Hi David,
> >> > > > > >
> >> > > > > > Thanks for creating this flip. I think this work it is very
> useful,
> >> > > > > > especially in autoscaling scenario.  I would like to share
> some questions
> >> > > > > > from my view.
> >> > > > > >
> >> > > > > > 1, Is it possible for this REST API to declare TM resources
> in the
> >> > > > > future?
> >> > > > > > I'm asking because we are building the autoscaling feature
> for Flink OLAP
> >> > > > > > Session Cluster in ByteDance. We need to rescale the
> cluster's resource
> >> > > > > on
> >> > > > > > TM level instead of Job level. It would be very helpful if we
> have a REST
> >> > > > > > API for our external Autoscaling service to use.
> >> > > > > >
> >> > > > > > 2, And for streaming jobs, I'm wondering if there is any
> situation we
> >> > > > > need
> >> > > > > > to rescale the TM resources of a flink cluster at first and
> then the
> >> > > > > > adaptive scheduler will rescale the per-vertex
> ResourceProfiles
> >> > > > > > accordingly.
> >> > > > > >
> >> > > > > > best.
> >> > > > > > Xiangyu
> >> > > > > >
> >> > > > > > Shammon FY <zj...@gmail.com> wrote on Thu, Feb 9, 2023 at 11:31:
> >> > > > > >
> >> > > > > > > Hi David
> >> > > > > > >
> >> > > > > > > Thanks for your answer.
> >> > > > > > >
> >> > > > > > > > Can you elaborate more about how you'd intend to use the
> endpoint? I
> >> > > > > > > think we can ultimately introduce a way of re-declaring
> "per-vertex
> >> > > > > > > defaults," but I'd like to understand the use case bit more
> first.
> >> > > > > > >
> >> > > > > > > For this issue, I mainly consider the consistency of user
> configuration
> >> > > > > > and
> >> > > > > > > job runtime. For sql jobs, users usually set specific
> parallelism for
> >> > > > > > > source and sink, and set a global parallelism for other
> operators.
> >> > > > > These
> >> > > > > > > config items are stored in a config file. For some
> high-priority jobs,
> >> > > > > > > users may want to manage them manually.
> >> > > > > > > 1. When users need to scale the parallelism, they should
> update the
> >> > > > > > config
> >> > > > > > > file and restart the Flink job, which may take a long time.
> >> > > > > > > 2. After providing the REST API, users can just send a
> request to the
> >> > > > > job
> >> > > > > > > via REST API quickly after updating the config file.
> >> > > > > > > The configuration in the running job and config file should
> be the
> >> > > > > same.
> >> > > > > > > What do you think of this?
> >> > > > > > >
> >> > > > > > > best.
> >> > > > > > > Shammon
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <
> david.moravek@gmail.com>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Hi everyone,
> >> > > > > > > >
> >> > > > > > > > Let's try to answer the questions one by one.
> >> > > > > > > >
> >> > > > > > > > *@ConradJam*
> >> > > > > > > >
> >> > > > > > > > when the number of "slots" is insufficient, can we can
> stop users
> >> > > > > > > rescaling
> >> > > > > > > > > or throw something to tell user "less avaliable slots
> to upgrade,
> >> > > > > > > please
> >> > > > > > > > > checkout your alivalbe slots" ?
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > > The main property of AdaptiveScheduler is that it can
> adapt to
> >> > > > > > "available
> >> > > > > > > > resources," which means you're still able to make
> progress even
> >> > > > > though
> >> > > > > > > you
> >> > > > > > > > didn't get all the slots you've asked for. Let's break
> down the pros
> >> > > > > > and
> >> > > > > > > > cons of this property.
> >> > > > > > > >
> >> > > > > > > > - (plus) If you lose a TM for some reason, you can still
> recover even
> >> > > > > > if
> >> > > > > > > it
> >> > > > > > > > doesn't come back. We still need to give it some time to
> eliminate
> >> > > > > > > > unnecessary rescaling, which can be controlled by setting
> >> > > > > > > > "resource-stabilization-timeout."
> >> > > > > > > > - (plus) The resources can arrive with a significant
> delay. For
> >> > > > > > example,
> >> > > > > > > > you're unable to spawn enough TMs on time because you've
> run out of
> >> > > > > > > > resources in your k8s cluster, and you need to wait for
> the cluster
> >> > > > > > auto
> >> > > > > > > > scaler to kick in and add new nodes to the cluster. In
> this scenario,
> >> > > > > > > > you'll be able to start making progress faster, at the
> cost of
> >> > > > > multiple
> >> > > > > > > > rescalings (once the remaining resources arrive).
> >> > > > > > > > - (plus) This plays well with the declarative manner of
> today's
> >> > > > > > > > infrastructure. For example, you tell k8s that you need
> 10 TMs, and
> >> > > > > > > you'll
> >> > > > > > > > eventually get them.
> >> > > > > > > > - (minus) In the case of large state jobs, the cost of
> multiple
> >> > > > > > > rescalings
> >> > > > > > > > might outweigh the above.
> >> > > > > > > >
> >> > > > > > > > We've already touched on the solution to this problem on
> the FLIP.
> >> > > > > > Please
> >> > > > > > > > notice the parallelism knob being a range with a lower
> and upper
> >> > > > > bound.
> >> > > > > > > > Setting both the lower and upper bound to the same value
> could give
> >> > > > > the
> >> > > > > > > > behavior you're describing at the cost of giving up some
> properties
> >> > > > > > that
> >> > > > > > > AS
> >> > > > > > > > gives you (you'd be falling back to the
> DefaultScheduler's behavior).
> >> > > > > > > >
> >> > > > > > > > when user upgrade job-vertx-parallelism . I want to have
> an interface
> >> > > > > > to
> >> > > > > > > > > query the current update parallel execution status, so
> that the
> >> > > > > user
> >> > > > > > or
> >> > > > > > > > > program can understand the current status
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > > This is a misunderstanding. We're not introducing the
> RESCALE
> >> > > > > endpoint.
> >> > > > > > > > This endpoint allows you to re-declare the resources
> needed to run
> >> > > > > the
> >> > > > > > > job.
> >> > > > > > > > Once you reach the desired resources (you get more
> resources than the
> >> > > > > > > lower
> >> > > > > > > > bound defines), your job will run.
> >> > > > > > > >
> >> > > > > > > > We can expose a similar endpoint to "resource
> requirements" to give
> >> > > > > you
> >> > > > > > > an
> >> > > > > > > > overview of the resources the vertices already have. You
> can already
> >> > > > > > get
> >> > > > > > > > this from the REST API, so exposing this in yet another
> way should be
> >> > > > > > > > considered carefully.
> >> > > > > > > >
> >> > > > > > > > *@Matthias*
> >> > > > > > > >
> >> > > > > > > > I'm wondering whether it makes sense to add some kind of
> resource ID
> >> > > > > to
> >> > > > > > > the
> >> > > > > > > > > REST API.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > That's a good question. I want to think about that and
> get back to
> >> > > > > the
> >> > > > > > > > question later. My main struggle when thinking about this
> is, "if
> >> > > > > this
> >> > > > > > > > would be an idempotent POST endpoint," would it be any
> different?
> >> > > > > > > >
> >> > > > > > > > How often do we allow resource requirements to be changed?
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > There shall be no rate limiting on the FLINK side. If
> this is
> >> > > > > something
> >> > > > > > > > your environment needs, you can achieve it on a different
> layer ("we
> >> > > > > > > can't
> >> > > > > > > > have FLINK to do everything").
> >> > > > > > > >
> >> > > > > > > > Versioning the JobGraph in the JobGraphStore rather than
> overwriting
> >> > > > > it
> >> > > > > > > > > might be an idea.
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > > This sounds interesting since it would be closer to the
> JobGraph
> >> > > > > being
> >> > > > > > > > immutable. The main problem I see here is that this would
> introduce a
> >> > > > > > > > BW-incompatible change so it might be a topic for
> follow-up FLIP.
> >> > > > > > > >
> >> > > > > > > > I'm just wondering whether we bundle two things together
> that are
> >> > > > > > > actually
> >> > > > > > > > > separate
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Yup, this is how we think about it as well. The main
> question is,
> >> > > > > "who
> >> > > > > > > > should be responsible for bookkeeping 1) the JobGraph and
> 2) the
> >> > > > > > > > JobResourceRequirements". The JobMaster would be the
> right place for
> >> > > > > > > both,
> >> > > > > > > > but it's currently not the case, and we're tightly
> coupling the
> >> > > > > > > dispatcher
> >> > > > > > > > with the JobMaster.
> >> > > > > > > >
> >> > > > > > > > Initially, we tried to introduce a separate HA component
> in JobMaster
> >> > > > > > for
> >> > > > > > > > bookkeeping the JobResourceRequirements, but that proved
> to be a more
> >> > > > > > > > significant effort adding additional mess to the already
> messy HA
> >> > > > > > > > ecosystem. Another approach we've discussed was mutating
> the JobGraph
> >> > > > > > and
> >> > > > > > > > setting JRR into the JobGraph structure itself.
> >> > > > > > > >
> >> > > > > > > > The middle ground for keeping this effort reasonably
> sized and not
> >> > > > > > > > violating "we want to keep JG immutable" too much is
> keeping the
> >> > > > > > > > JobResourceRequirements separate as an internal config
> option in
> >> > > > > > > JobGraph's
> >> > > > > > > > configuration.
> >> > > > > > > >
> >> > > > > > > > We ultimately need to rethink the tight coupling of
> Dispatcher and
> >> > > > > > > > JobMaster, but it needs to be a separate effort.
> >> > > > > > > >
> >> > > > > > > > ...also considering the amount of data that can be stored
> in a
> >> > > > > > > > > ConfigMap/ZooKeeper node if versioning the resource
> requirement
> >> > > > > > change
> >> > > > > > > as
> >> > > > > > > > > proposed in my previous item is an option for us.
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > > AFAIK we're only storing pointers to the S3 objects in HA
> metadata,
> >> > > > > so
> >> > > > > > we
> >> > > > > > > > should be okay with having larger structures for now.
> >> > > > > > > >
> >> > > > > > > > Updating the JobGraphStore means adding more requests to
> the HA
> >> > > > > backend
> >> > > > > > > > API.
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > > It's fine unless you intend to override the resource
> requirements a
> >> > > > > few
> >> > > > > > > > times per second.
> >> > > > > > > >
> >> > > > > > > > *@Shammon*
> >> > > > > > > >
> >> > > > > > > > How about adding some more information such as vertex type
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Since it was intended as a "debug" endpoint, it makes
> complete sense!
> >> > > > > > > >
> >> > > > > > > >  For SQL jobs, we always use a unified parallelism for
> most vertices.
> >> > > > > > Can
> >> > > > > > > > > we provide them with a more convenient setting method
> instead of
> >> > > > > each
> >> > > > > > > > one?
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > I completely sympathize with this. The main thoughts when
> designing the API
> >> > > > > > > were:
> >> > > > > > > > - We want to keep it clean and easy to understand.
> >> > > > > > > > - Global parallelism can be modeled using per-vertex
> parallelism but
> >> > > > > > not
> >> > > > > > > > the other way around.
> >> > > > > > > > - The API will be used by external tooling (operator,
> auto scaler).
> >> > > > > > > >
> >> > > > > > > > Can you elaborate more about how you'd intend to use the
> endpoint? I
> >> > > > > > > think
> >> > > > > > > > we can ultimately introduce a way of re-declaring
> "per-vertex
> >> > > > > > defaults,"
> >> > > > > > > > but I'd like to understand the use case a bit more first.
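For illustration, the kind of per-vertex payload that external tooling could send might look like the following sketch. The field names (`parallelism`, `lowerBound`, `upperBound`) and the vertex IDs are assumptions based on this thread, not the final FLIP schema:

```python
import json

# Sketch only: the payload shape and vertex IDs below are illustrative
# assumptions, not the final FLIP-291 API schema.
def build_requirements(vertices):
    """Map {vertex_id: (lower, upper)} to a declarative JSON payload."""
    return {
        vertex_id: {"parallelism": {"lowerBound": lo, "upperBound": hi}}
        for vertex_id, (lo, hi) in vertices.items()
    }

payload = build_requirements({
    "bc764cd8ddf7a0cff126f51c16239658": (1, 4),  # hypothetical source vertex
    "0a448493b4782967b150582570326227": (1, 8),  # hypothetical sink vertex
})
print(json.dumps(payload, indent=2))
```

Per-vertex bounds keep the API minimal while still letting callers express a global parallelism by repeating the same bounds for every vertex.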
> >> > > > > > > >
> >> > > > > > > > *@Weijie*
> >> > > > > > > >
> >> > > > > > > > What is the default value here (based on what
> configuration), or just
> >> > > > > > > > > infinite?
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Currently, for the lower bound, it's always one, and for
> the upper
> >> > > > > > bound,
> >> > > > > > > > it's either parallelism (if defined) or the
> maxParallelism of the
> >> > > > > > vertex
> >> > > > > > > in
> >> > > > > > > > JobGraph. This question might be another signal for
> making the
> >> > > > > defaults
> >> > > > > > > > explicit (see the answer to Shammon's question above).
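The defaulting rule described above can be sketched as follows; treating `None` as "parallelism not defined" is an assumption of this sketch:

```python
# Sketch of the default-bounds rule: the lower bound is always 1, and the
# upper bound is the vertex parallelism if defined, otherwise the vertex's
# maxParallelism from the JobGraph.
def default_bounds(parallelism, max_parallelism):
    upper = parallelism if parallelism is not None else max_parallelism
    return (1, upper)

print(default_bounds(4, 128))     # explicit parallelism wins
print(default_bounds(None, 128))  # falls back to maxParallelism
```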
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Thanks, everyone, for your initial thoughts!
> >> > > > > > > >
> >> > > > > > > > Best,
> >> > > > > > > > D.
> >> > > > > > > >
> >> > > > > > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <
> guoweijiereswqa@gmail.com
> >> > > > > >
> >> > > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Thanks David for driving this. This is a very valuable
> work,
> >> > > > > > especially
> >> > > > > > > > for
> >> > > > > > > > > cloud native environment.
> >> > > > > > > > >
> >> > > > > > > > > >> How about adding some more information such as
> vertex type
> >> > > > > > > > > (SOURCE/MAP/JOIN, etc.) in the response of `get jobs
> >> > > > > > > > > resource-requirements`? For users, only vertex-id may
> be difficult
> >> > > > > to
> >> > > > > > > > > understand.
> >> > > > > > > > >
> >> > > > > > > > > +1 for this suggestion; including the JobVertex name in
> the response
> >> > > > > > body
> >> > > > > > > > is
> >> > > > > > > > > more
> >> > > > > > > > > user-friendly.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > I saw this sentence in FLIP: "Setting the upper bound
> to -1 will
> >> > > > > > reset
> >> > > > > > > > the
> >> > > > > > > > > value to the default setting."  What is the default
> value here
> >> > > > > (based
> >> > > > > > > on
> >> > > > > > > > > what configuration), or just infinite?
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > Best regards,
> >> > > > > > > > >
> >> > > > > > > > > Weijie
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > Shammon FY <zj...@gmail.com> wrote on Mon, Feb 6, 2023 at 18:06:
> >> > > > > > > > >
> >> > > > > > > > > > Hi David
> >> > > > > > > > > >
> >> > > > > > > > > > Thanks for initiating this discussion. I think
> declaring job
> >> > > > > > resource
> >> > > > > > > > > > requirements by REST API is very valuable. I just
> left some
> >> > > > > > comments
> >> > > > > > > as
> >> > > > > > > > > > follows:
> >> > > > > > > > > >
> >> > > > > > > > > > 1) How about adding some more information such as
> vertex type
> >> > > > > > > > > > (SOURCE/MAP/JOIN, etc.) in the response of `get
> jobs
> >> > > > > > > > > > resource-requirements`? For users, only vertex-id may
> be
> >> > > > > difficult
> >> > > > > > to
> >> > > > > > > > > > understand.
> >> > > > > > > > > >
> >> > > > > > > > > > 2) For SQL jobs, we always use a unified parallelism for most
> for most
> >> > > > > > > vertices.
> >> > > > > > > > > Can
> >> > > > > > > > > > we provide them with a more convenient setting method
> instead of
> >> > > > > > each
> >> > > > > > > > > one?
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > Best,
> >> > > > > > > > > > Shammon
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <
> >> > > > > > matthias.pohl@aiven.io
> >> > > > > > > > > > .invalid>
> >> > > > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > Thanks David for creating this FLIP. It sounds
> promising and
> >> > > > > > useful
> >> > > > > > > > to
> >> > > > > > > > > > > have. Here are some thoughts from my side (some of
> them might
> >> > > > > be
> >> > > > > > > > > rather a
> >> > > > > > > > > > > follow-up and not necessarily part of this FLIP):
> >> > > > > > > > > > > - I'm wondering whether it makes sense to add some
> kind of
> >> > > > > > resource
> >> > > > > > > > ID
> >> > > > > > > > > to
> >> > > > > > > > > > > the REST API. This would give Flink a tool to
> verify the PATCH
> >> > > > > > > > request
> >> > > > > > > > > of
> >> > > > > > > > > > > the external system in a compare-and-set kind of
> manner. AFAIU,
> >> > > > > > the
> >> > > > > > > > > > process
> >> > > > > > > > > > > requires the external system to retrieve the
> resource
> >> > > > > > requirements
> >> > > > > > > > > first
> >> > > > > > > > > > > (to retrieve the vertex IDs). A resource ID <ABC>
> would be sent
> >> > > > > > > along
> >> > > > > > > > > as
> >> > > > > > > > > > a
> >> > > > > > > > > > > unique identifier for the provided setup. It's
> essentially the
> >> > > > > > > > version
> >> > > > > > > > > ID
> >> > > > > > > > > > > of the currently deployed resource requirement
> configuration.
> >> > > > > > Flink
> >> > > > > > > > > > doesn't
> >> > > > > > > > > > > know whether the external system would use the
> provided
> >> > > > > > information
> >> > > > > > > > in
> >> > > > > > > > > > some
> >> > > > > > > > > > > way to derive a new set of resource requirements
> for this job.
> >> > > > > > The
> >> > > > > > > > > > > subsequent PATCH request with updated resource
> requirements
> >> > > > > would
> >> > > > > > > > > include
> >> > > > > > > > > > > the previously retrieved resource ID <ABC>. The
> PATCH call
> >> > > > > would
> >> > > > > > > fail
> >> > > > > > > > > if
> >> > > > > > > > > > > there was a concurrent PATCH call in between
> indicating to the
> >> > > > > > > > external
> >> > > > > > > > > > > system that the resource requirements were
> concurrently
> >> > > > > updated.
> >> > > > > > > > > > > - How often do we allow resource requirements to be
> changed?
> >> > > > > That
> >> > > > > > > > > > question
> >> > > > > > > > > > > might make my previous comment on the resource ID
> obsolete
> >> > > > > > because
> >> > > > > > > we
> >> > > > > > > > > > could
> >> > > > > > > > > > > just make any PATCH call fail if there was a
> resource
> >> > > > > requirement
> >> > > > > > > > > update
> >> > > > > > > > > > > within a certain time frame before the request. But
> such a time
> >> > > > > > > > period
> >> > > > > > > > > is
> >> > > > > > > > > > > something we might want to make configurable then,
> I guess.
> >> > > > > > > > > > > - Versioning the JobGraph in the JobGraphStore
> rather than
> >> > > > > > > > overwriting
> >> > > > > > > > > it
> >> > > > > > > > > > > might be an idea. This would enable us to provide
> resource
> >> > > > > > > > requirement
> >> > > > > > > > > > > changes in the UI or through the REST API. It is
> related to a
> >> > > > > > > problem
> >> > > > > > > > > > > around keeping track of the exception history
> within the
> >> > > > > > > > > > AdaptiveScheduler
> >> > > > > > > > > > > and also having to consider multiple versions of a
> JobGraph.
> >> > > > > But
> >> > > > > > > for
> >> > > > > > > > > that
> >> > > > > > > > > > > one, we use the ExecutionGraphInfoStore right now.
> >> > > > > > > > > > > - Updating the JobGraph in the JobGraphStore makes
> sense. I'm
> >> > > > > > just
> >> > > > > > > > > > > wondering whether we bundle two things together
> that are
> >> > > > > actually
> >> > > > > > > > > > separate:
> >> > > > > > > > > > > The business logic and the execution configuration
> (the
> >> > > > > resource
> >> > > > > > > > > > > requirements). I'm aware that this is not a flaw of
> the current
> >> > > > > > > FLIP
> >> > > > > > > > > but
> >> > > > > > > > > > > rather something that was not necessary to address
> in the past
> >> > > > > > > > because
> >> > > > > > > > > > the
> >> > > > > > > > > > > JobGraph was kind of static. I don't remember
> whether that was
> >> > > > > > > > already
> >> > > > > > > > > > > discussed while working on the AdaptiveScheduler
> for FLIP-160
> >> > > > > > [1].
> >> > > > > > > > > Maybe,
> >> > > > > > > > > > > I'm missing some functionality here that requires
> us to have
> >> > > > > > > > everything
> >> > > > > > > > > > in
> >> > > > > > > > > > > one place. But it feels like updating the entire
> JobGraph which
> >> > > > > > > could
> >> > > > > > > > > be
> >> > > > > > > > > > > actually a "config change" is not reasonable.
> ...also
> >> > > > > considering
> >> > > > > > > the
> >> > > > > > > > > > > amount of data that can be stored in a
> ConfigMap/ZooKeeper node
> >> > > > > > if
> >> > > > > > > > > > > versioning the resource requirement change as
> proposed in my
> >> > > > > > > previous
> >> > > > > > > > > > item
> >> > > > > > > > > > > is an option for us.
> >> > > > > > > > > > > - Updating the JobGraphStore means adding more
> requests to the
> >> > > > > HA
> >> > > > > > > > > backend
> >> > > > > > > > > > > API. There were some concerns shared in the
> discussion thread
> >> > > > > [2]
> >> > > > > > > for
> >> > > > > > > > > > > FLIP-270 [3] on pressuring the k8s API server in
> the past with
> >> > > > > > too
> >> > > > > > > > many
> >> > > > > > > > > > > calls. Even though it's more likely to be caused by
> >> > > > > > checkpointing,
> >> > > > > > > I
> >> > > > > > > > > > still
> >> > > > > > > > > > > wanted to bring it up. We're working on a
> standardized
> >> > > > > > performance
> >> > > > > > > > test
> >> > > > > > > > > > to
> >> > > > > > > > > > > prepare going forward with FLIP-270 [3] right now.
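The resource-ID / compare-and-set idea from the first item above can be sketched with a toy store; the class and method names are illustrative assumptions, not Flink APIs:

```python
# Toy compare-and-set store illustrating the resource-ID idea: the external
# system reads the requirements together with a version and must present
# that version when writing, so concurrent writers are detected.
class RequirementsStore:
    def __init__(self, requirements):
        self.version = 0
        self.requirements = dict(requirements)

    def get(self):
        # The external system retrieves the payload together with its version.
        return self.version, dict(self.requirements)

    def patch(self, expected_version, new_requirements):
        # Reject the write if another writer updated the store in between.
        if expected_version != self.version:
            raise RuntimeError("concurrent update detected")
        self.requirements = dict(new_requirements)
        self.version += 1
        return self.version

store = RequirementsStore({"v1": (1, 4)})
version, _ = store.get()
store.patch(version, {"v1": (1, 8)})        # succeeds, bumps the version
try:
    store.patch(version, {"v1": (1, 2)})    # stale version is rejected
except RuntimeError as err:
    print(err)                              # concurrent update detected
```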
> >> > > > > > > > > > >
> >> > > > > > > > > > > Best,
> >> > > > > > > > > > > Matthias
> >> > > > > > > > > > >
> >> > > > > > > > > > > [1]
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> >> > > > > > > > > > > [2]
> >> > > > > > >
> https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> >> > > > > > > > > > > [3]
> >> > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <
> jam.gzczy@gmail.com
> >> > > > > >
> >> > > > > > > > wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > Hi David:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > Thank you for driving this FLIP, which helps reduce
> >> > > > > > > > > > > > Flink shutdown time.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > For this FLIP, I would like to share a few ideas:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >    - when the number of "slots" is insufficient, can we
> >> > > > > > > > > > > >    stop users from rescaling, or tell the user "not
> >> > > > > > > > > > > >    enough available slots to upgrade, please check your
> >> > > > > > > > > > > >    available slots"? Or we could have a request
> >> > > > > > > > > > > >    switch (true/false) to allow this behavior
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >    - when the user updates job-vertex parallelism, I want
> >> > > > > > > > > > > >    to have an interface to query the status of the ongoing
> >> > > > > > > > > > > >    parallelism update, so that the user or a program can
> >> > > > > > > > > > > >    understand the current status
> >> > > > > > > > > > > >    - such a status interface would also help with
> >> > > > > > > > > > > >    management, similar to the *[1] Flink K8S Operator*
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > {
> >> > > > > > > > > > > >   status: Failed
> >> > > > > > > > > > > >   reason: "not enough available slots to upgrade,
> >> > > > > > > > > > > >   please check your available slots"
> >> > > > > > > > > > > > }
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >    - *Pending*: this job has joined the upgrade queue;
> >> > > > > > > > > > > >    it will be updated later
> >> > > > > > > > > > > >    - *Rescaling*: the job is rescaling; wait for it to finish
> >> > > > > > > > > > > >    - *Finished*: the rescale is done
> >> > > > > > > > > > > >    - *Failed*: something went wrong, so this job cannot
> >> > > > > > > > > > > >    be upgraded
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > I would like to add the above content to the FLIP;
> >> > > > > > > > > > > > what do you think?
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >    1.
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > >
> >> > > > >
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > David Morávek <dm...@apache.org> wrote on Fri, Feb 3, 2023
> at 16:42:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > Hi everyone,
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > This FLIP [1] introduces a new REST API for
> declaring
> >> > > > > > resource
> >> > > > > > > > > > > > requirements
> >> > > > > > > > > > > > > for the Adaptive Scheduler. There seems to be a
> clear need
> >> > > > > > for
> >> > > > > > > > this
> >> > > > > > > > > > API
> >> > > > > > > > > > > > > based on the discussion on the "Reworking the
> Rescale API"
> >> > > > > > [2]
> >> > > > > > > > > > thread.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Before we get started, this work is heavily
> based on the
> >> > > > > > > > prototype
> >> > > > > > > > > > [3]
> >> > > > > > > > > > > > > created by Till Rohrmann, and the FLIP is being
> published
> >> > > > > > with
> >> > > > > > > > his
> >> > > > > > > > > > > > consent.
> >> > > > > > > > > > > > > Big shoutout to him!
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Last and not least, thanks to Chesnay and Roman
> for the
> >> > > > > > initial
> >> > > > > > > > > > reviews
> >> > > > > > > > > > > > and
> >> > > > > > > > > > > > > discussions.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > The best start would be watching a short demo
> [4] that I've
> >> > > > > > > > > recorded,
> >> > > > > > > > > > > > which
> >> > > > > > > > > > > > > illustrates newly added capabilities (rescaling
> the running
> >> > > > > > > job,
> >> > > > > > > > > > > handing
> >> > > > > > > > > > > > > back resources to the RM, and session cluster
> support).
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > The intuition behind the FLIP is being able to
> define
> >> > > > > > resource
> >> > > > > > > > > > > > requirements
> >> > > > > > > > > > > > > ("resource boundaries") externally that the
> >> > > > > AdaptiveScheduler
> >> > > > > > > can
> >> > > > > > > > > > > > navigate
> >> > > > > > > > > > > > > within. This is a building block for
> higher-level efforts
> >> > > > > > such
> >> > > > > > > as
> >> > > > > > > > > an
> >> > > > > > > > > > > > > external Autoscaler. The natural extension of
> this work
> >> > > > > would
> >> > > > > > > be
> >> > > > > > > > to
> >> > > > > > > > > > > allow
> >> > > > > > > > > > > > > to specify per-vertex ResourceProfiles.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Looking forward to your thoughts; any feedback
> is
> >> > > > > > appreciated!
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > [1]
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> >> > > > > > > > > > > > > [2]
> >> > > > > > > > >
> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> >> > > > > > > > > > > > > [3]
> https://github.com/tillrohrmann/flink/tree/autoscaling
> >> > > > > > > > > > > > > [4]
> >> > > > > > > > > > > >
> >> > > > > > > > >
> >> > > > > >
> https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Best,
> >> > > > > > > > > > > > > D.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > --
> >> > > > > > > > > > > > Best
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > ConradJam
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by Maximilian Michels <mx...@apache.org>.
Thanks David! If we could get the pre-allocation working as part of
the FLIP, that would be great.

Concerning the downscale case, I agree this is a special case for the
(single-job) application mode where we could re-allocate slots in a
way that could leave entire task managers unoccupied, which we would
then be able to release. The goal essentially is to reduce slot
fragmentation on scale down by packing the slots efficiently. The
easiest way to add this optimization when running in application mode
would be to drop as many task managers during the restart such that
NUM_REQUIRED_SLOTS >= NUM_AVAILABLE_SLOTS stays true. We can look into
this independently of the FLIP.
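The packing idea above can be sketched as follows; this is an assumption about how such an optimization could work, not Flink's actual algorithm:

```python
import math

# Sketch of the slot-packing idea: after downscaling, keep only as many
# task managers as are needed to cover the required slots, so the
# remainder can be released whole instead of staying fragmented.
def task_managers_to_release(slots_per_tm, tm_count, required_slots):
    needed_tms = math.ceil(required_slots / slots_per_tm)
    return max(0, tm_count - needed_tms)

# 10 TMs with 4 slots each and a new requirement of 26 slots:
# ceil(26 / 4) = 7 TMs are still needed, so 3 can be released.
print(task_managers_to_release(4, 10, 26))  # 3
```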

Feel free to start the vote.

-Max

On Mon, Feb 20, 2023 at 9:10 AM David Morávek <dm...@apache.org> wrote:
>
> Hi everyone,
>
> Thanks for the feedback! I've updated the FLIP to use idempotent PUT API instead of PATCH and to properly handle lower bound settings, to support the "pre-allocation" of the resources.
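As a sketch of how external tooling might call such an idempotent endpoint — the URL path and payload shape are assumptions based on this thread, and the request is only constructed here, not sent:

```python
import json
import urllib.request

# Assumed endpoint path and payload shape for illustration; not the
# authoritative FLIP-291 API definition.
def build_put(base_url, job_id, requirements):
    body = json.dumps(requirements).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/jobs/{job_id}/resource-requirements",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

req = build_put(
    "http://localhost:8081",
    "6b0e3078dcdf2524ad47800afca7b364",  # hypothetical job ID
    {"bc764cd8ddf7a0cff126f51c16239658":
        {"parallelism": {"lowerBound": 1, "upperBound": 4}}},
)
print(req.get_method(), req.full_url)
```

Because PUT replaces the full requirements document, retrying the same request is safe, which is the point of moving away from PATCH.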
>
> @Max
>
> > How hard would it be to address this issue in the FLIP?
>
> I've included this in the FLIP. It might not be too hard to implement this in the end.
>
> > B) drop as many superfluous task managers as needed
>
> I've intentionally left this part out for now because this ultimately needs to be the responsibility of the Resource Manager. After all, in the Session Cluster scenario, the Scheduler doesn't have the bigger picture of other tasks of other jobs running on those TMs. This will most likely be a topic for another FLIP.
>
> WDYT? If there are no other questions or concerns, I'd like to start the vote on Wednesday.
>
> Best,
> D.
>
> On Wed, Feb 15, 2023 at 3:34 PM Maximilian Michels <mx...@apache.org> wrote:
>>
>> I missed that the FLIP states:
>>
>> > Currently, even though we’d expose the lower bound for clarity and API completeness, we won’t allow setting it to any other value than one until we have full support throughout the stack.
>>
>> How hard would it be to address this issue in the FLIP?
>>
>> There is not much value in offering a lower bound setting that won't be
>> respected / throws an error when it is set. If we had support for a
>> lower bound, we could enforce a resource contract externally via
>> setting lowerBound == upperBound. That ties back to the Rescale API
>> discussion we had. I want to better understand what the major concerns
>> would be around allowing this.
>>
>> Just to outline how I imagine the logic to work:
>>
>> A) The resource constraints are already met => Nothing changes
>> B) More resources available than required => Cancel the job, drop as
>> many superfluous task managers as needed, restart the job
>> C) Less resources available than required => Acquire new task
>> managers, wait for them to register, cancel and restart the job
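The three cases above can be sketched as a small decision table; using strict equality for case A is a simplification of "constraints already met":

```python
# Toy decision table for the three rescale cases described above;
# a sketch of the proposed behavior, not Flink's implementation.
def plan_rescale(available_slots, required_slots):
    if available_slots == required_slots:
        return "nothing changes"
    if available_slots > required_slots:
        return "cancel, drop superfluous task managers, restart"
    return "acquire task managers, wait for registration, cancel, restart"

print(plan_rescale(8, 8))
print(plan_rescale(10, 8))
print(plan_rescale(6, 8))
```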
>>
>> I'm open to helping out with the implementation.
>>
>> -Max
>>
>> On Mon, Feb 13, 2023 at 7:45 PM Maximilian Michels <mx...@apache.org> wrote:
>> >
>> > Based on further discussion I had with Chesnay on this PR [1], I think
>> > jobs would currently go into a restarting state after the resource
>> > requirements have changed. This wouldn't achieve what we had in mind,
>> > i.e. sticking to the old resource requirements until enough slots are
>> > available to fulfil the new resource requirements. So this may not be
>> > 100% what we need but it could be extended to do what we want.
>> >
>> > -Max
>> >
>> > [1] https://github.com/apache/flink/pull/21908#discussion_r1104792362
>> >
>> > On Mon, Feb 13, 2023 at 7:16 PM Maximilian Michels <mx...@apache.org> wrote:
>> > >
>> > > Hi David,
>> > >
>> > > This is awesome! Great writeup and demo. This is pretty much what we
>> > > need for the autoscaler as part of the Flink Kubernetes operator [1].
>> > > Scaling Flink jobs effectively is hard but fortunately we have solved
>> > > the issue as part of the Flink Kubernetes operator. The only critical
>> > > piece we are missing is a better way to execute scaling decisions, as
>> > > discussed in [2].
>> > >
>> > > Looking at your proposal, we would set lowerBound == upperBound for
>> > > the parallelism because we want to fully determine the parallelism
>> > > externally based on the scaling metrics. Does that sound right?
>> > >
>> > > What is the timeline for these changes? Is there a JIRA?
>> > >
>> > > Cheers,
>> > > Max
>> > >
>> > > [1] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
>> > > [2] https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
>> > >
>> > > On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu <xi...@gmail.com> wrote:
>> > > >
>> > > > Hi David,
>> > > >
>> > > > Thanks for your reply. I think your response totally makes sense. This
>> > > > FLIP targets declaring required resources to the ResourceManager instead of
>> > > > using the ResourceManager to add/remove TMs directly.
>> > > >
>> > > > Best,
>> > > > Xiangyu
>> > > >
>> > > >
>> > > >
>> > > > David Morávek <da...@gmail.com> wrote on Mon, Feb 13, 2023 at 15:46:
>> > > >
>> > > > > Hi everyone,
>> > > > >
>> > > > > @Shammon
>> > > > >
>> > > > > I'm not entirely sure what "config file" you're referring to. You can, of
>> > > > > course, override the default parallelism in "flink-conf.yaml", but for
>> > > > > sinks and sources, the parallelism needs to be tweaked on the connector
>> > > > > level ("WITH" statement).
>> > > > >
>> > > > > This is something that should be achieved with tooling around Flink. We
>> > > > > want to provide an API on the lowest level that generalizes well. Achieving
>> > > > > what you're describing should be straightforward with this API.
>> > > > >
>> > > > > @Xiangyu
>> > > > >
>> > > > > Is it possible for this REST API to declare TM resources in the future?
>> > > > >
>> > > > >
>> > > > > Would you like to add/remove TMs if you use an active Resource Manager?
>> > > > > This would be out of the scope of this effort since it targets the
>> > > > > scheduler component only (we make no assumptions about the used Resource
>> > > > > Manager). Also, the AdaptiveScheduler is only intended to be used for
>> > > > > Streaming.
>> > > > >
>> > > > >  And for streaming jobs, I'm wondering if there is any situation we need to
>> > > > > > rescale the TM resources of a flink cluster at first and then the
>> > > > > adaptive
>> > > > > > scheduler will rescale the per-vertex ResourceProfiles accordingly.
>> > > > > >
>> > > > >
>> > > > > We plan on adding support for the ResourceProfiles (dynamic slot
>> > > > > allocation) as the next step. Again we won't make any assumptions about the
>> > > > > used Resource Manager. In other words, this effort ends by declaring
>> > > > > desired resources to the Resource Manager.
>> > > > >
>> > > > > Does that make sense?
>> > > > >
>> > > > > @Matthias
>> > > > >
>> > > > > We've done another pass on the proposed API and currently lean towards
>> > > > > having an idempotent PUT API.
>> > > > > - We don't care too much about multiple writers' scenarios in terms of who
>> > > > > can write an authoritative payload; this is up to the user of the API to
>> > > > > figure out
>> > > > > - It's indeed tricky to achieve atomicity with PATCH API; switching to PUT
>> > > > > API seems to do the trick
>> > > > > - We won't allow partial "payloads" anymore, meaning you need to define
>> > > > > requirements for all vertices in the JobGraph; This is completely fine for
>> > > > > the programmatic workflows. For DEBUG / DEMO purposes, you can use the GET
>> > > > > endpoint and tweak the response to avoid writing the whole payload by hand.
>> > > > >
>> > > > > WDYT?
>> > > > >
>> > > > >
>> > > > > Best,
>> > > > > D.
>> > > > >
>> > > > > On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <xi...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi David,
>> > > > > >
>> > > > > > Thanks for creating this FLIP. I think this work is very useful,
>> > > > > > especially in the autoscaling scenario.  I would like to share some questions
>> > > > > > from my view.
>> > > > > >
>> > > > > > 1, Is it possible for this REST API to declare TM resources in the
>> > > > > future?
>> > > > > > I'm asking because we are building the autoscaling feature for Flink OLAP
>> > > > > > Session Cluster in ByteDance. We need to rescale the cluster's resource
>> > > > > on
>> > > > > > TM level instead of Job level. It would be very helpful if we have a REST
>> > > > > > API for our external Autoscaling service to use.
>> > > > > >
>> > > > > > 2, And for streaming jobs, I'm wondering if there is any situation we
>> > > > > need
>> > > > > > to rescale the TM resources of a flink cluster at first and then the
>> > > > > > adaptive scheduler will rescale the per-vertex ResourceProfiles
>> > > > > > accordingly.
>> > > > > >
>> > > > > > best.
>> > > > > > Xiangyu
>> > > > > >
>> > > > > > > Shammon FY <zj...@gmail.com> wrote on Thu, Feb 9, 2023 at 11:31:
>> > > > > >
>> > > > > > > Hi David
>> > > > > > >
>> > > > > > > Thanks for your answer.
>> > > > > > >
>> > > > > > > > Can you elaborate more about how you'd intend to use the endpoint? I
>> > > > > > > think we can ultimately introduce a way of re-declaring "per-vertex
>> > > > > > > defaults," but I'd like to understand the use case a bit more first.
>> > > > > > >
>> > > > > > > For this issue, I mainly consider the consistency of user configuration
>> > > > > > and
>> > > > > > > job runtime. For SQL jobs, users usually set specific parallelism for
>> > > > > > > source and sink, and set a global parallelism for other operators.
>> > > > > These
>> > > > > > > config items are stored in a config file. For some high-priority jobs,
>> > > > > > > users may want to manage them manually.
>> > > > > > > 1. When users need to scale the parallelism, they should update the
>> > > > > > config
>> > > > > > > file and restart the Flink job, which may take a long time.
>> > > > > > > 2. After providing the REST API, users can just send a request to the
>> > > > > job
>> > > > > > > via REST API quickly after updating the config file.
>> > > > > > > The configuration in the running job and config file should be the
>> > > > > same.
>> > > > > > > What do you think of this?
>> > > > > > >
>> > > > > > > best.
>> > > > > > > Shammon
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <da...@gmail.com>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi everyone,
>> > > > > > > >
>> > > > > > > > Let's try to answer the questions one by one.
>> > > > > > > >
>> > > > > > > > *@ConradJam*
>> > > > > > > >
>> > > > > > > > when the number of "slots" is insufficient, can we stop users
>> > > > > > > > rescaling or throw something to tell the user "not enough
>> > > > > > > > available slots to upgrade, please check your available slots"?
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > The main property of AdaptiveScheduler is that it can adapt to
>> > > > > > "available
>> > > > > > > > resources," which means you're still able to make progress even
>> > > > > though
>> > > > > > > you
>> > > > > > > > didn't get all the slots you've asked for. Let's break down the pros
>> > > > > > and
>> > > > > > > > cons of this property.
>> > > > > > > >
>> > > > > > > > - (plus) If you lose a TM for some reason, you can still recover even
>> > > > > > if
>> > > > > > > it
>> > > > > > > > doesn't come back. We still need to give it some time to eliminate
>> > > > > > > > unnecessary rescaling, which can be controlled by setting
>> > > > > > > > "resource-stabilization-timeout."
>> > > > > > > > - (plus) The resources can arrive with a significant delay. For
>> > > > > > example,
>> > > > > > > > you're unable to spawn enough TMs on time because you've run out of
>> > > > > > > > resources in your k8s cluster, and you need to wait for the cluster
>> > > > > > auto
>> > > > > > > > scaler to kick in and add new nodes to the cluster. In this scenario,
>> > > > > > > > you'll be able to start making progress faster, at the cost of
>> > > > > multiple
>> > > > > > > > rescalings (once the remaining resources arrive).
>> > > > > > > > - (plus) This plays well with the declarative manner of today's
>> > > > > > > > infrastructure. For example, you tell k8s that you need 10 TMs, and
>> > > > > > > you'll
>> > > > > > > > eventually get them.
>> > > > > > > > - (minus) In the case of large state jobs, the cost of multiple
>> > > > > > > rescalings
>> > > > > > > > might outweigh the above.
>> > > > > > > >
>> > > > > > > > We've already touched on the solution to this problem on the FLIP.
>> > > > > > Please
>> > > > > > > > notice the parallelism knob being a range with a lower and upper
>> > > > > bound.
>> > > > > > > > Setting both the lower and upper bound to the same value could give
>> > > > > the
>> > > > > > > > behavior you're describing at the cost of giving up some properties
>> > > > > > that
>> > > > > > > AS
>> > > > > > > > gives you (you'd be falling back to the DefaultScheduler's behavior).
>> > > > > > > >
>> > > > > > > > when user upgrade job-vertx-parallelism . I want to have an interface
>> > > > > > to
>> > > > > > > > > query the current update parallel execution status, so that the
>> > > > > user
>> > > > > > or
>> > > > > > > > > program can understand the current status
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > This is a misunderstanding. We're not introducing the RESCALE
>> > > > > endpoint.
>> > > > > > > > This endpoint allows you to re-declare the resources needed to run
>> > > > > the
>> > > > > > > job.
>> > > > > > > > Once you reach the desired resources (you get more resources than the
>> > > > > > > lower
>> > > > > > > > bound defines), your job will run.
>> > > > > > > >
>> > > > > > > > We can expose a similar endpoint to "resource requirements" to give
>> > > > > you
>> > > > > > > an
>> > > > > > > > overview of the resources the vertices already have. You can already
>> > > > > > get
>> > > > > > > > this from the REST API, so exposing this in yet another way should be
>> > > > > > > > considered carefully.
>> > > > > > > >
>> > > > > > > > *@Matthias*
>> > > > > > > >
>> > > > > > > > I'm wondering whether it makes sense to add some kind of resource ID
>> > > > > to
>> > > > > > > the
>> > > > > > > > > REST API.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > That's a good question. I want to think about that and get back to
>> > > > > the
>> > > > > > > > question later. My main struggle when thinking about this is, "if
>> > > > > this
>> > > > > > > > would be an idempotent POST endpoint," would it be any different?
>> > > > > > > >
>> > > > > > > > How often do we allow resource requirements to be changed?
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > There shall be no rate limiting on the Flink side. If this is
>> > > > > > > > something your environment needs, you can achieve it on a
>> > > > > > > > different layer ("we can't have Flink do everything").
>> > > > > > > >
>> > > > > > > > Versioning the JobGraph in the JobGraphStore rather than overwriting
>> > > > > it
>> > > > > > > > > might be an idea.
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > This sounds interesting since it would be closer to the JobGraph
>> > > > > being
>> > > > > > > > immutable. The main problem I see here is that this would introduce a
>> > > > > > > > BW-incompatible change so it might be a topic for follow-up FLIP.
>> > > > > > > >
>> > > > > > > > I'm just wondering whether we bundle two things together that are
>> > > > > > > actually
>> > > > > > > > > separate
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > Yup, this is how we think about it as well. The main question is,
>> > > > > "who
>> > > > > > > > should be responsible for bookkeeping 1) the JobGraph and 2) the
>> > > > > > > > JobResourceRequirements". The JobMaster would be the right place for
>> > > > > > > both,
>> > > > > > > > but it's currently not the case, and we're tightly coupling the
>> > > > > > > dispatcher
>> > > > > > > > with the JobMaster.
>> > > > > > > >
>> > > > > > > > Initially, we tried to introduce a separate HA component in JobMaster
>> > > > > > for
>> > > > > > > > bookkeeping the JobResourceRequirements, but that proved to be a more
>> > > > > > > > significant effort adding additional mess to the already messy HA
>> > > > > > > > ecosystem. Another approach we've discussed was mutating the JobGraph
>> > > > > > and
>> > > > > > > > setting JRR into the JobGraph structure itself.
>> > > > > > > >
>> > > > > > > > The middle ground for keeping this effort reasonably sized and not
>> > > > > > > > violating "we want to keep JG immutable" too much is keeping the
>> > > > > > > > JobResourceRequirements separate as an internal config option in
>> > > > > > > JobGraph's
>> > > > > > > > configuration.
>> > > > > > > >
>> > > > > > > > We ultimately need to rethink the tight coupling of Dispatcher and
>> > > > > > > > JobMaster, but it needs to be a separate effort.
>> > > > > > > >
>> > > > > > > > ...also considering the amount of data that can be stored in a
>> > > > > > > > > ConfigMap/ZooKeeper node if versioning the resource requirement
>> > > > > > change
>> > > > > > > as
>> > > > > > > > > proposed in my previous item is an option for us.
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > AFAIK we're only storing pointers to the S3 objects in HA metadata,
>> > > > > so
>> > > > > > we
>> > > > > > > > should be okay with having larger structures for now.
>> > > > > > > >
>> > > > > > > > Updating the JobGraphStore means adding more requests to the HA
>> > > > > backend
>> > > > > > > > API.
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > It's fine unless you intend to override the resource requirements a
>> > > > > few
>> > > > > > > > times per second.
>> > > > > > > >
>> > > > > > > > *@Shammon*
>> > > > > > > >
>> > > > > > > > How about adding some more information such as vertex type
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > Since it was intended as a "debug" endpoint, it makes complete sense!
>> > > > > > > >
>> > > > > > > >  For sql jobs, we always use a unified parallelism for most vertices.
>> > > > > > Can
>> > > > > > > > > we provide them with a more convenient setting method instead of
>> > > > > each
>> > > > > > > > one?
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > I completely feel with this. The main thoughts when designing the API
>> > > > > > > were:
>> > > > > > > > - We want to keep it clean and easy to understand.
>> > > > > > > > - Global parallelism can be modeled using per-vertex parallelism but
>> > > > > > not
>> > > > > > > > the other way around.
>> > > > > > > > - The API will be used by external tooling (operator, auto scaler).
>> > > > > > > >
>> > > > > > > > Can you elaborate more about how you'd intend to use the endpoint? I
>> > > > > > > think
>> > > > > > > > we can ultimately introduce a way of re-declaring "per-vertex
>> > > > > > defaults,"
>> > > > > > > > but I'd like to understand the use case a bit more first.
>> > > > > > > >
>> > > > > > > > *@Weijie*
>> > > > > > > >
>> > > > > > > > What is the default value here (based on what configuration), or just
>> > > > > > > > > infinite?
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > > Currently, for the lower bound, it's always one, and for the upper
>> > > > > > bound,
>> > > > > > > > it's either parallelism (if defined) or the maxParallelism of the
>> > > > > > vertex
>> > > > > > > in
>> > > > > > > > JobGraph. This question might be another signal for making the
>> > > > > defaults
>> > > > > > > > explicit (see the answer to Shammon's question above).
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Thanks, everyone, for your initial thoughts!
>> > > > > > > >
>> > > > > > > > Best,
>> > > > > > > > D.
>> > > > > > > >
>> > > > > > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <guoweijiereswqa@gmail.com
>> > > > > >
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Thanks David for driving this. This is a very valuable work,
>> > > > > > especially
>> > > > > > > > for
>> > > > > > > > > cloud native environment.
>> > > > > > > > >
>> > > > > > > > > >> How about adding some more information such as vertex type
>> > > > > > > > > (SOURCE/MAP/JOIN and .etc) in the response of `get jobs
>> > > > > > > > > resource-requirements`? For users, only vertex-id may be difficult
>> > > > > to
>> > > > > > > > > understand.
>> > > > > > > > >
>> > > > > > > > > +1 for this suggestion; including the job vertex's name in the
>> > > > > > > > > response body is more user-friendly.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > I saw this sentence in FLIP: "Setting the upper bound to -1 will
>> > > > > > reset
>> > > > > > > > the
>> > > > > > > > > value to the default setting."  What is the default value here
>> > > > > (based
>> > > > > > > on
>> > > > > > > > > what configuration), or just infinite?
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Best regards,
>> > > > > > > > >
>> > > > > > > > > Weijie
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Shammon FY <zj...@gmail.com> wrote on Mon, Feb 6, 2023 at 18:06:
>> > > > > > > > >
>> > > > > > > > > > Hi David
>> > > > > > > > > >
>> > > > > > > > > > Thanks for initiating this discussion. I think declaring job
>> > > > > > resource
>> > > > > > > > > > requirements by REST API is very valuable. I just left some
>> > > > > > comments
>> > > > > > > as
>> > > > > > > > > > followed
>> > > > > > > > > >
>> > > > > > > > > > 1) How about adding some more information such as vertex type
>> > > > > > > > > > (SOURCE/MAP/JOIN and .etc) in the response of `get jobs
>> > > > > > > > > > resource-requirements`? For users, only vertex-id may be
>> > > > > difficult
>> > > > > > to
>> > > > > > > > > > understand.
>> > > > > > > > > >
>> > > > > > > > > > 2) For sql jobs, we always use a unified parallelism for most
>> > > > > > > vertices.
>> > > > > > > > > Can
>> > > > > > > > > > we provide them with a more convenient setting method instead of
>> > > > > > each
>> > > > > > > > > one?
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > Best,
>> > > > > > > > > > Shammon
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <
>> > > > > > matthias.pohl@aiven.io
>> > > > > > > > > > .invalid>
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Thanks David for creating this FLIP. It sounds promising and
>> > > > > > useful
>> > > > > > > > to
>> > > > > > > > > > > have. Here are some thoughts from my side (some of them might
>> > > > > be
>> > > > > > > > > rather a
>> > > > > > > > > > > follow-up and not necessarily part of this FLIP):
>> > > > > > > > > > > - I'm wondering whether it makes sense to add some kind of
>> > > > > > resource
>> > > > > > > > ID
>> > > > > > > > > to
>> > > > > > > > > > > the REST API. This would give Flink a tool to verify the PATCH
>> > > > > > > > request
>> > > > > > > > > of
>> > > > > > > > > > > the external system in a compare-and-set kind of manner. AFAIU,
>> > > > > > the
>> > > > > > > > > > process
>> > > > > > > > > > > requires the external system to retrieve the resource
>> > > > > > requirements
>> > > > > > > > > first
>> > > > > > > > > > > (to retrieve the vertex IDs). A resource ID <ABC> would be sent
>> > > > > > > along
>> > > > > > > > > as
>> > > > > > > > > > a
>> > > > > > > > > > > unique identifier for the provided setup. It's essentially the
>> > > > > > > > version
>> > > > > > > > > ID
>> > > > > > > > > > > of the currently deployed resource requirement configuration.
>> > > > > > Flink
>> > > > > > > > > > doesn't
>> > > > > > > > > > > know whether the external system would use the provided
>> > > > > > information
>> > > > > > > > in
>> > > > > > > > > > some
>> > > > > > > > > > > way to derive a new set of resource requirements for this job.
>> > > > > > The
>> > > > > > > > > > > subsequent PATCH request with updated resource requirements
>> > > > > would
>> > > > > > > > > include
>> > > > > > > > > > > the previously retrieved resource ID <ABC>. The PATCH call
>> > > > > would
>> > > > > > > fail
>> > > > > > > > > if
>> > > > > > > > > > > there was a concurrent PATCH call in between indicating to the
>> > > > > > > > external
>> > > > > > > > > > > system that the resource requirements were concurrently
>> > > > > updated.
>> > > > > > > > > > > - How often do we allow resource requirements to be changed?
>> > > > > That
>> > > > > > > > > > question
>> > > > > > > > > > > might make my previous comment on the resource ID obsolete
>> > > > > > because
>> > > > > > > we
>> > > > > > > > > > could
>> > > > > > > > > > > just make any PATCH call fail if there was a resource
>> > > > > requirement
>> > > > > > > > > update
>> > > > > > > > > > > within a certain time frame before the request. But such a time
>> > > > > > > > period
>> > > > > > > > > is
>> > > > > > > > > > > something we might want to make configurable then, I guess.
>> > > > > > > > > > > - Versioning the JobGraph in the JobGraphStore rather than
>> > > > > > > > overwriting
>> > > > > > > > > it
>> > > > > > > > > > > might be an idea. This would enable us to provide resource
>> > > > > > > > requirement
>> > > > > > > > > > > changes in the UI or through the REST API. It is related to a
>> > > > > > > problem
>> > > > > > > > > > > around keeping track of the exception history within the
>> > > > > > > > > > AdaptiveScheduler
>> > > > > > > > > > > and also having to consider multiple versions of a JobGraph.
>> > > > > But
>> > > > > > > for
>> > > > > > > > > that
>> > > > > > > > > > > one, we use the ExecutionGraphInfoStore right now.
>> > > > > > > > > > > - Updating the JobGraph in the JobGraphStore makes sense. I'm
>> > > > > > just
>> > > > > > > > > > > wondering whether we bundle two things together that are
>> > > > > actually
>> > > > > > > > > > separate:
>> > > > > > > > > > > The business logic and the execution configuration (the
>> > > > > resource
>> > > > > > > > > > > requirements). I'm aware that this is not a flaw of the current
>> > > > > > > FLIP
>> > > > > > > > > but
>> > > > > > > > > > > rather something that was not necessary to address in the past
>> > > > > > > > because
>> > > > > > > > > > the
>> > > > > > > > > > > JobGraph was kind of static. I don't remember whether that was
>> > > > > > > > already
>> > > > > > > > > > > discussed while working on the AdaptiveScheduler for FLIP-160
>> > > > > > [1].
>> > > > > > > > > Maybe,
>> > > > > > > > > > > I'm missing some functionality here that requires us to have
>> > > > > > > > everything
>> > > > > > > > > > in
>> > > > > > > > > > > one place. But it feels like updating the entire JobGraph which
>> > > > > > > could
>> > > > > > > > > be
>> > > > > > > > > > > actually a "config change" is not reasonable. ...also
>> > > > > considering
>> > > > > > > the
>> > > > > > > > > > > amount of data that can be stored in a ConfigMap/ZooKeeper node
>> > > > > > if
>> > > > > > > > > > > versioning the resource requirement change as proposed in my
>> > > > > > > previous
>> > > > > > > > > > item
>> > > > > > > > > > > is an option for us.
>> > > > > > > > > > > - Updating the JobGraphStore means adding more requests to the
>> > > > > HA
>> > > > > > > > > backend
>> > > > > > > > > > > API. There were some concerns shared in the discussion thread
>> > > > > [2]
>> > > > > > > for
>> > > > > > > > > > > FLIP-270 [3] on pressuring the k8s API server in the past with
>> > > > > > too
>> > > > > > > > many
>> > > > > > > > > > > calls. Even though it's more likely to be caused by
>> > > > > > checkpointing,
>> > > > > > > I
>> > > > > > > > > > still
>> > > > > > > > > > > wanted to bring it up. We're working on a standardized
>> > > > > > performance
>> > > > > > > > test
>> > > > > > > > > > to
>> > > > > > > > > > > prepare going forward with FLIP-270 [3] right now.
>> > > > > > > > > > >
>> > > > > > > > > > > Best,
>> > > > > > > > > > > Matthias
>> > > > > > > > > > >
>> > > > > > > > > > > [1]
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
>> > > > > > > > > > > [2]
>> > > > > > > https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
>> > > > > > > > > > > [3]
>> > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
>> > > > > > > > > > >
>> > > > > > > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <jam.gzczy@gmail.com
>> > > > > >
>> > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Hi David:
>> > > > > > > > > > > >
>> > > > > > > > > > > > Thank you for driving this FLIP, which helps reduce Flink
>> > > > > > > > > > > > shutdown time.
>> > > > > > > > > > > >
>> > > > > > > > > > > > For this FLIP, I would like to share a few ideas:
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >    - when the number of "slots" is insufficient, can we
>> > > > > > > > > > > >    stop users rescaling or throw something to tell the user
>> > > > > > > > > > > >    "not enough available slots to upgrade, please check
>> > > > > > > > > > > >    your available slots"? Or we could have a request
>> > > > > > > > > > > >    switch (true/false) to allow this behavior
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >    - when the user upgrades the job-vertex parallelism, I
>> > > > > > > > > > > >    want to have an interface to query the status of the
>> > > > > > > > > > > >    parallelism update, so that the user or a program can
>> > > > > > > > > > > >    understand the current status
>> > > > > > > > > > > >    - such an interface for querying the update status would
>> > > > > > > > > > > >    also help tooling such as the *[1] Flink K8s Operator*
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > {
>> > > > > > > > > > > >   status: Failed
>> > > > > > > > > > > >   reason: "not enough available slots to upgrade, please
>> > > > > > > > > > > >   check your available slots"
>> > > > > > > > > > > > }
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >    - *Pending*: the job has joined the upgrade queue and
>> > > > > > > > > > > >    will be updated later
>> > > > > > > > > > > >    - *Rescaling*: the job is rescaling; wait for it to
>> > > > > > > > > > > >    finish
>> > > > > > > > > > > >    - *Finished*: the update is done
>> > > > > > > > > > > >    - *Failed*: something went wrong, so this job cannot be
>> > > > > > > > > > > >    upgraded
>> > > > > > > > > > > >
>> > > > > > > > > > > > I would like to add the above content to the FLIP; what do
>> > > > > > > > > > > > you think?
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >    1.
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > >
>> > > > > > >
>> > > > > https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > David Morávek <dm...@apache.org> wrote on Fri, Feb 3, 2023 at 16:42:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > Hi everyone,
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > This FLIP [1] introduces a new REST API for declaring
>> > > > > > resource
>> > > > > > > > > > > > requirements
>> > > > > > > > > > > > > for the Adaptive Scheduler. There seems to be a clear need
>> > > > > > for
>> > > > > > > > this
>> > > > > > > > > > API
>> > > > > > > > > > > > > based on the discussion on the "Reworking the Rescale API"
>> > > > > > [2]
>> > > > > > > > > > thread.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Before we get started, this work is heavily based on the
>> > > > > > > > prototype
>> > > > > > > > > > [3]
>> > > > > > > > > > > > > created by Till Rohrmann, and the FLIP is being published
>> > > > > > with
>> > > > > > > > his
>> > > > > > > > > > > > consent.
>> > > > > > > > > > > > > Big shoutout to him!
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Last and not least, thanks to Chesnay and Roman for the
>> > > > > > initial
>> > > > > > > > > > reviews
>> > > > > > > > > > > > and
>> > > > > > > > > > > > > discussions.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > The best start would be watching a short demo [4] that I've
>> > > > > > > > > recorded,
>> > > > > > > > > > > > which
>> > > > > > > > > > > > > illustrates newly added capabilities (rescaling the running
>> > > > > > > job,
>> > > > > > > > > > > handing
>> > > > > > > > > > > > > back resources to the RM, and session cluster support).
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > The intuition behind the FLIP is being able to define
>> > > > > > resource
>> > > > > > > > > > > > requirements
>> > > > > > > > > > > > > ("resource boundaries") externally that the
>> > > > > AdaptiveScheduler
>> > > > > > > can
>> > > > > > > > > > > > navigate
>> > > > > > > > > > > > > within. This is a building block for higher-level efforts
>> > > > > > such
>> > > > > > > as
>> > > > > > > > > an
>> > > > > > > > > > > > > external Autoscaler. The natural extension of this work
>> > > > > would
>> > > > > > > be
>> > > > > > > > to
>> > > > > > > > > > > allow
>> > > > > > > > > > > > > to specify per-vertex ResourceProfiles.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Looking forward to your thoughts; any feedback is
>> > > > > > appreciated!
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > [1]
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
>> > > > > > > > > > > > > [2]
>> > > > > > > > > https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
>> > > > > > > > > > > > > [3] https://github.com/tillrohrmann/flink/tree/autoscaling
>> > > > > > > > > > > > > [4]
>> > > > > > > > > > > >
>> > > > > > > > >
>> > > > > > https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Best,
>> > > > > > > > > > > > > D.
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > --
>> > > > > > > > > > > > Best
>> > > > > > > > > > > >
>> > > > > > > > > > > > ConradJam
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by David Morávek <dm...@apache.org>.
Hi everyone,

Thanks for the feedback! I've updated the FLIP to use idempotent PUT API
instead of PATCH and to properly handle lower bound settings, to support
the "pre-allocation" of the resources.
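To make the shape of the idempotent PUT concrete, here is a minimal client-side sketch; the payload field names ("jobVertexResourceRequirements", "lowerBound", "upperBound") and the vertex IDs are illustrative assumptions based on this discussion, not the final API.

```python
import json

# Hypothetical payload for the idempotent PUT endpoint discussed above.
# With PUT (unlike PATCH), the client re-declares the requirements for ALL
# vertices of the JobGraph; re-sending the same document is a no-op, which
# sidesteps the atomicity concerns raised in this thread.
requirements = {
    "bc764cd8ddf7a0cff126f51c16239658": {  # example vertex ID
        "parallelism": {"lowerBound": 1, "upperBound": 4},
    },
    "0a448493b4782967b150582570326227": {
        "parallelism": {"lowerBound": 1, "upperBound": 8},
    },
}

def validate(reqs):
    """Client-side sanity check before issuing the PUT request."""
    for vertex_id, spec in reqs.items():
        bounds = spec["parallelism"]
        lower, upper = bounds["lowerBound"], bounds["upperBound"]
        if not 1 <= lower <= upper:
            raise ValueError(f"invalid bounds for {vertex_id}: {lower}..{upper}")

validate(requirements)
body = json.dumps({"jobVertexResourceRequirements": requirements})
```

An external autoscaler would compute this document from its scaling metrics and PUT it as a whole, rather than patching individual vertices.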

@Max

> How hard would it be to address this issue in the FLIP?

I've included this in the FLIP. It might not be too hard to implement this
in the end.

> B) drop as many superfluous task managers as needed

I've intentionally left this part out for now because this ultimately needs
to be the responsibility of the Resource Manager. After all, in the Session
Cluster scenario, the Scheduler doesn't have the bigger picture of other
tasks of other jobs running on those TMs. This will most likely be a topic
for another FLIP.
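For intuition, the scheduler-side behavior the thread keeps circling around (wait below the lower bound, run within the bounds, hand surplus slots back) can be sketched as follows; the function and the returned action names are illustrative, not Flink's actual implementation.

```python
def rescale_decision(available_slots, lower_bound, upper_bound):
    """Sketch of how a scheduler could act on declared parallelism bounds.

    Below the lower bound the job cannot (re)start, between the bounds it
    runs with whatever is available, and slots beyond the upper bound could
    be handed back to the ResourceManager.
    """
    if available_slots < lower_bound:
        return ("wait", 0)  # keep waiting for resources to stabilize
    parallelism = min(available_slots, upper_bound)
    surplus = available_slots - parallelism
    return ("run", parallelism) if surplus == 0 else ("run-and-release", parallelism)

# With lower == upper, the behavior degenerates to a fixed parallelism,
# which is what an external autoscaler enforcing exact decisions would use:
assert rescale_decision(3, 4, 4) == ("wait", 0)
assert rescale_decision(4, 4, 4) == ("run", 4)
assert rescale_decision(6, 2, 4) == ("run-and-release", 4)
```

The "run-and-release" branch is the part deferred to the ResourceManager above, since in a session cluster the scheduler cannot see other jobs' slots on the same TMs.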

WDYT? If there are no other questions or concerns, I'd like to start the
vote on Wednesday.

Best,
D.

On Wed, Feb 15, 2023 at 3:34 PM Maximilian Michels <mx...@apache.org> wrote:

> I missed that the FLIP states:
>
> > Currently, even though we’d expose the lower bound for clarity and API
> completeness, we won’t allow setting it to any other value than one until
> we have full support throughout the stack.
>
> How hard would it be to address this issue in the FLIP?
>
> There is not much value in offering a lower bound setting that won't be
> respected / throws an error when it is set. If we had support for a
> lower bound, we could enforce a resource contract externally via
> setting lowerBound == upperBound. That ties back to the Rescale API
> discussion we had. I want to better understand what the major concerns
> would be around allowing this.
>
> Just to outline how I imagine the logic to work:
>
> A) The resource constraints are already met => Nothing changes
> B) More resources available than required => Cancel the job, drop as
> many superfluous task managers as needed, restart the job
> C) Less resources available than required => Acquire new task
> managers, wait for them to register, cancel and restart the job
>
> I'm open to helping out with the implementation.
>
> -Max
>
> On Mon, Feb 13, 2023 at 7:45 PM Maximilian Michels <mx...@apache.org> wrote:
> >
> > Based on further discussion I had with Chesnay on this PR [1], I think
> > jobs would currently go into a restarting state after the resource
> > requirements have changed. This wouldn't achieve what we had in mind,
> > i.e. sticking to the old resource requirements until enough slots are
> > available to fulfil the new resource requirements. So this may not be
> > 100% what we need but it could be extended to do what we want.
> >
> > -Max
> >
> > [1] https://github.com/apache/flink/pull/21908#discussion_r1104792362
> >
> > On Mon, Feb 13, 2023 at 7:16 PM Maximilian Michels <mx...@apache.org>
> wrote:
> > >
> > > Hi David,
> > >
> > > This is awesome! Great writeup and demo. This is pretty much what we
> > > need for the autoscaler as part of the Flink Kubernetes operator [1].
> > > Scaling Flink jobs effectively is hard but fortunately we have solved
> > > the issue as part of the Flink Kubernetes operator. The only critical
> > > piece we are missing is a better way to execute scaling decisions, as
> > > discussed in [2].
> > >
> > > Looking at your proposal, we would set lowerBound == upperBound for
> > > the parallelism because we want to fully determine the parallelism
> > > externally based on the scaling metrics. Does that sound right?
> > >
> > > What is the timeline for these changes? Is there a JIRA?
> > >
> > > Cheers,
> > > Max
> > >
> > > [1]
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
> > > [2] https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > >
> > > On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu <xi...@gmail.com>
> wrote:
> > > >
> > > > Hi David,
> > > >
> > > > Thanks for your reply.  I think your response totally makes sense. This
> > > > FLIP targets declaring the required resources to the ResourceManager
> > > > instead of using the ResourceManager to add/remove TMs directly.
> > > >
> > > > Best,
> > > > Xiangyu
> > > >
> > > >
> > > >
> > > > > David Morávek <da...@gmail.com> wrote on Mon, Feb 13, 2023 at 15:46:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > @Shammon
> > > > >
> > > > > I'm not entirely sure what "config file" you're referring to. You
> can, of
> > > > > course, override the default parallelism in "flink-conf.yaml", but
> for
> > > > > sinks and sources, the parallelism needs to be tweaked on the
> connector
> > > > > level ("WITH" statement).
> > > > >
> > > > > This is something that should be achieved with tooling around
> Flink. We
> > > > > want to provide an API on the lowest level that generalizes well.
> Achieving
> > > > > what you're describing should be straightforward with this API.
> > > > >
> > > > > @Xiangyu
> > > > >
> > > > > Is it possible for this REST API to declare TM resources in the
> future?
> > > > >
> > > > >
> > > > > Would you like to add/remove TMs if you use an active Resource
> Manager?
> > > > > This would be out of the scope of this effort since it targets the
> > > > > scheduler component only (we make no assumptions about the used
> Resource
> > > > > Manager). Also, the AdaptiveScheduler is only intended to be used
> for
> > > > > Streaming.
> > > > >
> > > > >  And for streaming jobs, I'm wondering if there is any situation
> we need to
> > > > > > rescale the TM resources of a flink cluster at first and then the
> > > > > adaptive
> > > > > > scheduler will rescale the per-vertex ResourceProfiles
> accordingly.
> > > > > >
> > > > >
> > > > > We plan on adding support for the ResourceProfiles (dynamic slot
> > > > > allocation) as the next step. Again we won't make any assumptions
> about the
> > > > > used Resource Manager. In other words, this effort ends by
> declaring
> > > > > desired resources to the Resource Manager.
> > > > >
> > > > > Does that make sense?
> > > > >
> > > > > @Matthias
> > > > >
> > > > > We've done another pass on the proposed API and currently lean
> towards
> > > > > having an idempotent PUT API.
> > > > > - We don't care too much about multiple writers' scenarios in
> terms of who
> > > > > can write an authoritative payload; this is up to the user of the
> API to
> > > > > figure out
> > > > > - It's indeed tricky to achieve atomicity with PATCH API;
> switching to PUT
> > > > > API seems to do the trick
> > > > > - We won't allow partial "payloads" anymore, meaning you need to
> define
> > > > > requirements for all vertices in the JobGraph; This is completely
> fine for
> > > > > the programmatic workflows. For DEBUG / DEMO purposes, you can use
> the GET
> > > > > endpoint and tweak the response to avoid writing the whole payload
> by hand.
> > > > >
> > > > > WDYT?
> > > > >
> > > > >
> > > > > Best,
> > > > > D.
> > > > >
> > > > > On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <
> xiangyu0xf@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi David,
> > > > > >
> > > > > > Thanks for creating this FLIP. I think this work is very useful,
> > > > > > especially in the autoscaling scenario. I would like to share some
> > > > > > questions from my side.
> > > > > >
> > > > > > 1, Is it possible for this REST API to declare TM resources in
> the
> > > > > future?
> > > > > > I'm asking because we are building the autoscaling feature for
> Flink OLAP
> > > > > > Session Cluster in ByteDance. We need to rescale the cluster's
> resource
> > > > > on
> > > > > > TM level instead of Job level. It would be very helpful if we
> have a REST
> > > > > > API for our external Autoscaling service to use.
> > > > > >
> > > > > > 2, And for streaming jobs, I'm wondering if there is any
> situation we
> > > > > need
> > > > > > to rescale the TM resources of a flink cluster at first and then
> the
> > > > > > adaptive scheduler will rescale the per-vertex ResourceProfiles
> > > > > > accordingly.
> > > > > >
> > > > > > best.
> > > > > > Xiangyu
> > > > > >
> > > > > > Shammon FY <zj...@gmail.com> wrote on Thu, Feb 9, 2023 at 11:31:
> > > > > >
> > > > > > > Hi David
> > > > > > >
> > > > > > > Thanks for your answer.
> > > > > > >
> > > > > > > > Can you elaborate more about how you'd intend to use the
> endpoint? I
> > > > > > > think we can ultimately introduce a way of re-declaring
> "per-vertex
> > > > > > > defaults," but I'd like to understand the use case a bit more
> first.
> > > > > > >
> > > > > > > For this issue, I mainly consider the consistency of user
> configuration
> > > > > > and
> > > > > > > job runtime. For sql jobs, users usually set specific
> parallelism for
> > > > > > > source and sink, and set a global parallelism for other
> operators.
> > > > > These
> > > > > > > config items are stored in a config file. For some
> high-priority jobs,
> > > > > > > users may want to manage them manually.
> > > > > > > 1. When users need to scale the parallelism, they should
> update the
> > > > > > config
> > > > > > > file and restart the Flink job, which may take a long time.
> > > > > > > 2. After providing the REST API, users can just send a request
> to the
> > > > > job
> > > > > > > via REST API quickly after updating the config file.
> > > > > > > The configuration in the running job and config file should be
> the
> > > > > same.
> > > > > > > What do you think of this?
> > > > > > >
> > > > > > > best.
> > > > > > > Shammon
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <
> david.moravek@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > Let's try to answer the questions one by one.
> > > > > > > >
> > > > > > > > *@ConradJam*
> > > > > > > >
> > > > > > > > > when the number of "slots" is insufficient, can we stop users
> > > > > > > > > rescaling or throw something to tell user "less available slots
> > > > > > > > > to upgrade, please check your available slots" ?
> > > > > > > > >
> > > > > > > >
> > > > > > > > The main property of AdaptiveScheduler is that it can adapt
> to
> > > > > > "available
> > > > > > > > resources," which means you're still able to make progress
> even
> > > > > though
> > > > > > > you
> > > > > > > > didn't get all the slots you've asked for. Let's break down
> the pros
> > > > > > and
> > > > > > > > cons of this property.
> > > > > > > >
> > > > > > > > - (plus) If you lose a TM for some reason, you can still
> recover even
> > > > > > if
> > > > > > > it
> > > > > > > > doesn't come back. We still need to give it some time to
> eliminate
> > > > > > > > unnecessary rescaling, which can be controlled by setting
> > > > > > > > "resource-stabilization-timeout."
> > > > > > > > - (plus) The resources can arrive with a significant delay.
> For
> > > > > > example,
> > > > > > > > you're unable to spawn enough TMs on time because you've run
> out of
> > > > > > > > resources in your k8s cluster, and you need to wait for the
> cluster
> > > > > > auto
> > > > > > > > scaler to kick in and add new nodes to the cluster. In this
> scenario,
> > > > > > > > you'll be able to start making progress faster, at the cost
> of
> > > > > multiple
> > > > > > > > rescalings (once the remaining resources arrive).
> > > > > > > > - (plus) This plays well with the declarative manner of
> today's
> > > > > > > > infrastructure. For example, you tell k8s that you need 10
> TMs, and
> > > > > > > you'll
> > > > > > > > eventually get them.
> > > > > > > > - (minus) In the case of large state jobs, the cost of
> multiple
> > > > > > > rescalings
> > > > > > > > might outweigh the above.
> > > > > > > >
> > > > > > > > We've already touched on the solution to this problem on the
> FLIP.
> > > > > > Please
> > > > > > > > notice the parallelism knob being a range with a lower and
> upper
> > > > > bound.
> > > > > > > > Setting both the lower and upper bound to the same value
> could give
> > > > > the
> > > > > > > > behavior you're describing at the cost of giving up some
> properties
> > > > > > that
> > > > > > > AS
> > > > > > > > gives you (you'd be falling back to the DefaultScheduler's
> behavior).
> > > > > > > >
> > > > > > > > when user upgrade job-vertx-parallelism . I want to have an
> interface
> > > > > > to
> > > > > > > > > query the current update parallel execution status, so
> that the
> > > > > user
> > > > > > or
> > > > > > > > > program can understand the current status
> > > > > > > > >
> > > > > > > >
> > > > > > > > This is a misunderstanding. We're not introducing the RESCALE
> > > > > endpoint.
> > > > > > > > This endpoint allows you to re-declare the resources needed
> to run
> > > > > the
> > > > > > > job.
> > > > > > > > Once you reach the desired resources (you get more resources
> than the
> > > > > > > lower
> > > > > > > > bound defines), your job will run.
> > > > > > > >
> > > > > > > > We can expose a similar endpoint to "resource requirements"
> to give
> > > > > you
> > > > > > > an
> > > > > > > > overview of the resources the vertices already have. You can
> already
> > > > > > get
> > > > > > > > this from the REST API, so exposing this in yet another way
> should be
> > > > > > > > considered carefully.
> > > > > > > >
> > > > > > > > *@Matthias*
> > > > > > > >
> > > > > > > > I'm wondering whether it makes sense to add some kind of
> resource ID
> > > > > to
> > > > > > > the
> > > > > > > > > REST API.
> > > > > > > >
> > > > > > > >
> > > > > > > > That's a good question. I want to think about that and get
> back to
> > > > > the
> > > > > > > > question later. My main struggle when thinking about this
> is, "if
> > > > > this
> > > > > > > > would be an idempotent POST endpoint," would it be any
> different?
> > > > > > > >
> > > > > > > > How often do we allow resource requirements to be changed?
> > > > > > > >
> > > > > > > >
> > > > > > > > There shall be no rate limiting on the FLINK side. If this is
> > > > > something
> > > > > > > > your environment needs, you can achieve it on a different
> layer ("we
> > > > > > > can't
> > > > > > > > have FLINK to do everything").
> > > > > > > >
> > > > > > > > Versioning the JobGraph in the JobGraphStore rather than
> overwriting
> > > > > it
> > > > > > > > > might be an idea.
> > > > > > > > >
> > > > > > > >
> > > > > > > > This sounds interesting since it would be closer to the
> JobGraph
> > > > > being
> > > > > > > > immutable. The main problem I see here is that this would
> introduce a
> > > > > > > > BW-incompatible change so it might be a topic for follow-up
> FLIP.
> > > > > > > >
> > > > > > > > I'm just wondering whether we bundle two things together
> that are
> > > > > > > actually
> > > > > > > > > separate
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yup, this is how we think about it as well. The main
> question is,
> > > > > "who
> > > > > > > > should be responsible for bookkeeping 1) the JobGraph and 2)
> the
> > > > > > > > JobResourceRequirements". The JobMaster would be the right
> place for
> > > > > > > both,
> > > > > > > > but it's currently not the case, and we're tightly coupling
> the
> > > > > > > dispatcher
> > > > > > > > with the JobMaster.
> > > > > > > >
> > > > > > > > Initially, we tried to introduce a separate HA component in
> JobMaster
> > > > > > for
> > > > > > > > bookkeeping the JobResourceRequirements, but that proved to
> be a more
> > > > > > > > significant effort adding additional mess to the already
> messy HA
> > > > > > > > ecosystem. Another approach we've discussed was mutating the
> JobGraph
> > > > > > and
> > > > > > > > setting JRR into the JobGraph structure itself.
> > > > > > > >
> > > > > > > > The middle ground for keeping this effort reasonably sized
> and not
> > > > > > > > violating "we want to keep JG immutable" too much is keeping
> the
> > > > > > > > JobResourceRequirements separate as an internal config
> option in
> > > > > > > JobGraph's
> > > > > > > > configuration.
> > > > > > > >
> > > > > > > > We ultimately need to rethink the tight coupling of
> Dispatcher and
> > > > > > > > JobMaster, but it needs to be a separate effort.
> > > > > > > >
> > > > > > > > ...also considering the amount of data that can be stored in
> a
> > > > > > > > > ConfigMap/ZooKeeper node if versioning the resource
> requirement
> > > > > > change
> > > > > > > as
> > > > > > > > > proposed in my previous item is an option for us.
> > > > > > > > >
> > > > > > > >
> > > > > > > > AFAIK we're only storing pointers to the S3 objects in HA
> metadata,
> > > > > so
> > > > > > we
> > > > > > > > should be okay with having larger structures for now.
> > > > > > > >
> > > > > > > > Updating the JobGraphStore means adding more requests to the
> HA
> > > > > backend
> > > > > > > > API.
> > > > > > > > >
> > > > > > > >
> > > > > > > > It's fine unless you intend to override the resource
> requirements a
> > > > > few
> > > > > > > > times per second.
> > > > > > > >
> > > > > > > > *@Shammon*
> > > > > > > >
> > > > > > > > How about adding some more information such as vertex type
> > > > > > > > >
> > > > > > > >
> > > > > > > > Since it was intended as a "debug" endpoint, it makes
> complete sense!
> > > > > > > >
> > > > > > > >  For sql jobs, we always use a unified parallelism for most
> vertices.
> > > > > > Can
> > > > > > > > > we provide them with a more convenient setting method
> instead of
> > > > > each
> > > > > > > > one?
> > > > > > > >
> > > > > > > >
> > > > > > > > I completely feel with this. The main thoughts when
> designing the API
> > > > > > > were:
> > > > > > > > - We want to keep it clean and easy to understand.
> > > > > > > > - Global parallelism can be modeled using per-vertex
> parallelism but
> > > > > > not
> > > > > > > > the other way around.
> > > > > > > > - The API will be used by external tooling (operator, auto
> scaler).
> > > > > > > >
> > > > > > > > Can you elaborate more about how you'd intend to use the
> endpoint? I
> > > > > > > think
> > > > > > > > we can ultimately introduce a way of re-declaring "per-vertex
> > > > > > defaults,"
> > > > > > > > but I'd like to understand the use case a bit more first.
> > > > > > > >
> > > > > > > > *@Weijie*
> > > > > > > >
> > > > > > > > What is the default value here (based on what
> configuration), or just
> > > > > > > > > infinite?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Currently, for the lower bound, it's always one, and for the
> upper
> > > > > > bound,
> > > > > > > > it's either parallelism (if defined) or the maxParallelism
> of the
> > > > > > vertex
> > > > > > > in
> > > > > > > > JobGraph. This question might be another signal for making
> the
> > > > > defaults
> > > > > > > > explicit (see the answer to Shammon's question above).
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks, everyone, for your initial thoughts!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > D.
> > > > > > > >
> > > > > > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <
> guoweijiereswqa@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks David for driving this. This is very valuable work,
> > > > > > > > > especially for cloud-native environments.
> > > > > > > > >
> > > > > > > > > >> How about adding some more information such as vertex
> type
> > > > > > > > > (SOURCE/MAP/JOIN, etc.) in the response of `get jobs
> > > > > > > > > resource-requirements`? For users, only vertex-id may be
> difficult
> > > > > to
> > > > > > > > > understand.
> > > > > > > > >
> > > > > > > > > +1 for this suggestion, including the JobVertex name in the
> response
> > > > > > body
> > > > > > > > is
> > > > > > > > > more
> > > > > > > > > user-friendly.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I saw this sentence in FLIP: "Setting the upper bound to
> -1 will
> > > > > > reset
> > > > > > > > the
> > > > > > > > > value to the default setting."  What is the default value
> here
> > > > > (based
> > > > > > > on
> > > > > > > > > what configuration), or just infinite?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > >
> > > > > > > > > Weijie
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Shammon FY <zj...@gmail.com> wrote on Mon, Feb 6, 2023 at 18:06:
> > > > > > > > >
> > > > > > > > > > Hi David
> > > > > > > > > >
> > > > > > > > > > Thanks for initiating this discussion. I think declaring
> job
> > > > > > resource
> > > > > > > > > > requirements by REST API is very valuable. I just left
> some
> > > > > > comments
> > > > > > > as
> > > > > > > > > > followed
> > > > > > > > > >
> > > > > > > > > > 1) How about adding some more information such as vertex
> type
> > > > > > > > > > (SOURCE/MAP/JOIN, etc.) in the response of `get jobs
> > > > > > > > > > resource-requirements`? For users, only vertex-id may be
> > > > > difficult
> > > > > > to
> > > > > > > > > > understand.
> > > > > > > > > >
> > > > > > > > > > 2) For sql jobs, we always use a unified parallelism for
> most
> > > > > > > vertices.
> > > > > > > > > Can
> > > > > > > > > > we provide them with a more convenient setting method
> instead of
> > > > > > each
> > > > > > > > > one?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Shammon
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <
> > > > > > matthias.pohl@aiven.io
> > > > > > > > > > .invalid>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks David for creating this FLIP. It sounds
> promising and
> > > > > > useful
> > > > > > > > to
> > > > > > > > > > > have. Here are some thoughts from my side (some of
> them might
> > > > > be
> > > > > > > > > rather a
> > > > > > > > > > > follow-up and not necessarily part of this FLIP):
> > > > > > > > > > > - I'm wondering whether it makes sense to add some
> kind of
> > > > > > resource
> > > > > > > > ID
> > > > > > > > > to
> > > > > > > > > > > the REST API. This would give Flink a tool to verify
> the PATCH
> > > > > > > > request
> > > > > > > > > of
> > > > > > > > > > > the external system in a compare-and-set kind of
> manner. AFAIU,
> > > > > > the
> > > > > > > > > > process
> > > > > > > > > > > requires the external system to retrieve the resource
> > > > > > requirements
> > > > > > > > > first
> > > > > > > > > > > (to retrieve the vertex IDs). A resource ID <ABC>
> would be sent
> > > > > > > along
> > > > > > > > > as
> > > > > > > > > > a
> > > > > > > > > > > unique identifier for the provided setup. It's
> essentially the
> > > > > > > > version
> > > > > > > > > ID
> > > > > > > > > > > of the currently deployed resource requirement
> configuration.
> > > > > > Flink
> > > > > > > > > > doesn't
> > > > > > > > > > > know whether the external system would use the provided
> > > > > > information
> > > > > > > > in
> > > > > > > > > > some
> > > > > > > > > > > way to derive a new set of resource requirements for
> this job.
> > > > > > The
> > > > > > > > > > > subsequent PATCH request with updated resource
> requirements
> > > > > would
> > > > > > > > > include
> > > > > > > > > > > the previously retrieved resource ID <ABC>. The PATCH
> call
> > > > > would
> > > > > > > fail
> > > > > > > > > if
> > > > > > > > > > > there was a concurrent PATCH call in between
> indicating to the
> > > > > > > > external
> > > > > > > > > > > system that the resource requirements were concurrently
> > > > > updated.
> > > > > > > > > > > - How often do we allow resource requirements to be
> changed?
> > > > > That
> > > > > > > > > > question
> > > > > > > > > > > might make my previous comment on the resource ID
> obsolete
> > > > > > because
> > > > > > > we
> > > > > > > > > > could
> > > > > > > > > > > just make any PATCH call fail if there was a resource
> > > > > requirement
> > > > > > > > > update
> > > > > > > > > > > within a certain time frame before the request. But
> such a time
> > > > > > > > period
> > > > > > > > > is
> > > > > > > > > > > something we might want to make configurable then, I
> guess.
> > > > > > > > > > > - Versioning the JobGraph in the JobGraphStore rather
> than
> > > > > > > > overwriting
> > > > > > > > > it
> > > > > > > > > > > might be an idea. This would enable us to provide
> resource
> > > > > > > > requirement
> > > > > > > > > > > changes in the UI or through the REST API. It is
> related to a
> > > > > > > problem
> > > > > > > > > > > around keeping track of the exception history within
> the
> > > > > > > > > > AdaptiveScheduler
> > > > > > > > > > > and also having to consider multiple versions of a
> JobGraph.
> > > > > But
> > > > > > > for
> > > > > > > > > that
> > > > > > > > > > > one, we use the ExecutionGraphInfoStore right now.
> > > > > > > > > > > - Updating the JobGraph in the JobGraphStore makes
> sense. I'm
> > > > > > just
> > > > > > > > > > > wondering whether we bundle two things together that
> are
> > > > > actually
> > > > > > > > > > separate:
> > > > > > > > > > > The business logic and the execution configuration (the
> > > > > resource
> > > > > > > > > > > requirements). I'm aware that this is not a flaw of
> the current
> > > > > > > FLIP
> > > > > > > > > but
> > > > > > > > > > > rather something that was not necessary to address in
> the past
> > > > > > > > because
> > > > > > > > > > the
> > > > > > > > > > > JobGraph was kind of static. I don't remember whether
> that was
> > > > > > > > already
> > > > > > > > > > > discussed while working on the AdaptiveScheduler for
> FLIP-160
> > > > > > [1].
> > > > > > > > > Maybe,
> > > > > > > > > > > I'm missing some functionality here that requires us
> to have
> > > > > > > > everything
> > > > > > > > > > in
> > > > > > > > > > > one place. But it feels like updating the entire
> JobGraph which
> > > > > > > could
> > > > > > > > > be
> > > > > > > > > > > actually a "config change" is not reasonable. ...also
> > > > > considering
> > > > > > > the
> > > > > > > > > > > amount of data that can be stored in a
> ConfigMap/ZooKeeper node
> > > > > > if
> > > > > > > > > > > versioning the resource requirement change as proposed
> in my
> > > > > > > previous
> > > > > > > > > > item
> > > > > > > > > > > is an option for us.
> > > > > > > > > > > - Updating the JobGraphStore means adding more
> requests to the
> > > > > HA
> > > > > > > > > backend
> > > > > > > > > > > API. There were some concerns shared in the discussion
> thread
> > > > > [2]
> > > > > > > for
> > > > > > > > > > > FLIP-270 [3] on pressuring the k8s API server in the
> past with
> > > > > > too
> > > > > > > > many
> > > > > > > > > > > calls. Even though it's more likely to be caused by
> > > > > > checkpointing,
> > > > > > > I
> > > > > > > > > > still
> > > > > > > > > > > wanted to bring it up. We're working on a standardized
> > > > > > performance
> > > > > > > > test
> > > > > > > > > > to
> > > > > > > > > > > prepare going forward with FLIP-270 [3] right now.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Matthias
> > > > > > > > > > >
> > > > > > > > > > > [1]
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> > > > > > > > > > > [2]
> > > > > > >
> https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> > > > > > > > > > > [3]
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <
> jam.gzczy@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi David:
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you for driving this FLIP, which helps reduce Flink
> > > > > > > > > > > > downtime
> > > > > > > > > > > >
> > > > > > > > > > > > For this FLIP, I would like to share a few ideas
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >    - when the number of "slots" is insufficient, can we
> > > > > > > > > > > >    stop users rescaling or throw something to tell user
> > > > > > > > > > > >    "less available slots to upgrade, please check your
> > > > > > > > > > > >    available slots"? Or we could have a request switch
> > > > > > > > > > > >    (true/false) to allow this behavior
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >    - when users upgrade job-vertex parallelism, I want to
> > > > > > > > > > > >    have an interface to query the current status of the
> > > > > > > > > > > >    parallelism update, so that the user or program can
> > > > > > > > > > > >    understand the current status
> > > > > > > > > > > >    - I want to have an interface to query the status of the
> > > > > > > > > > > >    current parallelism update. This also helps with
> > > > > > > > > > > >    management, similar to the *[1] Flink K8S Operator*
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > {
> > > > > > > > > > > >   status: Failed
> > > > > > > > > > > >   reason: "less available slots to upgrade, please check
> > > > > > > > > > > >   your available slots"
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >    - *Pending*: this job has joined the upgrade queue; it
> > > > > > > > > > > >    will be updated later
> > > > > > > > > > > >    - *Rescaling*: the job is rescaling; wait for it to finish
> > > > > > > > > > > >    - *Finished*: the upgrade is done
> > > > > > > > > > > >    - *Failed*: something went wrong, so this job cannot be
> > > > > > > > > > > >    upgraded
> > > > > > > > > > > >
> > > > > > > > > > > > I would like to add the above content to the FLIP; what do
> > > > > > > > > > > > you think?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >    1.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > David Morávek <dm...@apache.org> wrote on Fri, Feb 3,
> > > > > > > > > > > > 2023 at 16:42:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > >
> > > > > > > > > > > > > This FLIP [1] introduces a new REST API for
> declaring
> > > > > > resource
> > > > > > > > > > > > requirements
> > > > > > > > > > > > > for the Adaptive Scheduler. There seems to be a
> clear need
> > > > > > for
> > > > > > > > this
> > > > > > > > > > API
> > > > > > > > > > > > > based on the discussion on the "Reworking the
> Rescale API"
> > > > > > [2]
> > > > > > > > > > thread.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Before we get started, this work is heavily based
> on the
> > > > > > > > prototype
> > > > > > > > > > [3]
> > > > > > > > > > > > > created by Till Rohrmann, and the FLIP is being
> published
> > > > > > with
> > > > > > > > his
> > > > > > > > > > > > consent.
> > > > > > > > > > > > > Big shoutout to him!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Last but not least, thanks to Chesnay and Roman
> for the
> > > > > > initial
> > > > > > > > > > reviews
> > > > > > > > > > > > and
> > > > > > > > > > > > > discussions.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The best start would be watching a short demo [4]
> that I've
> > > > > > > > > recorded,
> > > > > > > > > > > > which
> > > > > > > > > > > > > illustrates newly added capabilities (rescaling
> the running
> > > > > > > job,
> > > > > > > > > > > handing
> > > > > > > > > > > > > back resources to the RM, and session cluster
> support).
> > > > > > > > > > > > >
> > > > > > > > > > > > > The intuition behind the FLIP is being able to
> define
> > > > > > resource
> > > > > > > > > > > > requirements
> > > > > > > > > > > > > ("resource boundaries") externally that the
> > > > > AdaptiveScheduler
> > > > > > > can
> > > > > > > > > > > > navigate
> > > > > > > > > > > > > within. This is a building block for higher-level
> efforts
> > > > > > such
> > > > > > > as
> > > > > > > > > an
> > > > > > > > > > > > > external Autoscaler. The natural extension of this
> work
> > > > > would
> > > > > > > be
> > > > > > > > to
> > > > > > > > > > > allow
> > > > > > > > > > > > > to specify per-vertex ResourceProfiles.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Looking forward to your thoughts; any feedback is
> > > > > > appreciated!
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1]
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > > > > > > > > > > [2]
> > > > > > > > >
> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > > > > > > > > > > > > [3]
> https://github.com/tillrohrmann/flink/tree/autoscaling
> > > > > > > > > > > > > [4]
> > > > > > > > > > > >
> > > > > > > > >
> > > > > >
> https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > D.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best
> > > > > > > > > > > >
> > > > > > > > > > > > ConradJam
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by Maximilian Michels <mx...@apache.org>.
I missed that the FLIP states:

> Currently, even though we’d expose the lower bound for clarity and API completeness, we won’t allow setting it to any other value than one until we have full support throughout the stack.

How hard would it be to address this issue in the FLIP?

There is not much value in offering a lower bound that won't be
respected or that throws an error when it is set. If we had support for a
lower bound, we could enforce a resource contract externally via
setting lowerBound == upperBound. That ties back to the Rescale API
discussion we had. I want to better understand what the major concerns
would be around allowing this.
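For illustration, here is a rough sketch (in Python) of how an external process could pin the parallelism by declaring lowerBound == upperBound through the proposed PUT endpoint. The payload shape and the vertex ID are assumptions based on my reading of the FLIP draft, not a confirmed wire format:

```python
import json

def pinned_requirements(vertex_parallelism):
    """Build a resource-requirements payload that pins each vertex by
    setting lowerBound == upperBound. The field names here are
    assumptions based on the FLIP draft, not a confirmed wire format."""
    return {
        vertex_id: {"parallelism": {"lowerBound": p, "upperBound": p}}
        for vertex_id, p in vertex_parallelism.items()
    }

# Hypothetical vertex ID; a real client would first GET the current
# requirements from the same endpoint to discover the vertex IDs.
payload = pinned_requirements({"bc764cd8ddf7a0cff126f51c16239658": 4})
body = json.dumps(payload)  # body of PUT /jobs/<jobid>/resource-requirements
```

A client would then PUT this body back; since the API is idempotent, retrying the same request is safe.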

Just to outline how I imagine the logic to work:

A) The resource constraints are already met => Nothing changes
B) More resources available than required => Cancel the job, drop as
many superfluous task managers as needed, restart the job
C) Less resources available than required => Acquire new task
managers, wait for them to register, cancel and restart the job
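The three cases above could be sketched like this (plain Python pseudologic for illustration only; the function and action names are made up and this is not Flink's actual scheduler code):

```python
def decide(required_slots: int, available_slots: int) -> str:
    """Sketch of the A/B/C decision outlined above; illustrative only."""
    if available_slots == required_slots:
        # A) Resource constraints already met => nothing changes.
        return "noop"
    if available_slots > required_slots:
        # B) Superfluous resources => cancel the job, drop the extra
        #    task managers, restart the job.
        return "cancel-drop-restart"
    # C) Too few resources => acquire new task managers, wait for them
    #    to register, then cancel and restart the job.
    return "acquire-wait-restart"
```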

I'm open to helping out with the implementation.

-Max

On Mon, Feb 13, 2023 at 7:45 PM Maximilian Michels <mx...@apache.org> wrote:
>
> Based on further discussion I had with Chesnay on this PR [1], I think
> jobs would currently go into a restarting state after the resource
> requirements have changed. This wouldn't achieve what we had in mind,
> i.e. sticking to the old resource requirements until enough slots are
> available to fulfil the new resource requirements. So this may not be
> 100% what we need but it could be extended to do what we want.
>
> -Max
>
> [1] https://github.com/apache/flink/pull/21908#discussion_r1104792362
>
> On Mon, Feb 13, 2023 at 7:16 PM Maximilian Michels <mx...@apache.org> wrote:
> >
> > Hi David,
> >
> > This is awesome! Great writeup and demo. This is pretty much what we
> > need for the autoscaler as part of the Flink Kubernetes operator [1].
> > Scaling Flink jobs effectively is hard but fortunately we have solved
> > the issue as part of the Flink Kubernetes operator. The only critical
> > piece we are missing is a better way to execute scaling decisions, as
> > discussed in [2].
> >
> > Looking at your proposal, we would set lowerBound == upperBound for
> > the parallelism because we want to fully determine the parallelism
> > externally based on the scaling metrics. Does that sound right?
> >
> > What is the timeline for these changes? Is there a JIRA?
> >
> > Cheers,
> > Max
> >
> > [1] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
> > [2] https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> >
> > On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu <xi...@gmail.com> wrote:
> > >
> > > Hi David,
> > >
> > > Thanks for your reply.  I think your response totally makes sense.  This
> > > FLIP targets declaring required resources to the ResourceManager instead of
> > > using the ResourceManager to add/remove TMs directly.
> > >
> > > Best,
> > > Xiangyu
> > >
> > >
> > >
> > > David Morávek <da...@gmail.com> wrote on Mon, Feb 13, 2023 at 15:46:
> > >
> > > > Hi everyone,
> > > >
> > > > @Shammon
> > > >
> > > > I'm not entirely sure what "config file" you're referring to. You can, of
> > > > course, override the default parallelism in "flink-conf.yaml", but for
> > > > sinks and sources, the parallelism needs to be tweaked on the connector
> > > > level ("WITH" statement).
> > > >
> > > > This is something that should be achieved with tooling around Flink. We
> > > > want to provide an API on the lowest level that generalizes well. Achieving
> > > > what you're describing should be straightforward with this API.
> > > >
> > > > @Xiangyu
> > > >
> > > > Is it possible for this REST API to declare TM resources in the future?
> > > >
> > > >
> > > > Would you like to add/remove TMs if you use an active Resource Manager?
> > > > This would be out of the scope of this effort since it targets the
> > > > scheduler component only (we make no assumptions about the used Resource
> > > > Manager). Also, the AdaptiveScheduler is only intended to be used for
> > > > Streaming.
> > > >
> > > >  And for streaming jobs, I'm wondering if there is any situation we need to
> > > > > rescale the TM resources of a flink cluster at first and then the
> > > > adaptive
> > > > > scheduler will rescale the per-vertex ResourceProfiles accordingly.
> > > > >
> > > >
> > > > We plan on adding support for the ResourceProfiles (dynamic slot
> > > > allocation) as the next step. Again we won't make any assumptions about the
> > > > used Resource Manager. In other words, this effort ends by declaring
> > > > desired resources to the Resource Manager.
> > > >
> > > > Does that make sense?
> > > >
> > > > @Matthias
> > > >
> > > > We've done another pass on the proposed API and currently lean towards
> > > > having an idempotent PUT API.
> > > > - We don't care too much about multiple writers' scenarios in terms of who
> > > > can write an authoritative payload; this is up to the user of the API to
> > > > figure out
> > > > - It's indeed tricky to achieve atomicity with PATCH API; switching to PUT
> > > > API seems to do the trick
> > > > - We won't allow partial "payloads" anymore, meaning you need to define
> > > > requirements for all vertices in the JobGraph; This is completely fine for
> > > > the programmatic workflows. For DEBUG / DEMO purposes, you can use the GET
> > > > endpoint and tweak the response to avoid writing the whole payload by hand.
> > > >
> > > > WDYT?
> > > >
> > > >
> > > > Best,
> > > > D.
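The idempotent PUT flow described above can be sketched from a client's point of view. Everything below is illustrative: the vertex IDs and helper names are made up, and the payload shape is an assumption based on this thread rather than a finalized API.

```python
# Illustrative sketch only: payload shape and helper names are assumptions
# based on this discussion, not a finalized Flink API.

def build_requirements(vertices):
    """Build a full PUT payload; the PUT semantics require every vertex
    in the JobGraph to be present (no partial payloads)."""
    return {
        vertex_id: {"parallelism": {"lowerBound": lower, "upperBound": upper}}
        for vertex_id, (lower, upper) in vertices.items()
    }

def validate_requirements(payload, job_vertices):
    """Reject partial payloads and nonsensical bounds before sending."""
    missing = set(job_vertices) - set(payload)
    if missing:
        raise ValueError(f"missing requirements for vertices: {sorted(missing)}")
    for vertex_id, spec in payload.items():
        bounds = spec["parallelism"]
        if not 1 <= bounds["lowerBound"] <= bounds["upperBound"]:
            raise ValueError(f"invalid bounds for {vertex_id}: {bounds}")
    return payload

# Example: pin the source's parallelism, leave the mapper elastic.
payload = build_requirements({
    "vertex-source": (2, 2),  # lower == upper: fixed parallelism
    "vertex-map": (1, 8),     # elastic range for the scheduler to navigate
})
validate_requirements(payload, ["vertex-source", "vertex-map"])
```

An external tool would then send this payload with an HTTP PUT against the job's resource-requirements endpoint and can safely retry, since PUT is idempotent.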
> > > >
> > > > On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <xi...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi David,
> > > > >
> > > > > Thanks for creating this FLIP. I think this work is very useful,
> > > > > especially in the autoscaling scenario. I would like to share some questions
> > > > > from my view.
> > > > >
> > > > > 1, Is it possible for this REST API to declare TM resources in the
> > > > future?
> > > > > I'm asking because we are building the autoscaling feature for Flink OLAP
> > > > > Session Cluster in ByteDance. We need to rescale the cluster's resource
> > > > on
> > > > > TM level instead of Job level. It would be very helpful if we have a REST
> > > > > API for our external Autoscaling service to use.
> > > > >
> > > > > 2, And for streaming jobs, I'm wondering if there is any situation we
> > > > need
> > > > > to rescale the TM resources of a flink cluster at first and then the
> > > > > adaptive scheduler will rescale the per-vertex ResourceProfiles
> > > > > accordingly.
> > > > >
> > > > > best.
> > > > > Xiangyu
> > > > >
> > > > > Shammon FY <zj...@gmail.com> wrote on Thu, Feb 9, 2023 at 11:31:
> > > > >
> > > > > > Hi David
> > > > > >
> > > > > > Thanks for your answer.
> > > > > >
> > > > > > > Can you elaborate more about how you'd intend to use the endpoint? I
> > > > > > think we can ultimately introduce a way of re-declaring "per-vertex
> > > > > > defaults," but I'd like to understand the use case a bit more first.
> > > > > >
> > > > > > For this issue, I am mainly concerned with the consistency between the
> > > > > > user configuration and the job runtime. For SQL jobs, users usually set
> > > > > > specific parallelism for
> > > > > > source and sink, and set a global parallelism for other operators.
> > > > These
> > > > > > config items are stored in a config file. For some high-priority jobs,
> > > > > > users may want to manage them manually.
> > > > > > 1. When users need to scale the parallelism, they should update the
> > > > > > config file and restart the Flink job, which may take a long time.
> > > > > > 2. After providing the REST API, users can simply send a request to the
> > > > > > job via the REST API right after updating the config file.
> > > > > > The configuration in the running job and config file should be the
> > > > same.
> > > > > > What do you think of this?
> > > > > >
> > > > > > best.
> > > > > > Shammon
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <da...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > Let's try to answer the questions one by one.
> > > > > > >
> > > > > > > *@ConradJam*
> > > > > > >
> > > > > > > > when the number of "slots" is insufficient, can we stop users from
> > > > > > > > rescaling, or return something telling the user "not enough available
> > > > > > > > slots to upgrade, please check your available slots"?
> > > > > > > >
> > > > > > >
> > > > > > > The main property of AdaptiveScheduler is that it can adapt to
> > > > > "available
> > > > > > > resources," which means you're still able to make progress even
> > > > though
> > > > > > you
> > > > > > > didn't get all the slots you've asked for. Let's break down the pros
> > > > > and
> > > > > > > cons of this property.
> > > > > > >
> > > > > > > - (plus) If you lose a TM for some reason, you can still recover even
> > > > > if
> > > > > > it
> > > > > > > doesn't come back. We still need to give it some time to eliminate
> > > > > > > unnecessary rescaling, which can be controlled by setting
> > > > > > > "resource-stabilization-timeout."
> > > > > > > - (plus) The resources can arrive with a significant delay. For
> > > > > example,
> > > > > > > you're unable to spawn enough TMs on time because you've run out of
> > > > > > > resources in your k8s cluster, and you need to wait for the cluster
> > > > > auto
> > > > > > > scaler to kick in and add new nodes to the cluster. In this scenario,
> > > > > > > you'll be able to start making progress faster, at the cost of
> > > > multiple
> > > > > > > rescalings (once the remaining resources arrive).
> > > > > > > - (plus) This plays well with the declarative manner of today's
> > > > > > > infrastructure. For example, you tell k8s that you need 10 TMs, and
> > > > > > you'll
> > > > > > > eventually get them.
> > > > > > > - (minus) In the case of large state jobs, the cost of multiple
> > > > > > rescalings
> > > > > > > might outweigh the above.
> > > > > > >
> > > > > > > We've already touched on the solution to this problem on the FLIP.
> > > > > Please
> > > > > > > notice the parallelism knob being a range with a lower and upper
> > > > bound.
> > > > > > > Setting both the lower and upper bound to the same value could give
> > > > the
> > > > > > > behavior you're describing at the cost of giving up some properties
> > > > > that
> > > > > > AS
> > > > > > > gives you (you'd be falling back to the DefaultScheduler's behavior).
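The trade-off described above can be reduced to a tiny model. This is a simplification of my reading of the behavior (the real scheduler also applies a stabilization timeout), and the function name is illustrative:

```python
def can_run(lower_bound, upper_bound, available_slots):
    """A job can start once the lower bound is satisfied; it uses at most
    upper_bound slots. Simplified model of the AdaptiveScheduler's decision."""
    if available_slots < lower_bound:
        return None  # keep waiting for resources
    return min(upper_bound, available_slots)

# Elastic range [2, 8]: starts degraded with 3 slots, scales up as more arrive.
assert can_run(2, 8, 1) is None
assert can_run(2, 8, 3) == 3
assert can_run(2, 8, 10) == 8
# Fixed parallelism (lower == upper) behaves all-or-nothing, like the DefaultScheduler.
assert can_run(4, 4, 3) is None
assert can_run(4, 4, 4) == 4
```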
> > > > > > >
> > > > > > > > when the user upgrades job-vertex parallelism, I want to have an
> > > > > > > > interface to query the current status of the parallelism update, so
> > > > > > > > that the user or a program can understand the current state
> > > > > > > >
> > > > > > >
> > > > > > > This is a misunderstanding. We're not introducing the RESCALE
> > > > endpoint.
> > > > > > > This endpoint allows you to re-declare the resources needed to run
> > > > the
> > > > > > job.
> > > > > > > Once you reach the desired resources (you get more resources than the
> > > > > > lower
> > > > > > > bound defines), your job will run.
> > > > > > >
> > > > > > > We can expose a similar endpoint to "resource requirements" to give
> > > > you
> > > > > > an
> > > > > > > overview of the resources the vertices already have. You can already
> > > > > get
> > > > > > > this from the REST API, so exposing this in yet another way should be
> > > > > > > considered carefully.
> > > > > > >
> > > > > > > *@Matthias*
> > > > > > >
> > > > > > > I'm wondering whether it makes sense to add some kind of resource ID
> > > > to
> > > > > > the
> > > > > > > > REST API.
> > > > > > >
> > > > > > >
> > > > > > > That's a good question. I want to think about that and get back to
> > > > the
> > > > > > > question later. My main struggle when thinking about this is, "if
> > > > this
> > > > > > > would be an idempotent POST endpoint," would it be any different?
> > > > > > >
> > > > > > > How often do we allow resource requirements to be changed?
> > > > > > >
> > > > > > >
> > > > > > > There will be no rate limiting on the Flink side. If this is something
> > > > > > > your environment needs, you can achieve it on a different layer ("we
> > > > > > > can't have Flink do everything").
> > > > > > >
> > > > > > > Versioning the JobGraph in the JobGraphStore rather than overwriting
> > > > it
> > > > > > > > might be an idea.
> > > > > > > >
> > > > > > >
> > > > > > > This sounds interesting since it would be closer to the JobGraph
> > > > being
> > > > > > > immutable. The main problem I see here is that this would introduce a
> > > > > > > backward-incompatible change, so it might be a topic for a follow-up FLIP.
> > > > > > >
> > > > > > > I'm just wondering whether we bundle two things together that are
> > > > > > actually
> > > > > > > > separate
> > > > > > > >
> > > > > > >
> > > > > > > Yup, this is how we think about it as well. The main question is,
> > > > "who
> > > > > > > should be responsible for bookkeeping 1) the JobGraph and 2) the
> > > > > > > JobResourceRequirements". The JobMaster would be the right place for
> > > > > > both,
> > > > > > > but it's currently not the case, and we're tightly coupling the
> > > > > > dispatcher
> > > > > > > with the JobMaster.
> > > > > > >
> > > > > > > Initially, we tried to introduce a separate HA component in JobMaster
> > > > > for
> > > > > > > bookkeeping the JobResourceRequirements, but that proved to be a more
> > > > > > > significant effort adding additional mess to the already messy HA
> > > > > > > ecosystem. Another approach we've discussed was mutating the JobGraph
> > > > > and
> > > > > > > setting JRR into the JobGraph structure itself.
> > > > > > >
> > > > > > > The middle ground for keeping this effort reasonably sized and not
> > > > > > > violating "we want to keep JG immutable" too much is keeping the
> > > > > > > JobResourceRequirements separate as an internal config option in
> > > > > > JobGraph's
> > > > > > > configuration.
> > > > > > >
> > > > > > > We ultimately need to rethink the tight coupling of Dispatcher and
> > > > > > > JobMaster, but it needs to be a separate effort.
> > > > > > >
> > > > > > > ...also considering the amount of data that can be stored in a
> > > > > > > > ConfigMap/ZooKeeper node if versioning the resource requirement
> > > > > change
> > > > > > as
> > > > > > > > proposed in my previous item is an option for us.
> > > > > > > >
> > > > > > >
> > > > > > > AFAIK we're only storing pointers to the S3 objects in HA metadata,
> > > > so
> > > > > we
> > > > > > > should be okay with having larger structures for now.
> > > > > > >
> > > > > > > Updating the JobGraphStore means adding more requests to the HA
> > > > backend
> > > > > > > API.
> > > > > > > >
> > > > > > >
> > > > > > > It's fine unless you intend to override the resource requirements a
> > > > few
> > > > > > > times per second.
> > > > > > >
> > > > > > > *@Shammon*
> > > > > > >
> > > > > > > How about adding some more information such as vertex type
> > > > > > > >
> > > > > > >
> > > > > > > Since it was intended as a "debug" endpoint, it makes complete sense!
> > > > > > >
> > > > > > > >  For SQL jobs, we always use a unified parallelism for most vertices.
> > > > > > > >  Can we provide a more convenient way to set it, instead of setting
> > > > > > > >  each one individually?
> > > > > > >
> > > > > > >
> > > > > > > I completely feel with this. The main thoughts when designing the API
> > > > > > were:
> > > > > > > - We want to keep it clean and easy to understand.
> > > > > > > - Global parallelism can be modeled using per-vertex parallelism but
> > > > > not
> > > > > > > the other way around.
> > > > > > > - The API will be used by external tooling (operator, auto scaler).
> > > > > > >
> > > > > > > Can you elaborate more about how you'd intend to use the endpoint? I
> > > > > > think
> > > > > > > we can ultimately introduce a way of re-declaring "per-vertex
> > > > > > > defaults," but I'd like to understand the use case a bit more first.
> > > > > > >
> > > > > > > *@Weijie*
> > > > > > >
> > > > > > > What is the default value here (based on what configuration), or just
> > > > > > > > infinite?
> > > > > > > >
> > > > > > >
> > > > > > > Currently, for the lower bound, it's always one, and for the upper
> > > > > bound,
> > > > > > > it's either parallelism (if defined) or the maxParallelism of the
> > > > > vertex
> > > > > > in
> > > > > > > JobGraph. This question might be another signal for making the
> > > > defaults
> > > > > > > explicit (see the answer to Shammon's question above).
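The defaults described above can be sketched as follows. This is a simplification of my reading of the current behavior, and the function name is illustrative:

```python
def default_bounds(parallelism, max_parallelism):
    """Default per-vertex bounds as described above: the lower bound is always 1;
    the upper bound is the configured parallelism if set, else maxParallelism."""
    upper = parallelism if parallelism is not None else max_parallelism
    return (1, upper)

assert default_bounds(4, 128) == (1, 4)       # explicit parallelism wins
assert default_bounds(None, 128) == (1, 128)  # fall back to maxParallelism
```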
> > > > > > >
> > > > > > >
> > > > > > > Thanks, everyone, for your initial thoughts!
> > > > > > >
> > > > > > > Best,
> > > > > > > D.
> > > > > > >
> > > > > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <guoweijiereswqa@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks David for driving this. This is very valuable work,
> > > > > > > > especially for cloud-native environments.
> > > > > > > >
> > > > > > > > >> How about adding some more information such as vertex type
> > > > > > > > (SOURCE/MAP/JOIN, etc.) in the response of `get jobs
> > > > > > > > resource-requirements`? For users, only vertex-id may be difficult
> > > > to
> > > > > > > > understand.
> > > > > > > >
> > > > > > > > +1 for this suggestion, including jobvertex's name in the response
> > > > > body
> > > > > > > is
> > > > > > > > more
> > > > > > > > user-friendly.
> > > > > > > >
> > > > > > > >
> > > > > > > > I saw this sentence in FLIP: "Setting the upper bound to -1 will
> > > > > reset
> > > > > > > the
> > > > > > > > value to the default setting."  What is the default value here
> > > > (based
> > > > > > on
> > > > > > > > what configuration), or just infinite?
> > > > > > > >
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > >
> > > > > > > > Weijie
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Shammon FY <zj...@gmail.com> wrote on Mon, Feb 6, 2023 at 18:06:
> > > > > > > >
> > > > > > > > > Hi David
> > > > > > > > >
> > > > > > > > > Thanks for initiating this discussion. I think declaring job
> > > > > resource
> > > > > > > > > requirements by REST API is very valuable. I just left some
> > > > > comments
> > > > > > as
> > > > > > > > > followed
> > > > > > > > >
> > > > > > > > > 1) How about adding some more information such as vertex type
> > > > > > > > > (SOURCE/MAP/JOIN, etc.) in the response of `get jobs
> > > > > > > > > resource-requirements`? For users, only vertex-id may be
> > > > difficult
> > > > > to
> > > > > > > > > understand.
> > > > > > > > >
> > > > > > > > > 2) For SQL jobs, we always use a unified parallelism for most
> > > > > > > > > vertices. Can we provide a more convenient way to set it, instead of
> > > > > > > > > setting each one individually?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Shammon
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <
> > > > > matthias.pohl@aiven.io
> > > > > > > > > .invalid>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks David for creating this FLIP. It sounds promising and
> > > > > useful
> > > > > > > to
> > > > > > > > > > have. Here are some thoughts from my side (some of them might
> > > > be
> > > > > > > > rather a
> > > > > > > > > > follow-up and not necessarily part of this FLIP):
> > > > > > > > > > - I'm wondering whether it makes sense to add some kind of
> > > > > resource
> > > > > > > ID
> > > > > > > > to
> > > > > > > > > > the REST API. This would give Flink a tool to verify the PATCH
> > > > > > > request
> > > > > > > > of
> > > > > > > > > > the external system in a compare-and-set kind of manner. AFAIU,
> > > > > the
> > > > > > > > > process
> > > > > > > > > > requires the external system to retrieve the resource
> > > > > requirements
> > > > > > > > first
> > > > > > > > > > (to retrieve the vertex IDs). A resource ID <ABC> would be sent
> > > > > > along
> > > > > > > > as
> > > > > > > > > a
> > > > > > > > > > unique identifier for the provided setup. It's essentially the
> > > > > > > version
> > > > > > > > ID
> > > > > > > > > > of the currently deployed resource requirement configuration.
> > > > > Flink
> > > > > > > > > doesn't
> > > > > > > > > > know whether the external system would use the provided
> > > > > information
> > > > > > > in
> > > > > > > > > some
> > > > > > > > > > way to derive a new set of resource requirements for this job.
> > > > > The
> > > > > > > > > > subsequent PATCH request with updated resource requirements
> > > > would
> > > > > > > > include
> > > > > > > > > > the previously retrieved resource ID <ABC>. The PATCH call
> > > > would
> > > > > > fail
> > > > > > > > if
> > > > > > > > > > there was a concurrent PATCH call in between indicating to the
> > > > > > > external
> > > > > > > > > > system that the resource requirements were concurrently
> > > > updated.
> > > > > > > > > > - How often do we allow resource requirements to be changed?
> > > > That
> > > > > > > > > question
> > > > > > > > > > might make my previous comment on the resource ID obsolete
> > > > > because
> > > > > > we
> > > > > > > > > could
> > > > > > > > > > just make any PATCH call fail if there was a resource
> > > > requirement
> > > > > > > > update
> > > > > > > > > > within a certain time frame before the request. But such a time
> > > > > > > period
> > > > > > > > is
> > > > > > > > > > something we might want to make configurable then, I guess.
> > > > > > > > > > - Versioning the JobGraph in the JobGraphStore rather than
> > > > > > > overwriting
> > > > > > > > it
> > > > > > > > > > might be an idea. This would enable us to provide resource
> > > > > > > requirement
> > > > > > > > > > changes in the UI or through the REST API. It is related to a
> > > > > > problem
> > > > > > > > > > around keeping track of the exception history within the
> > > > > > > > > AdaptiveScheduler
> > > > > > > > > > and also having to consider multiple versions of a JobGraph.
> > > > But
> > > > > > for
> > > > > > > > that
> > > > > > > > > > one, we use the ExecutionGraphInfoStore right now.
> > > > > > > > > > - Updating the JobGraph in the JobGraphStore makes sense. I'm
> > > > > just
> > > > > > > > > > wondering whether we bundle two things together that are
> > > > actually
> > > > > > > > > separate:
> > > > > > > > > > The business logic and the execution configuration (the
> > > > resource
> > > > > > > > > > requirements). I'm aware that this is not a flaw of the current
> > > > > > FLIP
> > > > > > > > but
> > > > > > > > > > rather something that was not necessary to address in the past
> > > > > > > because
> > > > > > > > > the
> > > > > > > > > > JobGraph was kind of static. I don't remember whether that was
> > > > > > > already
> > > > > > > > > > discussed while working on the AdaptiveScheduler for FLIP-160
> > > > > [1].
> > > > > > > > Maybe,
> > > > > > > > > > I'm missing some functionality here that requires us to have
> > > > > > > everything
> > > > > > > > > in
> > > > > > > > > > one place. But it feels like updating the entire JobGraph which
> > > > > > could
> > > > > > > > be
> > > > > > > > > > actually a "config change" is not reasonable. ...also
> > > > considering
> > > > > > the
> > > > > > > > > > amount of data that can be stored in a ConfigMap/ZooKeeper node
> > > > > if
> > > > > > > > > > versioning the resource requirement change as proposed in my
> > > > > > previous
> > > > > > > > > item
> > > > > > > > > > is an option for us.
> > > > > > > > > > - Updating the JobGraphStore means adding more requests to the
> > > > HA
> > > > > > > > backend
> > > > > > > > > > API. There were some concerns shared in the discussion thread
> > > > [2]
> > > > > > for
> > > > > > > > > > FLIP-270 [3] on pressuring the k8s API server in the past with
> > > > > too
> > > > > > > many
> > > > > > > > > > calls. Even though it's more likely to be caused by checkpointing, I
> > > > > > > > > > still wanted to bring it up. We're working on a standardized
> > > > > > > > > > performance test to prepare going forward with FLIP-270 [3] right now.
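The compare-and-set scheme suggested in the first bullet above could be sketched like this. The names and semantics are hypothetical (the thread later converged on an idempotent PUT instead): each accepted update bumps a resource ID, and a write carrying a stale ID fails, signalling a concurrent update.

```python
class RequirementsStore:
    """Hypothetical sketch of the compare-and-set update: each accepted
    update bumps a resource ID, and a stale ID makes the write fail."""

    def __init__(self):
        self.resource_id = 0
        self.requirements = {}

    def get(self):
        # The external system retrieves both the requirements and the
        # resource ID it must echo back on its next update.
        return self.resource_id, dict(self.requirements)

    def put(self, expected_resource_id, new_requirements):
        if expected_resource_id != self.resource_id:
            raise RuntimeError("conflict: requirements were concurrently updated")
        self.requirements = dict(new_requirements)
        self.resource_id += 1
        return self.resource_id

store = RequirementsStore()
rid, _ = store.get()
store.put(rid, {"v1": (1, 4)})      # succeeds and bumps the resource ID
try:
    store.put(rid, {"v1": (2, 8)})  # stale ID: rejected as a concurrent update
except RuntimeError:
    pass
```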
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Matthias
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> > > > > > > > > > [2]
> > > > > > https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> > > > > > > > > > [3]
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <jam.gzczy@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi David:
> > > > > > > > > > >
> > > > > > > > > > > Thank you for driving this FLIP, which helps reduce Flink
> > > > > > > > > > > downtime.
> > > > > > > > > > >
> > > > > > > > > > > For this FLIP, I would like to share a few ideas:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >    - when the number of "slots" is insufficient, can we stop users
> > > > > > > > > > >    from rescaling or return something telling the user "not enough
> > > > > > > > > > >    available slots to upgrade, please check your available slots"?
> > > > > > > > > > >    Or we could have a request switch (true/false) to allow this
> > > > > > > > > > >    behavior
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >    - when the user upgrades job-vertex parallelism, I want to have
> > > > > > > > > > >    an interface to query the current status of the parallelism
> > > > > > > > > > >    update, so that the user or a program can understand the current
> > > > > > > > > > >    state. This would also help tooling such as the *[1] Flink K8s
> > > > > > > > > > >    Operator*
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > {
> > > > > > > > > > >   status: Failed
> > > > > > > > > > >   reason: "not enough available slots to upgrade, please check
> > > > > > > > > > >   your available slots"
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >    - *Pending*: the job has joined the upgrade queue and will be
> > > > > > > > > > >    updated later
> > > > > > > > > > >    - *Rescaling*: the job is currently rescaling; wait for it to
> > > > > > > > > > >    finish
> > > > > > > > > > >    - *Finished*: the rescale is done
> > > > > > > > > > >    - *Failed*: something went wrong, so the job cannot be upgraded
> > > > > > > > > > >
> > > > > > > > > > > I would like to add the above to the FLIP; what do you think?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >    1.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > > https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > David Morávek <dm...@apache.org> wrote on Fri, Feb 3, 2023 at 16:42:
> > > > > > > > > > >
> > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > >
> > > > > > > > > > > > This FLIP [1] introduces a new REST API for declaring
> > > > > resource
> > > > > > > > > > > requirements
> > > > > > > > > > > > for the Adaptive Scheduler. There seems to be a clear need
> > > > > for
> > > > > > > this
> > > > > > > > > API
> > > > > > > > > > > > based on the discussion on the "Reworking the Rescale API"
> > > > > [2]
> > > > > > > > > thread.
> > > > > > > > > > > >
> > > > > > > > > > > > Before we get started, this work is heavily based on the
> > > > > > > prototype
> > > > > > > > > [3]
> > > > > > > > > > > > created by Till Rohrmann, and the FLIP is being published
> > > > > with
> > > > > > > his
> > > > > > > > > > > consent.
> > > > > > > > > > > > Big shoutout to him!
> > > > > > > > > > > >
> > > > > > > > > > > > Last and not least, thanks to Chesnay and Roman for the
> > > > > initial
> > > > > > > > > reviews
> > > > > > > > > > > and
> > > > > > > > > > > > discussions.
> > > > > > > > > > > >
> > > > > > > > > > > > The best start would be watching a short demo [4] that I've
> > > > > > > > recorded,
> > > > > > > > > > > which
> > > > > > > > > > > > illustrates newly added capabilities (rescaling the running
> > > > > > job,
> > > > > > > > > > handing
> > > > > > > > > > > > back resources to the RM, and session cluster support).
> > > > > > > > > > > >
> > > > > > > > > > > > The intuition behind the FLIP is being able to define
> > > > > resource
> > > > > > > > > > > requirements
> > > > > > > > > > > > ("resource boundaries") externally that the
> > > > AdaptiveScheduler
> > > > > > can
> > > > > > > > > > > navigate
> > > > > > > > > > > > within. This is a building block for higher-level efforts
> > > > > such
> > > > > > as
> > > > > > > > an
> > > > > > > > > > > > external Autoscaler. The natural extension of this work
> > > > would
> > > > > > be
> > > > > > > to
> > > > > > > > > > allow
> > > > > > > > > > > > to specify per-vertex ResourceProfiles.
> > > > > > > > > > > >
> > > > > > > > > > > > Looking forward to your thoughts; any feedback is
> > > > > appreciated!
> > > > > > > > > > > >
> > > > > > > > > > > > [1]
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > > > > > > > > > [2]
> > > > > > > > https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > > > > > > > > > > > [3] https://github.com/tillrohrmann/flink/tree/autoscaling
> > > > > > > > > > > > [4]
> > > > > > > > > > >
> > > > > > > >
> > > > > https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > D.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Best
> > > > > > > > > > >
> > > > > > > > > > > ConradJam
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by Maximilian Michels <mx...@apache.org>.
Based on further discussion I had with Chesnay on this PR [1], I think
jobs would currently go into a restarting state after the resource
requirements have changed. This wouldn't achieve what we had in mind,
i.e. sticking to the old resource requirements until enough slots are
available to fulfil the new resource requirements. So this may not be
100% what we need but it could be extended to do what we want.

-Max

[1] https://github.com/apache/flink/pull/21908#discussion_r1104792362

On Mon, Feb 13, 2023 at 7:16 PM Maximilian Michels <mx...@apache.org> wrote:
>
> Hi David,
>
> This is awesome! Great writeup and demo. This is pretty much what we
> need for the autoscaler as part of the Flink Kubernetes operator [1].
> Scaling Flink jobs effectively is hard but fortunately we have solved
> the issue as part of the Flink Kubernetes operator. The only critical
> piece we are missing is a better way to execute scaling decisions, as
> discussed in [2].
>
> Looking at your proposal, we would set lowerBound == upperBound for
> the parallelism because we want to fully determine the parallelism
> externally based on the scaling metrics. Does that sound right?
>
> What is the timeline for these changes? Is there a JIRA?
>
> Cheers,
> Max
>
> [1] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
> [2] https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
>
> On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu <xi...@gmail.com> wrote:
> >
> > Hi David,
> >
> > Thanks for your reply. I think your response makes total sense. This
> > FLIP targets declaring required resources to the ResourceManager instead of
> > using the ResourceManager to add/remove TMs directly.
> >
> > Best,
> > Xiangyu
> >
> >
> >
> > David Morávek <da...@gmail.com> wrote on Mon, Feb 13, 2023 at 15:46:
> >
> > > Hi everyone,
> > >
> > > @Shammon
> > >
> > > I'm not entirely sure what "config file" you're referring to. You can, of
> > > course, override the default parallelism in "flink-conf.yaml", but for
> > > sinks and sources, the parallelism needs to be tweaked on the connector
> > > level ("WITH" statement).
> > >
> > > This is something that should be achieved with tooling around Flink. We
> > > want to provide an API on the lowest level that generalizes well. Achieving
> > > what you're describing should be straightforward with this API.
> > >
> > > @Xiangyu
> > >
> > > Is it possible for this REST API to declare TM resources in the future?
> > >
> > >
> > > Would you like to add/remove TMs if you use an active Resource Manager?
> > > This would be out of the scope of this effort since it targets the
> > > scheduler component only (we make no assumptions about the used Resource
> > > Manager). Also, the AdaptiveScheduler is only intended to be used for
> > > Streaming.
> > >
> > >  And for streaming jobs, I'm wondering if there is any situation we need to
> > > > rescale the TM resources of a flink cluster at first and then the
> > > adaptive
> > > > scheduler will rescale the per-vertex ResourceProfiles accordingly.
> > > >
> > >
> > > We plan on adding support for the ResourceProfiles (dynamic slot
> > > allocation) as the next step. Again we won't make any assumptions about the
> > > used Resource Manager. In other words, this effort ends by declaring
> > > desired resources to the Resource Manager.
> > >
> > > Does that make sense?
> > >
> > > @Matthias
> > >
> > > We've done another pass on the proposed API and currently lean towards
> > > having an idempotent PUT API.
> > > - We don't care too much about multiple writers' scenarios in terms of who
> > > can write an authoritative payload; this is up to the user of the API to
> > > figure out
> > > - It's indeed tricky to achieve atomicity with PATCH API; switching to PUT
> > > API seems to do the trick
> > > - We won't allow partial "payloads" anymore, meaning you need to define
> > > requirements for all vertices in the JobGraph; This is completely fine for
> > > the programmatic workflows. For DEBUG / DEMO purposes, you can use the GET
> > > endpoint and tweak the response to avoid writing the whole payload by hand.
> > >
> > > WDYT?
> > >
> > >
> > > Best,
> > > D.
> > >
> > > On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <xi...@gmail.com>
> > > wrote:
> > >
> > > > Hi David,
> > > >
> > > > Thanks for creating this FLIP. I think this work is very useful,
> > > > especially in the autoscaling scenario. I would like to share some questions
> > > > from my view.
> > > >
> > > > 1, Is it possible for this REST API to declare TM resources in the
> > > future?
> > > > I'm asking because we are building the autoscaling feature for Flink OLAP
> > > > Session Cluster in ByteDance. We need to rescale the cluster's resource
> > > on
> > > > TM level instead of Job level. It would be very helpful if we have a REST
> > > > API for our external Autoscaling service to use.
> > > >
> > > > 2, And for streaming jobs, I'm wondering if there is any situation we
> > > need
> > > > to rescale the TM resources of a flink cluster at first and then the
> > > > adaptive scheduler will rescale the per-vertex ResourceProfiles
> > > > accordingly.
> > > >
> > > > best.
> > > > Xiangyu
> > > >
> > > > Shammon FY <zj...@gmail.com> wrote on Thu, Feb 9, 2023 at 11:31:
> > > >
> > > > > Hi David
> > > > >
> > > > > Thanks for your answer.
> > > > >
> > > > > > Can you elaborate more about how you'd intend to use the endpoint? I
> > > > > think we can ultimately introduce a way of re-declaring "per-vertex
> > > > > defaults," but I'd like to understand the use case a bit more first.
> > > > >
> > > > > For this issue, I mainly consider the consistency of user configuration
> > > > and
> > > > > job runtime. For sql jobs, users usually set specific parallelism for
> > > > > source and sink, and set a global parallelism for other operators.
> > > These
> > > > > config items are stored in a config file. For some high-priority jobs,
> > > > > users may want to manage them manually.
> > > > > 1. When users need to scale the parallelism, they should update the
> > > > > config file and restart the Flink job, which may take a long time.
> > > > > 2. After providing the REST API, users can just send a request to the
> > > job
> > > > > via REST API quickly after updating the config file.
> > > > > The configuration in the running job and config file should be the
> > > same.
> > > > > What do you think of this?
> > > > >
> > > > > best.
> > > > > Shammon
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <da...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > Let's try to answer the questions one by one.
> > > > > >
> > > > > > *@ConradJam*
> > > > > >
> > > > > > > when the number of "slots" is insufficient, can we stop users from
> > > > > > > rescaling or throw something to tell the user "not enough available
> > > > > > > slots to upgrade, please check your available slots"?
> > > > > > >
> > > > > >
> > > > > > The main property of AdaptiveScheduler is that it can adapt to
> > > > "available
> > > > > > resources," which means you're still able to make progress even
> > > though
> > > > > you
> > > > > > didn't get all the slots you've asked for. Let's break down the pros
> > > > and
> > > > > > cons of this property.
> > > > > >
> > > > > > - (plus) If you lose a TM for some reason, you can still recover even
> > > > if
> > > > > it
> > > > > > doesn't come back. We still need to give it some time to eliminate
> > > > > > unnecessary rescaling, which can be controlled by setting
> > > > > > "resource-stabilization-timeout."
> > > > > > - (plus) The resources can arrive with a significant delay. For
> > > > example,
> > > > > > you're unable to spawn enough TMs on time because you've run out of
> > > > > > resources in your k8s cluster, and you need to wait for the cluster
> > > > auto
> > > > > > scaler to kick in and add new nodes to the cluster. In this scenario,
> > > > > > you'll be able to start making progress faster, at the cost of
> > > multiple
> > > > > > rescalings (once the remaining resources arrive).
> > > > > > - (plus) This plays well with the declarative manner of today's
> > > > > > infrastructure. For example, you tell k8s that you need 10 TMs, and
> > > > > you'll
> > > > > > eventually get them.
> > > > > > - (minus) In the case of large state jobs, the cost of multiple
> > > > > rescalings
> > > > > > might outweigh the above.
> > > > > >
> > > > > > We've already touched on the solution to this problem on the FLIP.
> > > > Please
> > > > > > notice the parallelism knob being a range with a lower and upper
> > > bound.
> > > > > > Setting both the lower and upper bound to the same value could give
> > > the
> > > > > > behavior you're describing at the cost of giving up some properties
> > > > that
> > > > > AS
> > > > > > gives you (you'd be falling back to the DefaultScheduler's behavior).
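The range semantics described above (the job may start once the lower bound is satisfied, and scales up toward the upper bound as resources arrive) can be sketched as a toy model. This is only an illustration of the idea, not Flink's actual implementation:

```python
# Toy model of the AdaptiveScheduler's per-vertex parallelism range:
# below the lower bound the job waits; otherwise it runs with whatever
# is available, capped at the upper bound.
def effective_parallelism(lower, upper, available_slots):
    if available_slots < lower:
        return None  # not enough resources yet; the job keeps waiting
    return min(upper, available_slots)

assert effective_parallelism(2, 8, 1) is None   # below lower bound: wait
assert effective_parallelism(2, 8, 5) == 5      # partial resources: run anyway
assert effective_parallelism(2, 8, 32) == 8     # capped at the upper bound
```

Setting `lower == upper` in this model reproduces the "all-or-nothing" behavior of the DefaultScheduler mentioned above.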
> > > > > >
> > > > > > > when users upgrade job-vertex parallelism, I want to have an
> > > > > > > interface to query the current status of the parallelism update, so
> > > > > > > that the user or program can understand the current status
> > > > > > >
> > > > > >
> > > > > > This is a misunderstanding. We're not introducing the RESCALE
> > > endpoint.
> > > > > > This endpoint allows you to re-declare the resources needed to run
> > > the
> > > > > job.
> > > > > > Once you reach the desired resources (you get more resources than the
> > > > > lower
> > > > > > bound defines), your job will run.
> > > > > >
> > > > > > We can expose a similar endpoint to "resource requirements" to give
> > > you
> > > > > an
> > > > > > overview of the resources the vertices already have. You can already
> > > > get
> > > > > > this from the REST API, so exposing this in yet another way should be
> > > > > > considered carefully.
> > > > > >
> > > > > > *@Matthias*
> > > > > >
> > > > > > I'm wondering whether it makes sense to add some kind of resource ID
> > > to
> > > > > the
> > > > > > > REST API.
> > > > > >
> > > > > >
> > > > > > That's a good question. I want to think about that and get back to
> > > the
> > > > > > question later. My main struggle when thinking about this is, "if
> > > this
> > > > > > would be an idempotent POST endpoint," would it be any different?
> > > > > >
> > > > > > How often do we allow resource requirements to be changed?
> > > > > >
> > > > > >
> > > > > > There shall be no rate limiting on the FLINK side. If this is
> > > something
> > > > > > your environment needs, you can achieve it on a different layer ("we
> > > > > can't
> > > > > > have FLINK to do everything").
> > > > > >
> > > > > > Versioning the JobGraph in the JobGraphStore rather than overwriting
> > > it
> > > > > > > might be an idea.
> > > > > > >
> > > > > >
> > > > > > This sounds interesting since it would be closer to the JobGraph
> > > being
> > > > > > immutable. The main problem I see here is that this would introduce a
> > > > > > BW-incompatible change so it might be a topic for follow-up FLIP.
> > > > > >
> > > > > > I'm just wondering whether we bundle two things together that are
> > > > > actually
> > > > > > > separate
> > > > > > >
> > > > > >
> > > > > > Yup, this is how we think about it as well. The main question is,
> > > "who
> > > > > > should be responsible for bookkeeping 1) the JobGraph and 2) the
> > > > > > JobResourceRequirements". The JobMaster would be the right place for
> > > > > both,
> > > > > > but it's currently not the case, and we're tightly coupling the
> > > > > dispatcher
> > > > > > with the JobMaster.
> > > > > >
> > > > > > Initially, we tried to introduce a separate HA component in JobMaster
> > > > for
> > > > > > bookkeeping the JobResourceRequirements, but that proved to be a more
> > > > > > significant effort adding additional mess to the already messy HA
> > > > > > ecosystem. Another approach we've discussed was mutating the JobGraph
> > > > and
> > > > > > setting JRR into the JobGraph structure itself.
> > > > > >
> > > > > > The middle ground for keeping this effort reasonably sized and not
> > > > > > violating "we want to keep JG immutable" too much is keeping the
> > > > > > JobResourceRequirements separate as an internal config option in
> > > > > JobGraph's
> > > > > > configuration.
> > > > > >
> > > > > > We ultimately need to rethink the tight coupling of Dispatcher and
> > > > > > JobMaster, but it needs to be a separate effort.
> > > > > >
> > > > > > ...also considering the amount of data that can be stored in a
> > > > > > > ConfigMap/ZooKeeper node if versioning the resource requirement
> > > > change
> > > > > as
> > > > > > > proposed in my previous item is an option for us.
> > > > > > >
> > > > > >
> > > > > > AFAIK we're only storing pointers to the S3 objects in HA metadata,
> > > so
> > > > we
> > > > > > should be okay with having larger structures for now.
> > > > > >
> > > > > > Updating the JobGraphStore means adding more requests to the HA
> > > backend
> > > > > > API.
> > > > > > >
> > > > > >
> > > > > > It's fine unless you intend to override the resource requirements a
> > > few
> > > > > > times per second.
> > > > > >
> > > > > > *@Shammon*
> > > > > >
> > > > > > How about adding some more information such as vertex type
> > > > > > >
> > > > > >
> > > > > > Since it was intended as a "debug" endpoint, it makes complete sense!
> > > > > >
> > > > > >  For sql jobs, we always use a unified parallelism for most vertices.
> > > > Can
> > > > > > > we provide them with a more convenient setting method instead of
> > > each
> > > > > > one?
> > > > > >
> > > > > >
> > > > > > I completely feel with this. The main thoughts when designing the API
> > > > > were:
> > > > > > - We want to keep it clean and easy to understand.
> > > > > > - Global parallelism can be modeled using per-vertex parallelism but
> > > > not
> > > > > > the other way around.
> > > > > > - The API will be used by external tooling (operator, auto scaler).
> > > > > >
> > > > > > Can you elaborate more about how you'd intend to use the endpoint? I
> > > > > think
> > > > > > we can ultimately introduce a way of re-declaring "per-vertex
> > > > defaults,"
> > > > > > but I'd like to understand the use case a bit more first.
> > > > > >
> > > > > > *@Weijie*
> > > > > >
> > > > > > What is the default value here (based on what configuration), or just
> > > > > > > infinite?
> > > > > > >
> > > > > >
> > > > > > Currently, for the lower bound, it's always one, and for the upper
> > > > bound,
> > > > > > it's either parallelism (if defined) or the maxParallelism of the
> > > > vertex
> > > > > in
> > > > > > JobGraph. This question might be another signal for making the
> > > defaults
> > > > > > explicit (see the answer to Shammon's question above).
> > > > > >
> > > > > >
> > > > > > Thanks, everyone, for your initial thoughts!
> > > > > >
> > > > > > Best,
> > > > > > D.
> > > > > >
> > > > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <guoweijiereswqa@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks David for driving this. This is a very valuable work,
> > > > especially
> > > > > > for
> > > > > > > cloud native environment.
> > > > > > >
> > > > > > > >> How about adding some more information such as vertex type
> > > > > > > (SOURCE/MAP/JOIN, etc.) in the response of `get jobs
> > > > > > > resource-requirements`? For users, only the vertex-id may be
> > > > > > > difficult to understand.
> > > > > > >
> > > > > > > +1 for this suggestion; including the JobVertex name in the
> > > > > > > response body is more user-friendly.
> > > > > > >
> > > > > > >
> > > > > > > I saw this sentence in FLIP: "Setting the upper bound to -1 will
> > > > reset
> > > > > > the
> > > > > > > value to the default setting."  What is the default value here
> > > (based
> > > > > on
> > > > > > > what configuration), or just infinite?
> > > > > > >
> > > > > > >
> > > > > > > Best regards,
> > > > > > >
> > > > > > > Weijie
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Feb 6, 2023 at 18:06, Shammon FY <zj...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi David
> > > > > > > >
> > > > > > > > Thanks for initiating this discussion. I think declaring job
> > > > resource
> > > > > > > > requirements by REST API is very valuable. I just left some
> > > > comments
> > > > > as
> > > > > > > > followed
> > > > > > > >
> > > > > > > > 1) How about adding some more information such as vertex type
> > > > > > > > (SOURCE/MAP/JOIN, etc.) in the response of `get jobs
> > > > > > > > resource-requirements`? For users, only the vertex-id may be
> > > > > > > > difficult to understand.
> > > > > > > >
> > > > > > > > 2) For sql jobs, we always use a unified parallelism for most
> > > > > vertices.
> > > > > > > Can
> > > > > > > > we provide them with a more convenient setting method instead of
> > > > each
> > > > > > > one?
> > > > > > > >
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Shammon
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <
> > > > matthias.pohl@aiven.io
> > > > > > > > .invalid>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks David for creating this FLIP. It sounds promising and
> > > > useful
> > > > > > to
> > > > > > > > > have. Here are some thoughts from my side (some of them might
> > > be
> > > > > > > rather a
> > > > > > > > > follow-up and not necessarily part of this FLIP):
> > > > > > > > > - I'm wondering whether it makes sense to add some kind of
> > > > resource
> > > > > > ID
> > > > > > > to
> > > > > > > > > the REST API. This would give Flink a tool to verify the PATCH
> > > > > > request
> > > > > > > of
> > > > > > > > > the external system in a compare-and-set kind of manner. AFAIU,
> > > > the
> > > > > > > > process
> > > > > > > > > requires the external system to retrieve the resource
> > > > requirements
> > > > > > > first
> > > > > > > > > (to retrieve the vertex IDs). A resource ID <ABC> would be sent
> > > > > along
> > > > > > > as
> > > > > > > > a
> > > > > > > > > unique identifier for the provided setup. It's essentially the
> > > > > > version
> > > > > > > ID
> > > > > > > > > of the currently deployed resource requirement configuration.
> > > > Flink
> > > > > > > > doesn't
> > > > > > > > > know whether the external system would use the provided
> > > > information
> > > > > > in
> > > > > > > > some
> > > > > > > > > way to derive a new set of resource requirements for this job.
> > > > The
> > > > > > > > > subsequent PATCH request with updated resource requirements
> > > would
> > > > > > > include
> > > > > > > > > the previously retrieved resource ID <ABC>. The PATCH call
> > > would
> > > > > fail
> > > > > > > if
> > > > > > > > > there was a concurrent PATCH call in between indicating to the
> > > > > > external
> > > > > > > > > system that the resource requirements were concurrently
> > > updated.
> > > > > > > > > - How often do we allow resource requirements to be changed?
> > > That
> > > > > > > > question
> > > > > > > > > might make my previous comment on the resource ID obsolete
> > > > because
> > > > > we
> > > > > > > > could
> > > > > > > > > just make any PATCH call fail if there was a resource
> > > requirement
> > > > > > > update
> > > > > > > > > within a certain time frame before the request. But such a time
> > > > > > period
> > > > > > > is
> > > > > > > > > something we might want to make configurable then, I guess.
> > > > > > > > > - Versioning the JobGraph in the JobGraphStore rather than
> > > > > > overwriting
> > > > > > > it
> > > > > > > > > might be an idea. This would enable us to provide resource
> > > > > > requirement
> > > > > > > > > changes in the UI or through the REST API. It is related to a
> > > > > problem
> > > > > > > > > around keeping track of the exception history within the
> > > > > > > > AdaptiveScheduler
> > > > > > > > > and also having to consider multiple versions of a JobGraph.
> > > But
> > > > > for
> > > > > > > that
> > > > > > > > > one, we use the ExecutionGraphInfoStore right now.
> > > > > > > > > - Updating the JobGraph in the JobGraphStore makes sense. I'm
> > > > just
> > > > > > > > > wondering whether we bundle two things together that are
> > > actually
> > > > > > > > separate:
> > > > > > > > > The business logic and the execution configuration (the
> > > resource
> > > > > > > > > requirements). I'm aware that this is not a flaw of the current
> > > > > FLIP
> > > > > > > but
> > > > > > > > > rather something that was not necessary to address in the past
> > > > > > because
> > > > > > > > the
> > > > > > > > > JobGraph was kind of static. I don't remember whether that was
> > > > > > already
> > > > > > > > > discussed while working on the AdaptiveScheduler for FLIP-160
> > > > [1].
> > > > > > > Maybe,
> > > > > > > > > I'm missing some functionality here that requires us to have
> > > > > > everything
> > > > > > > > in
> > > > > > > > > one place. But it feels like updating the entire JobGraph which
> > > > > could
> > > > > > > be
> > > > > > > > > actually a "config change" is not reasonable. ...also
> > > considering
> > > > > the
> > > > > > > > > amount of data that can be stored in a ConfigMap/ZooKeeper node
> > > > if
> > > > > > > > > versioning the resource requirement change as proposed in my
> > > > > previous
> > > > > > > > item
> > > > > > > > > is an option for us.
> > > > > > > > > - Updating the JobGraphStore means adding more requests to the
> > > HA
> > > > > > > backend
> > > > > > > > > API. There were some concerns shared in the discussion thread
> > > [2]
> > > > > for
> > > > > > > > > FLIP-270 [3] on pressuring the k8s API server in the past with
> > > > > > > > > too many calls. Even though it's more likely to be caused by
> > > > > > > > > checkpointing, I still wanted to bring it up. We're working on a
> > > > > > > > > standardized performance test to prepare going forward with
> > > > > > > > > FLIP-270 [3] right now.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Matthias
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> > > > > > > > > [2]
> > > > > https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> > > > > > > > > [3]
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> > > > > > > > >
> > > > > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <jam.gzczy@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi David:
> > > > > > > > > >
> > > > > > > > > > Thank you for driving this FLIP, which helps reduce Flink
> > > > > > > > > > shutdown time.
> > > > > > > > > >
> > > > > > > > > > For this FLIP, I would like to share a few ideas:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >    - when the number of "slots" is insufficient, can we stop
> > > > > > > > > >    users from rescaling or throw something to tell the user "not
> > > > > > > > > >    enough available slots to upgrade, please check your available
> > > > > > > > > >    slots"? Or we could have a request switch (true/false) to
> > > > > > > > > >    allow this behavior
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >    - when users upgrade job-vertex parallelism, I want to have
> > > > > > > > > >    an interface to query the current status of the parallelism
> > > > > > > > > >    update, so that the user or program can understand the current
> > > > > > > > > >    status
> > > > > > > > > >    - I want to have an interface to query the current status of
> > > > > > > > > >    the parallelism update. This also helps with management,
> > > > > > > > > >    similar to the *[1] Flink K8S Operator*
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > {
> > > > > > > > > >   status: Failed
> > > > > > > > > >   reason: "not enough available slots to upgrade, please
> > > > > > > > > >   check your available slots"
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >    - *Pending*: the job has joined the upgrade queue and
> > > > > > > > > >    will be updated later
> > > > > > > > > >    - *Rescaling*: the job is rescaling; wait for it to finish
> > > > > > > > > >    - *Finished*: the rescale is done
> > > > > > > > > >    - *Failed*: something went wrong, so this job cannot be
> > > > > > > > > >    upgraded
> > > > > > > > > >
> > > > > > > > > > I would like to add the above content to the FLIP. What do
> > > > > > > > > > you think?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >    1.
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 3, 2023 at 16:42, David Morávek <dm...@apache.org> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi everyone,
> > > > > > > > > > >
> > > > > > > > > > > This FLIP [1] introduces a new REST API for declaring
> > > > resource
> > > > > > > > > > requirements
> > > > > > > > > > > for the Adaptive Scheduler. There seems to be a clear need
> > > > for
> > > > > > this
> > > > > > > > API
> > > > > > > > > > > based on the discussion on the "Reworking the Rescale API"
> > > > [2]
> > > > > > > > thread.
> > > > > > > > > > >
> > > > > > > > > > > Before we get started, this work is heavily based on the
> > > > > > prototype
> > > > > > > > [3]
> > > > > > > > > > > created by Till Rohrmann, and the FLIP is being published
> > > > with
> > > > > > his
> > > > > > > > > > consent.
> > > > > > > > > > > Big shoutout to him!
> > > > > > > > > > >
> > > > > > > > > > > Last but not least, thanks to Chesnay and Roman for the
> > > > initial
> > > > > > > > reviews
> > > > > > > > > > and
> > > > > > > > > > > discussions.
> > > > > > > > > > >
> > > > > > > > > > > The best start would be watching a short demo [4] that I've
> > > > > > > recorded,
> > > > > > > > > > which
> > > > > > > > > > > illustrates newly added capabilities (rescaling the running
> > > > > job,
> > > > > > > > > handing
> > > > > > > > > > > back resources to the RM, and session cluster support).
> > > > > > > > > > >
> > > > > > > > > > > The intuition behind the FLIP is being able to define
> > > > resource
> > > > > > > > > > requirements
> > > > > > > > > > > ("resource boundaries") externally that the
> > > AdaptiveScheduler
> > > > > can
> > > > > > > > > > navigate
> > > > > > > > > > > within. This is a building block for higher-level efforts
> > > > such
> > > > > as
> > > > > > > an
> > > > > > > > > > > external Autoscaler. The natural extension of this work
> > > > > > > > > > > would be to allow specifying per-vertex ResourceProfiles.
> > > > > > > > > > >
> > > > > > > > > > > Looking forward to your thoughts; any feedback is
> > > > appreciated!
> > > > > > > > > > >
> > > > > > > > > > > [1]
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > > > > > > > > [2]
> > > > > > > https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > > > > > > > > > > [3] https://github.com/tillrohrmann/flink/tree/autoscaling
> > > > > > > > > > > [4]
> > > > > > > > > >
> > > > > > >
> > > > https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > D.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best
> > > > > > > > > >
> > > > > > > > > > ConradJam
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by Maximilian Michels <mx...@apache.org>.
Hi David,

This is awesome! Great writeup and demo. This is pretty much what we
need for the autoscaler as part of the Flink Kubernetes operator [1].
Scaling Flink jobs effectively is hard but fortunately we have solved
the issue as part of the Flink Kubernetes operator. The only critical
piece we are missing is a better way to execute scaling decisions, as
discussed in [2].

Looking at your proposal, we would set lowerBound == upperBound for
the parallelism because we want to fully determine the parallelism
externally based on the scaling metrics. Does that sound right?
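For concreteness, a small sketch of how the autoscaler could build such a payload; the endpoint path and JSON shape below are my reading of the FLIP draft, not the final API:

```python
import json

# Hypothetical body for PUT /jobs/<job-id>/resource-requirements.
# Pinning lowerBound == upperBound asks the scheduler to run each
# vertex at exactly the externally chosen parallelism.
def pinned_requirements(vertex_parallelism):
    return {
        vertex_id: {"parallelism": {"lowerBound": p, "upperBound": p}}
        for vertex_id, p in vertex_parallelism.items()
    }

payload = pinned_requirements({"bc764cd8ddf7a0cff126f51c16239658": 4})
body = json.dumps(payload)  # what we'd send as the PUT request body
```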

What is the timeline for these changes? Is there a JIRA?

Cheers,
Max

[1] https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autoscaler/
[2] https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5

On Mon, Feb 13, 2023 at 1:16 PM feng xiangyu <xi...@gmail.com> wrote:
>
> Hi David,
>
> Thanks for your reply. I think your response totally makes sense. This
> FLIP targets declaring the required resources to the ResourceManager
> instead of using the ResourceManager to add/remove TMs directly.
>
> Best,
> Xiangyu
>
>
>
> On Mon, Feb 13, 2023 at 15:46, David Morávek <da...@gmail.com> wrote:
>
> > Hi everyone,
> >
> > @Shammon
> >
> > I'm not entirely sure what "config file" you're referring to. You can, of
> > course, override the default parallelism in "flink-conf.yaml", but for
> > sinks and sources, the parallelism needs to be tweaked on the connector
> > level ("WITH" statement).
> >
> > This is something that should be achieved with tooling around Flink. We
> > want to provide an API on the lowest level that generalizes well. Achieving
> > what you're describing should be straightforward with this API.
> >
> > @Xiangyu
> >
> > Is it possible for this REST API to declare TM resources in the future?
> >
> >
> > Would you like to add/remove TMs if you use an active Resource Manager?
> > This would be out of the scope of this effort since it targets the
> > scheduler component only (we make no assumptions about the used Resource
> > Manager). Also, the AdaptiveScheduler is only intended to be used for
> > Streaming.
> >
> >  And for streaming jobs, I'm wondering if there is any situation we need to
> > > rescale the TM resources of a flink cluster at first and then the
> > adaptive
> > > scheduler will rescale the per-vertex ResourceProfiles accordingly.
> > >
> >
> > We plan on adding support for the ResourceProfiles (dynamic slot
> > allocation) as the next step. Again we won't make any assumptions about the
> > used Resource Manager. In other words, this effort ends by declaring
> > desired resources to the Resource Manager.
> >
> > Does that make sense?
> >
> > @Matthias
> >
> > We've done another pass on the proposed API and currently lean towards
> > having an idempotent PUT API.
> > - We don't care too much about multiple writers' scenarios in terms of who
> > can write an authoritative payload; this is up to the user of the API to
> > figure out
> > - It's indeed tricky to achieve atomicity with PATCH API; switching to PUT
> > API seems to do the trick
> > - We won't allow partial "payloads" anymore, meaning you need to define
> > requirements for all vertices in the JobGraph; This is completely fine for
> > the programmatic workflows. For DEBUG / DEMO purposes, you can use the GET
> > endpoint and tweak the response to avoid writing the whole payload by hand.
> >
> > WDYT?
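To make the full-payload PUT semantics concrete, a sketch of the GET-then-PUT round trip a client would perform; since partial payloads are not allowed, it rewrites the whole requirements document. The field names are assumptions based on this thread, not the final API:

```python
# Fetch the current requirements (e.g. via the GET endpoint), adjust a
# single vertex, and PUT the entire document back. The surrounding
# vertices keep their existing bounds.
def with_updated_vertex(requirements, vertex_id, lower, upper):
    updated = {v: dict(spec) for v, spec in requirements.items()}
    updated[vertex_id] = {"parallelism": {"lowerBound": lower, "upperBound": upper}}
    return updated

current = {
    "source": {"parallelism": {"lowerBound": 1, "upperBound": 4}},
    "sink": {"parallelism": {"lowerBound": 1, "upperBound": 2}},
}
new_requirements = with_updated_vertex(current, "sink", 2, 2)
```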
> >
> >
> > Best,
> > D.
> >
> > On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <xi...@gmail.com>
> > wrote:
> >
> > > Hi David,
> > >
> > > Thanks for creating this FLIP. I think this work is very useful,
> > > especially in autoscaling scenarios. I would like to share some questions
> > > from my view.
> > >
> > > 1, Is it possible for this REST API to declare TM resources in the
> > future?
> > > I'm asking because we are building the autoscaling feature for Flink OLAP
> > > Session Cluster in ByteDance. We need to rescale the cluster's resource
> > on
> > > TM level instead of Job level. It would be very helpful if we have a REST
> > > API for our external Autoscaling service to use.
> > >
> > > 2, And for streaming jobs, I'm wondering if there is any situation we
> > need
> > > to rescale the TM resources of a flink cluster at first and then the
> > > adaptive scheduler will rescale the per-vertex ResourceProfiles
> > > accordingly.
> > >
> > > best.
> > > Xiangyu
> > >
> > > On Thu, Feb 9, 2023 at 11:31, Shammon FY <zj...@gmail.com> wrote:
> > >
> > > > Hi David
> > > >
> > > > Thanks for your answer.
> > > >
> > > > > Can you elaborate more about how you'd intend to use the endpoint? I
> > > > think we can ultimately introduce a way of re-declaring "per-vertex
> > > > defaults," but I'd like to understand the use case a bit more first.
> > > >
> > > > For this issue, I mainly consider the consistency of user configuration
> > > and
> > > > job runtime. For sql jobs, users usually set specific parallelism for
> > > > source and sink, and set a global parallelism for other operators.
> > These
> > > > config items are stored in a config file. For some high-priority jobs,
> > > > users may want to manage them manually.
> > > > 1. When users need to scale the parallelism, they should update the
> > > config
> > > > file and restart flink job, which may take a long time.
> > > > 2. After providing the REST API, users can just send a request to the
> > job
> > > > via REST API quickly after updating the config file.
> > > > The configuration in the running job and config file should be the
> > same.
> > > > What do you think of this?
> > > >
> > > > best.
> > > > Shammon
> > > >
> > > >
> > > >
> > > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <da...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > Let's try to answer the questions one by one.
> > > > >
> > > > > *@ConradJam*
> > > > >
> > > > > > when the number of "slots" is insufficient, can we stop users from
> > > > > > rescaling or throw something to tell the user "not enough available
> > > > > > slots to upgrade, please check your available slots"?
> > > > > >
> > > > >
> > > > > The main property of AdaptiveScheduler is that it can adapt to
> > > "available
> > > > > resources," which means you're still able to make progress even
> > though
> > > > you
> > > > > didn't get all the slots you've asked for. Let's break down the pros
> > > and
> > > > > cons of this property.
> > > > >
> > > > > - (plus) If you lose a TM for some reason, you can still recover even
> > > if
> > > > it
> > > > > doesn't come back. We still need to give it some time to eliminate
> > > > > unnecessary rescaling, which can be controlled by setting
> > > > > "resource-stabilization-timeout."
> > > > > - (plus) The resources can arrive with a significant delay. For
> > > example,
> > > > > you're unable to spawn enough TMs on time because you've run out of
> > > > > resources in your k8s cluster, and you need to wait for the cluster
> > > auto
> > > > > scaler to kick in and add new nodes to the cluster. In this scenario,
> > > > > you'll be able to start making progress faster, at the cost of
> > multiple
> > > > > rescalings (once the remaining resources arrive).
> > > > > - (plus) This plays well with the declarative manner of today's
> > > > > infrastructure. For example, you tell k8s that you need 10 TMs, and
> > > > you'll
> > > > > eventually get them.
> > > > > - (minus) In the case of large state jobs, the cost of multiple
> > > > rescalings
> > > > > might outweigh the above.
> > > > >
> > > > > We've already touched on the solution to this problem on the FLIP.
> > > Please
> > > > > notice the parallelism knob being a range with a lower and upper
> > bound.
> > > > > Setting both the lower and upper bound to the same value could give
> > the
> > > > > behavior you're describing at the cost of giving up some properties
> > > that
> > > > AS
> > > > > gives you (you'd be falling back to the DefaultScheduler's behavior).
> > > > >
> > > > > > when users upgrade job-vertex parallelism, I want to have an
> > > > > > interface to query the current status of the parallelism update, so
> > > > > > that the user or program can understand the current status
> > > > > >
> > > > >
> > > > > This is a misunderstanding. We're not introducing the RESCALE
> > endpoint.
> > > > > This endpoint allows you to re-declare the resources needed to run
> > the
> > > > job.
> > > > > Once you reach the desired resources (you get more resources than the
> > > > lower
> > > > > bound defines), your job will run.
> > > > >
> > > > > We can expose a similar endpoint to "resource requirements" to give
> > you
> > > > an
> > > > > overview of the resources the vertices already have. You can already
> > > get
> > > > > this from the REST API, so exposing this in yet another way should be
> > > > > considered carefully.
> > > > >
> > > > > *@Matthias*
> > > > >
> > > > > I'm wondering whether it makes sense to add some kind of resource ID
> > to
> > > > the
> > > > > > REST API.
> > > > >
> > > > >
> > > > > That's a good question. I want to think about that and get back to
> > the
> > > > > question later. My main struggle when thinking about this is, "if
> > this
> > > > > would be an idempotent POST endpoint," would it be any different?
> > > > >
> > > > > How often do we allow resource requirements to be changed?
> > > > >
> > > > >
> > > > > There shall be no rate limiting on the Flink side. If this is something
> > > > > your environment needs, you can achieve it on a different layer ("we
> > > > > can't have Flink do everything").
> > > > >
> > > > > Versioning the JobGraph in the JobGraphStore rather than overwriting
> > it
> > > > > > might be an idea.
> > > > > >
> > > > >
> > > > > This sounds interesting since it would be closer to the JobGraph
> > being
> > > > > immutable. The main problem I see here is that this would introduce a
> > > > > BW-incompatible change so it might be a topic for follow-up FLIP.
> > > > >
> > > > > I'm just wondering whether we bundle two things together that are
> > > > actually
> > > > > > separate
> > > > > >
> > > > >
> > > > > Yup, this is how we think about it as well. The main question is,
> > "who
> > > > > should be responsible for bookkeeping 1) the JobGraph and 2) the
> > > > > JobResourceRequirements". The JobMaster would be the right place for
> > > > both,
> > > > > but it's currently not the case, and we're tightly coupling the
> > > > dispatcher
> > > > > with the JobMaster.
> > > > >
> > > > > Initially, we tried to introduce a separate HA component in JobMaster
> > > for
> > > > > bookkeeping the JobResourceRequirements, but that proved to be a more
> > > > > significant effort adding additional mess to the already messy HA
> > > > > ecosystem. Another approach we've discussed was mutating the JobGraph
> > > and
> > > > > setting JRR into the JobGraph structure itself.
> > > > >
> > > > > The middle ground for keeping this effort reasonably sized and not
> > > > > violating "we want to keep JG immutable" too much is keeping the
> > > > > JobResourceRequirements separate as an internal config option in
> > > > JobGraph's
> > > > > configuration.
> > > > >
> > > > > We ultimately need to rethink the tight coupling of Dispatcher and
> > > > > JobMaster, but it needs to be a separate effort.
> > > > >
> > > > > ...also considering the amount of data that can be stored in a
> > > > > > ConfigMap/ZooKeeper node if versioning the resource requirement
> > > change
> > > > as
> > > > > > proposed in my previous item is an option for us.
> > > > > >
> > > > >
> > > > > AFAIK we're only storing pointers to the S3 objects in HA metadata,
> > so
> > > we
> > > > > should be okay with having larger structures for now.
> > > > >
> > > > > Updating the JobGraphStore means adding more requests to the HA
> > backend
> > > > > API.
> > > > > >
> > > > >
> > > > > It's fine unless you intend to override the resource requirements a
> > few
> > > > > times per second.
> > > > >
> > > > > *@Shammon*
> > > > >
> > > > > How about adding some more information such as vertex type
> > > > > >
> > > > >
> > > > > Since it was intended as a "debug" endpoint, it makes complete sense!
> > > > >
> > > > >  For sql jobs, we always use a unified parallelism for most vertices.
> > > Can
> > > > > > we provide them with a more convenient setting method instead of
> > each
> > > > > one?
> > > > >
> > > > >
> > > > > I completely sympathize with this. The main thoughts when designing the API
> > > > were:
> > > > > - We want to keep it clean and easy to understand.
> > > > > - Global parallelism can be modeled using per-vertex parallelism but
> > > not
> > > > > the other way around.
> > > > > - The API will be used by external tooling (operator, auto scaler).
> > > > >
> > > > > Can you elaborate more about how you'd intend to use the endpoint? I
> > > > think
> > > > > we can ultimately introduce a way of re-declaring "per-vertex
> > > defaults,"
> > > > > but I'd like to understand the use case a bit more first.
> > > > >
> > > > > *@Weijie*
> > > > >
> > > > > What is the default value here (based on what configuration), or just
> > > > > > infinite?
> > > > > >
> > > > >
> > > > > Currently, for the lower bound, it's always one, and for the upper
> > > bound,
> > > > > it's either parallelism (if defined) or the maxParallelism of the
> > > vertex
> > > > in
> > > > > JobGraph. This question might be another signal for making the
> > defaults
> > > > > explicit (see the answer to Shammon's question above).
> > > > >
> > > > >
> > > > > Thanks, everyone, for your initial thoughts!
> > > > >
> > > > > Best,
> > > > > D.
> > > > >
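[Editor's note: the default-bounds rule David describes above can be sketched as follows; the function name and signature are illustrative, not actual Flink code.]

```python
# Sketch of the described defaults: the lower bound is always 1, and the
# upper bound falls back to the vertex's maxParallelism when no explicit
# parallelism is configured for the vertex.
def default_bounds(parallelism, max_parallelism):
    """Return the default (lower, upper) parallelism range for a job vertex."""
    upper = parallelism if parallelism is not None else max_parallelism
    return (1, upper)

print(default_bounds(4, 128))     # explicit parallelism wins
print(default_bounds(None, 128))  # falls back to maxParallelism
```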
> > > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <guoweijiereswqa@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Thanks David for driving this. This is very valuable work, especially
> > > > > > for cloud-native environments.
> > > > > >
> > > > > > >> How about adding some more information such as vertex type
> > > > > > (SOURCE/MAP/JOIN, etc.) in the response of `get jobs
> > > > > > resource-requirements`? For users, only vertex-id may be difficult
> > to
> > > > > > understand.
> > > > > >
> > > > > > +1 for this suggestion, including jobvertex's name in the response
> > > body
> > > > > is
> > > > > > more
> > > > > > user-friendly.
> > > > > >
> > > > > >
> > > > > > I saw this sentence in FLIP: "Setting the upper bound to -1 will
> > > reset
> > > > > the
> > > > > > value to the default setting."  What is the default value here
> > (based
> > > > on
> > > > > > what configuration), or just infinite?
> > > > > >
> > > > > >
> > > > > > Best regards,
> > > > > >
> > > > > > Weijie
> > > > > >
> > > > > >
> > > > > >
> > > > > > Shammon FY <zj...@gmail.com> wrote on Mon, Feb 6, 2023, at 18:06:
> > > > > >
> > > > > > > Hi David
> > > > > > >
> > > > > > > Thanks for initiating this discussion. I think declaring job
> > > resource
> > > > > > > requirements by REST API is very valuable. I just left some comments
> > > > > > > as follows
> > > > > > >
> > > > > > > 1) How about adding some more information such as vertex type
> > > > > > > (SOURCE/MAP/JOIN, etc.) in the response of `get jobs
> > > > > > > resource-requirements`? For users, only vertex-id may be
> > difficult
> > > to
> > > > > > > understand.
> > > > > > >
> > > > > > > 2) For sql jobs, we always use a unified parallelism for most
> > > > vertices.
> > > > > > Can
> > > > > > > we provide them with a more convenient setting method instead of
> > > each
> > > > > > one?
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > > Shammon
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <
> > > matthias.pohl@aiven.io
> > > > > > > .invalid>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks David for creating this FLIP. It sounds promising and
> > > useful
> > > > > to
> > > > > > > > have. Here are some thoughts from my side (some of them might
> > be
> > > > > > rather a
> > > > > > > > follow-up and not necessarily part of this FLIP):
> > > > > > > > - I'm wondering whether it makes sense to add some kind of
> > > resource
> > > > > ID
> > > > > > to
> > > > > > > > the REST API. This would give Flink a tool to verify the PATCH
> > > > > request
> > > > > > of
> > > > > > > > the external system in a compare-and-set kind of manner. AFAIU,
> > > the
> > > > > > > process
> > > > > > > > requires the external system to retrieve the resource
> > > requirements
> > > > > > first
> > > > > > > > (to retrieve the vertex IDs). A resource ID <ABC> would be sent
> > > > along
> > > > > > as
> > > > > > > a
> > > > > > > > unique identifier for the provided setup. It's essentially the
> > > > > version
> > > > > > ID
> > > > > > > > of the currently deployed resource requirement configuration.
> > > Flink
> > > > > > > doesn't
> > > > > > > > know whether the external system would use the provided
> > > information
> > > > > in
> > > > > > > some
> > > > > > > > way to derive a new set of resource requirements for this job.
> > > The
> > > > > > > > subsequent PATCH request with updated resource requirements
> > would
> > > > > > include
> > > > > > > > the previously retrieved resource ID <ABC>. The PATCH call
> > would
> > > > fail
> > > > > > if
> > > > > > > > there was a concurrent PATCH call in between indicating to the
> > > > > external
> > > > > > > > system that the resource requirements were concurrently
> > updated.
> > > > > > > > - How often do we allow resource requirements to be changed?
> > That
> > > > > > > question
> > > > > > > > might make my previous comment on the resource ID obsolete
> > > because
> > > > we
> > > > > > > could
> > > > > > > > just make any PATCH call fail if there was a resource
> > requirement
> > > > > > update
> > > > > > > > within a certain time frame before the request. But such a time
> > > > > period
> > > > > > is
> > > > > > > > something we might want to make configurable then, I guess.
> > > > > > > > - Versioning the JobGraph in the JobGraphStore rather than
> > > > > overwriting
> > > > > > it
> > > > > > > > might be an idea. This would enable us to provide resource
> > > > > requirement
> > > > > > > > changes in the UI or through the REST API. It is related to a
> > > > problem
> > > > > > > > around keeping track of the exception history within the
> > > > > > > AdaptiveScheduler
> > > > > > > > and also having to consider multiple versions of a JobGraph.
> > But
> > > > for
> > > > > > that
> > > > > > > > one, we use the ExecutionGraphInfoStore right now.
> > > > > > > > - Updating the JobGraph in the JobGraphStore makes sense. I'm
> > > just
> > > > > > > > wondering whether we bundle two things together that are
> > actually
> > > > > > > separate:
> > > > > > > > The business logic and the execution configuration (the
> > resource
> > > > > > > > requirements). I'm aware that this is not a flaw of the current
> > > > FLIP
> > > > > > but
> > > > > > > > rather something that was not necessary to address in the past
> > > > > because
> > > > > > > the
> > > > > > > > JobGraph was kind of static. I don't remember whether that was
> > > > > already
> > > > > > > > discussed while working on the AdaptiveScheduler for FLIP-160
> > > [1].
> > > > > > Maybe,
> > > > > > > > I'm missing some functionality here that requires us to have
> > > > > everything
> > > > > > > in
> > > > > > > > one place. But it feels like updating the entire JobGraph which
> > > > could
> > > > > > be
> > > > > > > > actually a "config change" is not reasonable. ...also
> > considering
> > > > the
> > > > > > > > amount of data that can be stored in a ConfigMap/ZooKeeper node
> > > if
> > > > > > > > versioning the resource requirement change as proposed in my
> > > > previous
> > > > > > > item
> > > > > > > > is an option for us.
> > > > > > > > - Updating the JobGraphStore means adding more requests to the
> > HA
> > > > > > backend
> > > > > > > > API. There were some concerns shared in the discussion thread
> > [2]
> > > > for
> > > > > > > > FLIP-270 [3] on pressuring the k8s API server in the past with
> > > too
> > > > > many
> > > > > > > > calls. Even though it's more likely to be caused by checkpointing,
> > > > > > > > I still
> > > > > > > > wanted to bring it up. We're working on a standardized
> > > performance
> > > > > test
> > > > > > > to
> > > > > > > > prepare going forward with FLIP-270 [3] right now.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Matthias
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> > > > > > > > [2]
> > > > https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> > > > > > > > [3]
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> > > > > > > >
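[Editor's note: Matthias's resource-ID idea above is a classic optimistic-concurrency scheme. A minimal sketch, assuming a version counter per stored requirements document (class and method names are made up for illustration):]

```python
class VersionedRequirements:
    """Toy compare-and-set store: every stored requirements document
    carries a version; an update only succeeds when the caller presents
    the version it previously read."""

    def __init__(self, doc):
        self._doc = dict(doc)
        self._version = 1

    def read(self):
        # Return the current version together with a copy of the document.
        return self._version, dict(self._doc)

    def update(self, expected_version, new_doc):
        # Reject the write if someone else updated the document in between.
        if expected_version != self._version:
            raise RuntimeError("concurrent update detected; re-read and retry")
        self._doc = dict(new_doc)
        self._version += 1
        return self._version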
> > > > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <jam.gzczy@gmail.com
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi David:
> > > > > > > > >
> > > > > > > > > Thank you for driving this FLIP, which helps reduce Flink
> > > > > > > > > shutdown time
> > > > > > > > >
> > > > > > > > > For this FLIP, I would like to share a few ideas:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >    - when the number of "slots" is insufficient, can we stop the
> > > > > > > > >    user's rescaling or tell the user "not enough available slots
> > > > > > > > >    to upgrade, please check your available slots"? Or we could
> > > > > > > > >    have a request switch (true/false) to allow this behavior
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >    - when the user updates the job-vertex parallelism, I want to
> > > > > > > > >    have an interface to query the status of the parallelism
> > > > > > > > >    update, so that the user or a program can understand the
> > > > > > > > >    current status
> > > > > > > > >    - I want to have an interface to query the current parallelism
> > > > > > > > >    update status. This would also help with management, similar to
> > > > > > > > >    the *[1] Flink K8s Operator*
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > {
> > > > > > > > >   status: Failed
> > > > > > > > >   reason: "not enough available slots to upgrade, please check
> > > > > > > > >   your available slots"
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >    - *Pending*: the job has joined the upgrade queue and will be
> > > > > > > > >    updated later
> > > > > > > > >    - *Rescaling*: the job is currently rescaling; wait for it to
> > > > > > > > >    finish
> > > > > > > > >    - *Finished*: the rescaling has completed
> > > > > > > > >    - *Failed*: something went wrong, so the job could not be
> > > > > > > > >    upgraded
> > > > > > > > >
> > > > > > > > > I would like to add the above content to the FLIP. What do you
> > > > > > > > > think?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >    1.
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > David Morávek <dm...@apache.org> wrote on Fri, Feb 3, 2023, at 16:42:
> > > > > > > > >
> > > > > > > > > > Hi everyone,
> > > > > > > > > >
> > > > > > > > > > This FLIP [1] introduces a new REST API for declaring
> > > resource
> > > > > > > > > requirements
> > > > > > > > > > for the Adaptive Scheduler. There seems to be a clear need
> > > for
> > > > > this
> > > > > > > API
> > > > > > > > > > based on the discussion on the "Reworking the Rescale API"
> > > [2]
> > > > > > > thread.
> > > > > > > > > >
> > > > > > > > > > Before we get started, this work is heavily based on the
> > > > > prototype
> > > > > > > [3]
> > > > > > > > > > created by Till Rohrmann, and the FLIP is being published
> > > with
> > > > > his
> > > > > > > > > consent.
> > > > > > > > > > Big shoutout to him!
> > > > > > > > > >
> > > > > > > > > > Last but not least, thanks to Chesnay and Roman for the
> > > initial
> > > > > > > reviews
> > > > > > > > > and
> > > > > > > > > > discussions.
> > > > > > > > > >
> > > > > > > > > > The best start would be watching a short demo [4] that I've
> > > > > > recorded,
> > > > > > > > > which
> > > > > > > > > > illustrates newly added capabilities (rescaling the running
> > > > job,
> > > > > > > > handing
> > > > > > > > > > back resources to the RM, and session cluster support).
> > > > > > > > > >
> > > > > > > > > > The intuition behind the FLIP is being able to define
> > > resource
> > > > > > > > > requirements
> > > > > > > > > > ("resource boundaries") externally that the
> > AdaptiveScheduler
> > > > can
> > > > > > > > > navigate
> > > > > > > > > > within. This is a building block for higher-level efforts
> > > such
> > > > as
> > > > > > an
> > > > > > > > > > external Autoscaler. The natural extension of this work
> > would
> > > > be
> > > > > to
> > > > > > > > allow
> > > > > > > > > > to specify per-vertex ResourceProfiles.
> > > > > > > > > >
> > > > > > > > > > Looking forward to your thoughts; any feedback is
> > > appreciated!
> > > > > > > > > >
> > > > > > > > > > [1]
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > > > > > > > [2]
> > > > > > https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > > > > > > > > > [3] https://github.com/tillrohrmann/flink/tree/autoscaling
> > > > > > > > > > [4]
> > > > > > > > >
> > > > > >
> > > https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > D.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best
> > > > > > > > >
> > > > > > > > > ConradJam
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by feng xiangyu <xi...@gmail.com>.
Hi David,

Thanks for your reply. I think your response totally makes sense. This
FLIP targets declaring the required resources to the ResourceManager instead
of using the ResourceManager to add/remove TMs directly.

Best,
Xiangyu
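[Editor's note: the declarative model discussed above — telling the scheduler the per-vertex parallelism ranges you want, rather than imperatively adding TMs — can be sketched as a payload builder. The payload shape and the example vertex ID are assumptions for illustration, not the actual REST schema.]

```python
import json

def build_requirements_payload(per_vertex_bounds):
    """Build a JSON body declaring per-vertex parallelism ranges.

    per_vertex_bounds maps a JobVertexID string to a (lower, upper) pair,
    mirroring the FLIP's "parallelism is a range" model.
    """
    body = {}
    for vertex_id, (lower, upper) in per_vertex_bounds.items():
        if not 1 <= lower <= upper:
            raise ValueError(
                f"invalid range [{lower}, {upper}] for vertex {vertex_id}")
        body[vertex_id] = {"parallelism": {"lowerBound": lower,
                                           "upperBound": upper}}
    return json.dumps(body)

# The vertex ID below is made up; real IDs come from the job's REST API.
payload = build_requirements_payload(
    {"0a448493b4782967b150582570326227": (1, 4)})
```

Setting both bounds to the same value recovers the fixed-parallelism behavior discussed earlier in the thread.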



David Morávek <da...@gmail.com> wrote on Mon, Feb 13, 2023, at 15:46:

> Hi everyone,
>
> @Shammon
>
> I'm not entirely sure what "config file" you're referring to. You can, of
> course, override the default parallelism in "flink-conf.yaml", but for
> sinks and sources, the parallelism needs to be tweaked on the connector
> level ("WITH" statement).
>
> This is something that should be achieved with tooling around Flink. We
> want to provide an API on the lowest level that generalizes well. Achieving
> what you're describing should be straightforward with this API.
>
> @Xiangyu
>
> Is it possible for this REST API to declare TM resources in the future?
>
>
> Would you like to add/remove TMs if you use an active Resource Manager?
> This would be out of the scope of this effort since it targets the
> scheduler component only (we make no assumptions about the used Resource
> Manager). Also, the AdaptiveScheduler is only intended to be used for
> Streaming.
>
>  And for streaming jobs, I'm wondering if there is any situation we need to
> > rescale the TM resources of a flink cluster at first and then the
> adaptive
> > scheduler will rescale the per-vertex ResourceProfiles accordingly.
> >
>
> We plan on adding support for the ResourceProfiles (dynamic slot
> allocation) as the next step. Again we won't make any assumptions about the
> used Resource Manager. In other words, this effort ends by declaring
> desired resources to the Resource Manager.
>
> Does that make sense?
>
> @Matthias
>
> We've done another pass on the proposed API and currently lean towards
> having an idempotent PUT API.
> - We don't care too much about multiple writers' scenarios in terms of who
> can write an authoritative payload; this is up to the user of the API to
> figure out
> - It's indeed tricky to achieve atomicity with PATCH API; switching to PUT
> API seems to do the trick
> - We won't allow partial "payloads" anymore, meaning you need to define
> requirements for all vertices in the JobGraph; This is completely fine for
> the programmatic workflows. For DEBUG / DEMO purposes, you can use the GET
> endpoint and tweak the response to avoid writing the whole payload by hand.
>
> WDYT?
>
>
> Best,
> D.
>
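[Editor's note: the idempotent PUT semantics described above — GET the full requirements document, tweak it, PUT the whole thing back — can be sketched with an in-memory stand-in. The class and document shape are assumptions for illustration, not a real Flink client.]

```python
class FakeRequirementsApi:
    """In-memory stand-in for the proposed endpoint, illustrating the
    read-modify-write flow with full (non-partial) payloads."""

    def __init__(self, initial):
        self._doc = {k: dict(v) for k, v in initial.items()}

    def get(self):
        return {k: dict(v) for k, v in self._doc.items()}

    def put(self, full_doc):
        # PUT replaces the whole document; partial payloads are rejected,
        # mirroring the "no partial payloads anymore" decision above.
        if set(full_doc) != set(self._doc):
            raise ValueError("payload must cover every vertex in the JobGraph")
        self._doc = {k: dict(v) for k, v in full_doc.items()}

api = FakeRequirementsApi({
    "source": {"lowerBound": 1, "upperBound": 2},
    "mapper": {"lowerBound": 1, "upperBound": 2},
})
doc = api.get()
doc["mapper"]["upperBound"] = 8  # scale a single vertex's upper bound
api.put(doc)                     # re-submit the complete document
```

Because PUT is idempotent, retrying the same request after a timeout is safe: resubmitting an identical document leaves the stored requirements unchanged.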
> On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <xi...@gmail.com>
> wrote:
>
> > Hi David,
> >
> > Thanks for creating this FLIP. I think this work is very useful,
> > especially in autoscaling scenarios. I would like to share some questions
> > from my view.
> >
> > 1, Is it possible for this REST API to declare TM resources in the
> future?
> > I'm asking because we are building the autoscaling feature for Flink OLAP
> > Session Cluster in ByteDance. We need to rescale the cluster's resource
> on
> > TM level instead of Job level. It would be very helpful if we have a REST
> > API for our external autoscaling service to use.
> >
> > 2, And for streaming jobs, I'm wondering if there is any situation we
> need
> > to rescale the TM resources of a flink cluster at first and then the
> > adaptive scheduler will rescale the per-vertex ResourceProfiles
> > accordingly.
> >
> > best.
> > Xiangyu
> >
> > Shammon FY <zj...@gmail.com> wrote on Thu, Feb 9, 2023, at 11:31:
> >
> > > Hi David
> > >
> > > Thanks for your answer.
> > >
> > > > Can you elaborate more about how you'd intend to use the endpoint? I
> > > think we can ultimately introduce a way of re-declaring "per-vertex
> > > defaults," but I'd like to understand the use case bit more first.
> > >
> > > For this issue, I mainly consider the consistency of user configuration
> > and
> > > job runtime. For sql jobs, users usually set specific parallelism for
> > > source and sink, and set a global parallelism for other operators.
> These
> > > config items are stored in a config file. For some high-priority jobs,
> > > users may want to manage them manually.
> > > 1. When users need to scale the parallelism, they should update the
> > config
> > > file and restart flink job, which may take a long time.
> > > 2. After providing the REST API, users can just send a request to the
> job
> > > via REST API quickly after updating the config file.
> > > The configuration in the running job and config file should be the
> same.
> > > What do you think of this?
> > >
> > > best.
> > > Shammon
> > >
> > >
> > >
> > > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <da...@gmail.com>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Let's try to answer the questions one by one.
> > > >
> > > > *@ConradJam*
> > > >
> > > > when the number of "slots" is insufficient, can we can stop users
> > > rescaling
> > > > > or throw something to tell user "less avaliable slots to upgrade,
> > > please
> > > > > checkout your alivalbe slots" ?
> > > > >
> > > >
> > > > The main property of AdaptiveScheduler is that it can adapt to
> > "available
> > > > resources," which means you're still able to make progress even
> though
> > > you
> > > > didn't get all the slots you've asked for. Let's break down the pros
> > and
> > > > cons of this property.
> > > >
> > > > - (plus) If you lose a TM for some reason, you can still recover even
> > if
> > > it
> > > > doesn't come back. We still need to give it some time to eliminate
> > > > unnecessary rescaling, which can be controlled by setting
> > > > "resource-stabilization-timeout."
> > > > - (plus) The resources can arrive with a significant delay. For
> > example,
> > > > you're unable to spawn enough TMs on time because you've run out of
> > > > resources in your k8s cluster, and you need to wait for the cluster
> > auto
> > > > scaler to kick in and add new nodes to the cluster. In this scenario,
> > > > you'll be able to start making progress faster, at the cost of
> multiple
> > > > rescalings (once the remaining resources arrive).
> > > > - (plus) This plays well with the declarative manner of today's
> > > > infrastructure. For example, you tell k8s that you need 10 TMs, and
> > > you'll
> > > > eventually get them.
> > > > - (minus) In the case of large state jobs, the cost of multiple
> > > rescalings
> > > > might outweigh the above.
> > > >
> > > > We've already touched on the solution to this problem on the FLIP.
> > Please
> > > > notice the parallelism knob being a range with a lower and upper
> bound.
> > > > Setting both the lower and upper bound to the same value could give
> the
> > > > behavior you're describing at the cost of giving up some properties
> > that
> > > AS
> > > > gives you (you'd be falling back to the DefaultScheduler's behavior).
> > > >
> > > > when user upgrade job-vertx-parallelism . I want to have an interface
> > to
> > > > > query the current update parallel execution status, so that the
> user
> > or
> > > > > program can understand the current status
> > > > >
> > > >
> > > > This is a misunderstanding. We're not introducing the RESCALE
> endpoint.
> > > > This endpoint allows you to re-declare the resources needed to run
> the
> > > job.
> > > > Once you reach the desired resources (you get more resources than the
> > > lower
> > > > bound defines), your job will run.
> > > >
> > > > We can expose a similar endpoint to "resource requirements" to give
> you
> > > an
> > > > overview of the resources the vertices already have. You can already
> > get
> > > > this from the REST API, so exposing this in yet another way should be
> > > > considered carefully.
> > > >
> > > > *@Matthias*
> > > >
> > > > I'm wondering whether it makes sense to add some kind of resource ID
> to
> > > the
> > > > > REST API.
> > > >
> > > >
> > > > That's a good question. I want to think about that and get back to
> the
> > > > question later. My main struggle when thinking about this is, "if
> this
> > > > would be an idempotent POST endpoint," would it be any different?
> > > >
> > > > How often do we allow resource requirements to be changed?
> > > >
> > > >
> > > > There shall be no rate limiting on the Flink side. If this is something
> > > > your environment needs, you can achieve it on a different layer ("we
> > > > can't have Flink do everything").
> > > >
> > > > Versioning the JobGraph in the JobGraphStore rather than overwriting
> it
> > > > > might be an idea.
> > > > >
> > > >
> > > > This sounds interesting since it would be closer to the JobGraph
> being
> > > > immutable. The main problem I see here is that this would introduce a
> > > > BW-incompatible change so it might be a topic for follow-up FLIP.
> > > >
> > > > I'm just wondering whether we bundle two things together that are
> > > actually
> > > > > separate
> > > > >
> > > >
> > > > Yup, this is how we think about it as well. The main question is,
> "who
> > > > should be responsible for bookkeeping 1) the JobGraph and 2) the
> > > > JobResourceRequirements". The JobMaster would be the right place for
> > > both,
> > > > but it's currently not the case, and we're tightly coupling the
> > > dispatcher
> > > > with the JobMaster.
> > > >
> > > > Initially, we tried to introduce a separate HA component in JobMaster
> > for
> > > > bookkeeping the JobResourceRequirements, but that proved to be a more
> > > > significant effort adding additional mess to the already messy HA
> > > > ecosystem. Another approach we've discussed was mutating the JobGraph
> > and
> > > > setting JRR into the JobGraph structure itself.
> > > >
> > > > The middle ground for keeping this effort reasonably sized and not
> > > > violating "we want to keep JG immutable" too much is keeping the
> > > > JobResourceRequirements separate as an internal config option in
> > > JobGraph's
> > > > configuration.
> > > >
> > > > We ultimately need to rethink the tight coupling of Dispatcher and
> > > > JobMaster, but it needs to be a separate effort.
> > > >
> > > > ...also considering the amount of data that can be stored in a
> > > > > ConfigMap/ZooKeeper node if versioning the resource requirement
> > change
> > > as
> > > > > proposed in my previous item is an option for us.
> > > > >
> > > >
> > > > AFAIK we're only storing pointers to the S3 objects in HA metadata,
> so
> > we
> > > > should be okay with having larger structures for now.
> > > >
> > > > Updating the JobGraphStore means adding more requests to the HA
> backend
> > > > API.
> > > > >
> > > >
> > > > It's fine unless you intend to override the resource requirements a
> few
> > > > times per second.
> > > >
> > > > *@Shammon*
> > > >
> > > > How about adding some more information such as vertex type
> > > > >
> > > >
> > > > Since it was intended as a "debug" endpoint, it makes complete sense!
> > > >
> > > >  For sql jobs, we always use a unified parallelism for most vertices.
> > Can
> > > > > we provide them with a more convenient setting method instead of
> each
> > > > one?
> > > >
> > > >
> > > > I completely sympathize with this. The main thoughts when designing the API
> > > were:
> > > > - We want to keep it clean and easy to understand.
> > > > - Global parallelism can be modeled using per-vertex parallelism but
> > not
> > > > the other way around.
> > > > - The API will be used by external tooling (operator, auto scaler).
> > > >
> > > > Can you elaborate more about how you'd intend to use the endpoint? I
> > > think
> > > > we can ultimately introduce a way of re-declaring "per-vertex
> > defaults,"
> > > > but I'd like to understand the use case a bit more first.
> > > >
> > > > *@Weijie*
> > > >
> > > > What is the default value here (based on what configuration), or just
> > > > > infinite?
> > > > >
> > > >
> > > > Currently, for the lower bound, it's always one, and for the upper
> > bound,
> > > > it's either parallelism (if defined) or the maxParallelism of the
> > vertex
> > > in
> > > > JobGraph. This question might be another signal for making the
> defaults
> > > > explicit (see the answer to Shammon's question above).
> > > >
> > > >
> > > > Thanks, everyone, for your initial thoughts!
> > > >
> > > > Best,
> > > > D.
> > > >
> > > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <guoweijiereswqa@gmail.com
> >
> > > > wrote:
> > > >
> > > > > Thanks David for driving this. This is very valuable work, especially
> > > > > for cloud-native environments.
> > > > >
> > > > > >> How about adding some more information such as vertex type
> > > > > (SOURCE/MAP/JOIN, etc.) in the response of `get jobs
> > > > > resource-requirements`? For users, only vertex-id may be difficult
> to
> > > > > understand.
> > > > >
> > > > > +1 for this suggestion, including jobvertex's name in the response
> > body
> > > > is
> > > > > more
> > > > > user-friendly.
> > > > >
> > > > >
> > > > > I saw this sentence in FLIP: "Setting the upper bound to -1 will
> > reset
> > > > the
> > > > > value to the default setting."  What is the default value here
> (based
> > > on
> > > > > what configuration), or just infinite?
> > > > >
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Weijie
> > > > >
> > > > >
> > > > >
> > > > > Shammon FY <zj...@gmail.com> wrote on Mon, Feb 6, 2023, at 18:06:
> > > > >
> > > > > > Hi David
> > > > > >
> > > > > > Thanks for initiating this discussion. I think declaring job
> > resource
> > > > > > requirements by REST API is very valuable. I just left some comments
> > > > > > as follows
> > > > > >
> > > > > > 1) How about adding some more information such as vertex type
> > > > > > (SOURCE/MAP/JOIN, etc.) in the response of `get jobs
> > > > > > resource-requirements`? For users, only vertex-id may be
> difficult
> > to
> > > > > > understand.
> > > > > >
> > > > > > 2) For sql jobs, we always use a unified parallelism for most
> > > vertices.
> > > > > Can
> > > > > > we provide them with a more convenient setting method instead of
> > each
> > > > > one?
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Shammon
> > > > > >
> > > > > >
> > > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <
> > matthias.pohl@aiven.io
> > > > > > .invalid>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks David for creating this FLIP. It sounds promising and
> > useful
> > > > to
> > > > > > > have. Here are some thoughts from my side (some of them might
> be
> > > > > rather a
> > > > > > > follow-up and not necessarily part of this FLIP):
> > > > > > > - I'm wondering whether it makes sense to add some kind of
> > resource
> > > > ID
> > > > > to
> > > > > > > the REST API. This would give Flink a tool to verify the PATCH
> > > > request
> > > > > of
> > > > > > > the external system in a compare-and-set kind of manner. AFAIU,
> > the
> > > > > > process
> > > > > > > requires the external system to retrieve the resource
> > requirements
> > > > > first
> > > > > > > (to retrieve the vertex IDs). A resource ID <ABC> would be sent
> > > along
> > > > > as
> > > > > > a
> > > > > > > unique identifier for the provided setup. It's essentially the
> > > > version
> > > > > ID
> > > > > > > of the currently deployed resource requirement configuration.
> > Flink
> > > > > > doesn't
> > > > > > > know whether the external system would use the provided
> > information
> > > > in
> > > > > > some
> > > > > > > way to derive a new set of resource requirements for this job.
> > The
> > > > > > > subsequent PATCH request with updated resource requirements
> would
> > > > > include
> > > > > > > the previously retrieved resource ID <ABC>. The PATCH call
> would
> > > fail
> > > > > if
> > > > > > > there was a concurrent PATCH call in between indicating to the
> > > > external
> > > > > > > system that the resource requirements were concurrently
> updated.
> > > > > > > - How often do we allow resource requirements to be changed?
> That
> > > > > > question
> > > > > > > might make my previous comment on the resource ID obsolete
> > because
> > > we
> > > > > > could
> > > > > > > just make any PATCH call fail if there was a resource
> requirement
> > > > > update
> > > > > > > within a certain time frame before the request. But such a time
> > > > period
> > > > > is
> > > > > > > something we might want to make configurable then, I guess.
> > > > > > > - Versioning the JobGraph in the JobGraphStore rather than
> > > > overwriting
> > > > > it
> > > > > > > might be an idea. This would enable us to provide resource
> > > > requirement
> > > > > > > changes in the UI or through the REST API. It is related to a
> > > problem
> > > > > > > around keeping track of the exception history within the
> > > > > > AdaptiveScheduler
> > > > > > > and also having to consider multiple versions of a JobGraph.
> But
> > > for
> > > > > that
> > > > > > > one, we use the ExecutionGraphInfoStore right now.
> > > > > > > - Updating the JobGraph in the JobGraphStore makes sense. I'm
> > just
> > > > > > > wondering whether we bundle two things together that are
> actually
> > > > > > separate:
> > > > > > > The business logic and the execution configuration (the
> resource
> > > > > > > requirements). I'm aware that this is not a flaw of the current
> > > FLIP
> > > > > but
> > > > > > > rather something that was not necessary to address in the past
> > > > because
> > > > > > the
> > > > > > > JobGraph was kind of static. I don't remember whether that was
> > > > already
> > > > > > > discussed while working on the AdaptiveScheduler for FLIP-160
> > [1].
> > > > > Maybe,
> > > > > > > I'm missing some functionality here that requires us to have
> > > > everything
> > > > > > in
> > > > > > > one place. But it feels like updating the entire JobGraph which
> > > could
> > > > > be
> > > > > > > actually a "config change" is not reasonable. ...also
> considering
> > > the
> > > > > > > amount of data that can be stored in a ConfigMap/ZooKeeper node
> > if
> > > > > > > versioning the resource requirement change as proposed in my
> > > previous
> > > > > > item
> > > > > > > is an option for us.
> > > > > > > - Updating the JobGraphStore means adding more requests to the
> HA
> > > > > backend
> > > > > > > API. There were some concerns shared in the discussion thread
> [2]
> > > for
> > > > > > > FLIP-270 [3] on pressuring the k8s API server in the past with
> > too
> > > > many
> > > > > > > calls. Eventhough, it's more likely to be caused by
> > checkpointing,
> > > I
> > > > > > still
> > > > > > > wanted to bring it up. We're working on a standardized
> > performance
> > > > test
> > > > > > to
> > > > > > > prepare going forward with FLIP-270 [3] right now.
> > > > > > >
> > > > > > > Best,
> > > > > > > Matthias
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> > > > > > > [2]
> > > https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> > > > > > > [3]
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> > > > > > >
> > > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <jam.gzczy@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > > Hi David:
> > > > > > > >
> > > > > > > > Thank you for drive this flip, which helps less flink
> shutdown
> > > time
> > > > > > > >
> > > > > > > > for this flip, I would like to make a few idea on share
> > > > > > > >
> > > > > > > >
> > > > > > > >    - when the number of "slots" is insufficient, can we can
> > stop
> > > > > users
> > > > > > > >    rescaling or throw something to tell user "less avaliable
> > > slots
> > > > to
> > > > > > > > upgrade,
> > > > > > > >    please checkout your alivalbe slots" ? Or we could have a
> > > > request
> > > > > > > >    switch(true/false) to allow this behavior
> > > > > > > >
> > > > > > > >
> > > > > > > >    - when user upgrade job-vertx-parallelism . I want to have
> > an
> > > > > > > interface
> > > > > > > >    to query the current update parallel execution status, so
> > that
> > > > the
> > > > > > > user
> > > > > > > > or
> > > > > > > >    program can understand the current status
> > > > > > > >    - I want to have an interface to query the current update
> > > > > > parallelism
> > > > > > > >    execution status. This also helps similar to *[1] Flink
> K8S
> > > > > > Operator*
> > > > > > > >    management
> > > > > > > >
> > > > > > > >
> > > > > > > > {
> > > > > > > >   status: Failed
> > > > > > > >   reason: "less avaliable slots to upgrade, please checkout
> > your
> > > > > > alivalbe
> > > > > > > > slots"
> > > > > > > > }
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >    - *Pending*: this job now is join the upgrade queue,it
> will
> > be
> > > > > > update
> > > > > > > >    later
> > > > > > > >    - *Rescaling*: job now is rescaling,wait it finish
> > > > > > > >    - *Finished*: finish do it
> > > > > > > >    - *Failed* : something have wrong,so this job is not
> > alivable
> > > > > > upgrade
> > > > > > > >
> > > > > > > > I want to supplement my above content in flip, what do you
> > think
> > > ?
> > > > > > > >
> > > > > > > >
> > > > > > > >    1.
> > > > > > > >
> > > > > > >
> > > > >
> > >
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> > > > > > > >
> > > > > > > >
> > > > > > > > David Morávek <dm...@apache.org> wrote on Fri, Feb 3, 2023, 16:42:
> > > > > > > >
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > This FLIP [1] introduces a new REST API for declaring
> > resource
> > > > > > > > requirements
> > > > > > > > > for the Adaptive Scheduler. There seems to be a clear need
> > for
> > > > this
> > > > > > API
> > > > > > > > > based on the discussion on the "Reworking the Rescale API"
> > [2]
> > > > > > thread.
> > > > > > > > >
> > > > > > > > > Before we get started, this work is heavily based on the
> > > > prototype
> > > > > > [3]
> > > > > > > > > created by Till Rohrmann, and the FLIP is being published
> > with
> > > > his
> > > > > > > > consent.
> > > > > > > > > Big shoutout to him!
> > > > > > > > >
> > > > > > > > > Last and not least, thanks to Chesnay and Roman for the
> > initial
> > > > > > reviews
> > > > > > > > and
> > > > > > > > > discussions.
> > > > > > > > >
> > > > > > > > > The best start would be watching a short demo [4] that I've
> > > > > recorded,
> > > > > > > > which
> > > > > > > > > illustrates newly added capabilities (rescaling the running
> > > job,
> > > > > > > handing
> > > > > > > > > back resources to the RM, and session cluster support).
> > > > > > > > >
> > > > > > > > > The intuition behind the FLIP is being able to define
> > resource
> > > > > > > > requirements
> > > > > > > > > ("resource boundaries") externally that the
> AdaptiveScheduler
> > > can
> > > > > > > > navigate
> > > > > > > > > within. This is a building block for higher-level efforts
> > such
> > > as
> > > > > an
> > > > > > > > > external Autoscaler. The natural extension of this work
> would
> > > be
> > > > to
> > > > > > > allow
> > > > > > > > > to specify per-vertex ResourceProfiles.
> > > > > > > > >
> > > > > > > > > Looking forward to your thoughts; any feedback is
> > appreciated!
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > > > > > > [2]
> > > > > https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > > > > > > > > [3] https://github.com/tillrohrmann/flink/tree/autoscaling
> > > > > > > > > [4]
> > > > > > > >
> > > > >
> > https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > D.
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best
> > > > > > > >
> > > > > > > > ConradJam
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by David Morávek <da...@gmail.com>.
Hi everyone,

@Shammon

I'm not entirely sure what "config file" you're referring to. You can, of
course, override the default parallelism in "flink-conf.yaml", but for
sinks and sources, the parallelism needs to be tweaked on the connector
level ("WITH" statement).

This is something that should be achieved with tooling around Flink. We
want to provide an API on the lowest level that generalizes well. Achieving
what you're describing should be straightforward with this API.
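
To make this concrete, a minimal sketch of such tooling, assuming a
user-managed config file with a global default plus per-vertex overrides.
The payload field names ("parallelism", "lowerBound", "upperBound") and the
vertex IDs are assumptions for illustration, not the final API:

```python
# Hypothetical tooling sketch (not part of Flink): expand a config file --
# a global default parallelism plus per-vertex overrides for sources and
# sinks -- into the full per-vertex body the declarative API expects.
def requirements_from_config(config, all_vertex_ids):
    """The body must cover every vertex in the JobGraph, since partial
    payloads are not allowed."""
    requirements = {}
    for vertex_id in all_vertex_ids:
        p = config.get("overrides", {}).get(vertex_id, config["default"])
        # Pinning lowerBound == upperBound asks for an exact parallelism.
        requirements[vertex_id] = {
            "parallelism": {"lowerBound": p, "upperBound": p}
        }
    return requirements

# Made-up vertex names for illustration.
config = {"default": 4, "overrides": {"source-vertex": 2}}
body = requirements_from_config(config, ["source-vertex", "map-vertex"])
```

A thin client could then push `body` to the cluster whenever the config
file changes, instead of restarting the job.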

@Xiangyu

Is it possible for this REST API to declare TM resources in the future?


Would you like to add/remove TMs if you use an active Resource Manager?
This would be out of scope for this effort since it targets the
scheduler component only (we make no assumptions about the Resource
Manager in use). Also, the AdaptiveScheduler is only intended to be used
for streaming.

> And for streaming jobs, I'm wondering if there is any situation we need to
> rescale the TM resources of a flink cluster at first and then the adaptive
> scheduler will rescale the per-vertex ResourceProfiles accordingly.
>

We plan on adding support for the ResourceProfiles (dynamic slot
allocation) as the next step. Again we won't make any assumptions about the
used Resource Manager. In other words, this effort ends by declaring
desired resources to the Resource Manager.

Does that make sense?

@Matthias

We've done another pass on the proposed API and currently lean towards
having an idempotent PUT API.
- We don't care too much about multiple-writer scenarios in terms of who
can write an authoritative payload; this is up to the user of the API to
figure out.
- It's indeed tricky to achieve atomicity with a PATCH API; switching to a
PUT API seems to do the trick.
- We won't allow partial "payloads" anymore, meaning you need to define
requirements for all vertices in the JobGraph. This is completely fine for
programmatic workflows. For DEBUG / DEMO purposes, you can use the GET
endpoint and tweak the response to avoid writing the whole payload by hand.
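
As a rough illustration of the idempotent-PUT semantics, here is a sketch
of a client request. The endpoint path, host, job/vertex IDs, and field
names are all assumptions for illustration; the actual API is defined in
the FLIP:

```python
import json
from urllib.request import Request

# Hypothetical endpoint path and host, for illustration only.
JOB_ID = "f091f0ecde5734c546eb764b1471a119"  # made-up job id
URL = f"http://jobmanager:8081/jobs/{JOB_ID}/resource-requirements"

# The PUT body declares lower/upper parallelism bounds for *every* vertex
# in the JobGraph (partial payloads are not allowed).
payload = {
    "bc764cd8ddf7a0cff126f51c16239658": {  # made-up vertex id
        "parallelism": {"lowerBound": 1, "upperBound": 4},
    },
    "0a448493b4782967b150582570326227": {  # made-up vertex id
        "parallelism": {"lowerBound": 2, "upperBound": 8},
    },
}

# Because it is a PUT of the complete requirements, retrying the same
# request is idempotent -- there is no read-modify-write race as with PATCH.
request = Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    method="PUT",
    headers={"Content-Type": "application/json"},
)
```

The request is only constructed here, not sent; a real client would pass it
to `urllib.request.urlopen` (or any HTTP library) against a running cluster.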

WDYT?


Best,
D.

On Fri, Feb 10, 2023 at 11:21 AM feng xiangyu <xi...@gmail.com> wrote:

> Hi David,
>
> Thanks for creating this FLIP. I think this work is very useful,
> especially in autoscaling scenarios. I would like to share some questions
> from my side.
>
> 1. Is it possible for this REST API to declare TM resources in the future?
> I'm asking because we are building an autoscaling feature for Flink OLAP
> session clusters at ByteDance. We need to rescale the cluster's resources
> at the TM level instead of the job level. It would be very helpful if we
> had a REST API for our external autoscaling service to use.
>
> 2. And for streaming jobs, I'm wondering if there is any situation where
> we need to rescale the TM resources of a Flink cluster first and then the
> adaptive scheduler will rescale the per-vertex ResourceProfiles
> accordingly.
>
> best.
> Xiangyu
>
> > Shammon FY <zj...@gmail.com> wrote on Thu, Feb 9, 2023, 11:31:
>
> > Hi David
> >
> > Thanks for your answer.
> >
> > > Can you elaborate more about how you'd intend to use the endpoint? I
> > think we can ultimately introduce a way of re-declaring "per-vertex
> > defaults," but I'd like to understand the use case bit more first.
> >
> > For this issue, I mainly consider the consistency of user configuration
> and
> > job runtime. For sql jobs, users usually set specific parallelism for
> > source and sink, and set a global parallelism for other operators. These
> > config items are stored in a config file. For some high-priority jobs,
> > users may want to manage them manually.
> > 1. When users need to scale the parallelism, they should update the
> config
> > file and restart flink job, which may take a long time.
> > 2. After providing the REST API, users can just send a request to the job
> > via REST API quickly after updating the config file.
> > The configuration in the running job and config file should be the same.
> > What do you think of this?
> >
> > best.
> > Shammon
> >
> >
> >
> > On Tue, Feb 7, 2023 at 4:51 PM David Morávek <da...@gmail.com>
> > wrote:
> >
> > > Hi everyone,
> > >
> > > Let's try to answer the questions one by one.
> > >
> > > *@ConradJam*
> > >
> > > when the number of "slots" is insufficient, can we can stop users
> > rescaling
> > > > or throw something to tell user "less avaliable slots to upgrade,
> > please
> > > > checkout your alivalbe slots" ?
> > > >
> > >
> > > The main property of AdaptiveScheduler is that it can adapt to
> "available
> > > resources," which means you're still able to make progress even though
> > you
> > > didn't get all the slots you've asked for. Let's break down the pros
> and
> > > cons of this property.
> > >
> > > - (plus) If you lose a TM for some reason, you can still recover even
> if
> > it
> > > doesn't come back. We still need to give it some time to eliminate
> > > unnecessary rescaling, which can be controlled by setting
> > > "resource-stabilization-timeout."
> > > - (plus) The resources can arrive with a significant delay. For
> example,
> > > you're unable to spawn enough TMs on time because you've run out of
> > > resources in your k8s cluster, and you need to wait for the cluster
> auto
> > > scaler to kick in and add new nodes to the cluster. In this scenario,
> > > you'll be able to start making progress faster, at the cost of multiple
> > > rescalings (once the remaining resources arrive).
> > > - (plus) This plays well with the declarative manner of today's
> > > infrastructure. For example, you tell k8s that you need 10 TMs, and
> > you'll
> > > eventually get them.
> > > - (minus) In the case of large state jobs, the cost of multiple
> > rescalings
> > > might outweigh the above.
> > >
> > > We've already touched on the solution to this problem on the FLIP.
> Please
> > > notice the parallelism knob being a range with a lower and upper bound.
> > > Setting both the lower and upper bound to the same value could give the
> > > behavior you're describing at the cost of giving up some properties
> that
> > AS
> > > gives you (you'd be falling back to the DefaultScheduler's behavior).
> > >
> > > when user upgrade job-vertx-parallelism . I want to have an interface
> to
> > > > query the current update parallel execution status, so that the user
> or
> > > > program can understand the current status
> > > >
> > >
> > > This is a misunderstanding. We're not introducing the RESCALE endpoint.
> > > This endpoint allows you to re-declare the resources needed to run the
> > job.
> > > Once you reach the desired resources (you get more resources than the
> > lower
> > > bound defines), your job will run.
> > >
> > > We can expose a similar endpoint to "resource requirements" to give you
> > an
> > > overview of the resources the vertices already have. You can already
> get
> > > this from the REST API, so exposing this in yet another way should be
> > > considered carefully.
> > >
> > > *@Matthias*
> > >
> > > I'm wondering whether it makes sense to add some kind of resource ID to
> > the
> > > > REST API.
> > >
> > >
> > > That's a good question. I want to think about that and get back to the
> > > question later. My main struggle when thinking about this is, "if this
> > > would be an idempotent POST endpoint," would it be any different?
> > >
> > > How often do we allow resource requirements to be changed?
> > >
> > >
> > > There shall be no rate limiting on the FLINK side. If this is something
> > > your environment needs, you can achieve it on a different layer ("we
> > can't
> > > have FLINK to do everything").
> > >
> > > Versioning the JobGraph in the JobGraphStore rather than overwriting it
> > > > might be an idea.
> > > >
> > >
> > > This sounds interesting since it would be closer to the JobGraph being
> > > immutable. The main problem I see here is that this would introduce a
> > > BW-incompatible change so it might be a topic for follow-up FLIP.
> > >
> > > I'm just wondering whether we bundle two things together that are
> > actually
> > > > separate
> > > >
> > >
> > > Yup, this is how we think about it as well. The main question is, "who
> > > should be responsible for bookkeeping 1) the JobGraph and 2) the
> > > JobResourceRequirements". The JobMaster would be the right place for
> > both,
> > > but it's currently not the case, and we're tightly coupling the
> > dispatcher
> > > with the JobMaster.
> > >
> > > Initially, we tried to introduce a separate HA component in JobMaster
> for
> > > bookkeeping the JobResourceRequirements, but that proved to be a more
> > > significant effort adding additional mess to the already messy HA
> > > ecosystem. Another approach we've discussed was mutating the JobGraph
> and
> > > setting JRR into the JobGraph structure itself.
> > >
> > > The middle ground for keeping this effort reasonably sized and not
> > > violating "we want to keep JG immutable" too much is keeping the
> > > JobResourceRequirements separate as an internal config option in
> > JobGraph's
> > > configuration.
> > >
> > > We ultimately need to rethink the tight coupling of Dispatcher and
> > > JobMaster, but it needs to be a separate effort.
> > >
> > > ...also considering the amount of data that can be stored in a
> > > > ConfigMap/ZooKeeper node if versioning the resource requirement
> change
> > as
> > > > proposed in my previous item is an option for us.
> > > >
> > >
> > > AFAIK we're only storing pointers to the S3 objects in HA metadata, so
> we
> > > should be okay with having larger structures for now.
> > >
> > > Updating the JobGraphStore means adding more requests to the HA backend
> > > API.
> > > >
> > >
> > > It's fine unless you intend to override the resource requirements a few
> > > times per second.
> > >
> > > *@Shammon*
> > >
> > > How about adding some more information such as vertex type
> > > >
> > >
> > > Since it was intended as a "debug" endpoint, it makes complete sense!
> > >
> > >  For sql jobs, we always use a unified parallelism for most vertices.
> Can
> > > > we provide them with a more convenient setting method instead of each
> > > one?
> > >
> > >
> > > I completely feel with this. The main thoughts when designing the API
> > were:
> > > - We want to keep it clean and easy to understand.
> > > - Global parallelism can be modeled using per-vertex parallelism but
> not
> > > the other way around.
> > > - The API will be used by external tooling (operator, auto scaler).
> > >
> > > Can you elaborate more about how you'd intend to use the endpoint? I
> > think
> > > we can ultimately introduce a way of re-declaring "per-vertex
> defaults,"
> > > but I'd like to understand the use case bit more first.
> > >
> > > *@Weijie*
> > >
> > > What is the default value here (based on what configuration), or just
> > > > infinite?
> > > >
> > >
> > > Currently, for the lower bound, it's always one, and for the upper
> bound,
> > > it's either parallelism (if defined) or the maxParallelism of the
> vertex
> > in
> > > JobGraph. This question might be another signal for making the defaults
> > > explicit (see the answer to Shammon's question above).
> > >
> > >
> > > Thanks, everyone, for your initial thoughts!
> > >
> > > Best,
> > > D.
> > >
> > > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <gu...@gmail.com>
> > > wrote:
> > >
> > > > Thanks David for driving this. This is a very valuable work,
> especially
> > > for
> > > > cloud native environment.
> > > >
> > > > >> How about adding some more information such as vertex type
> > > > (SOURCE/MAP/JOIN and .etc) in the response of `get jobs
> > > > resource-requirements`? For users, only vertex-id may be difficult to
> > > > understand.
> > > >
> > > > +1 for this suggestion, including jobvertex's name in the response
> body
> > > is
> > > > more
> > > > user-friendly.
> > > >
> > > >
> > > > I saw this sentence in FLIP: "Setting the upper bound to -1 will
> reset
> > > the
> > > > value to the default setting."  What is the default value here (based
> > on
> > > > what configuration), or just infinite?
> > > >
> > > >
> > > > Best regards,
> > > >
> > > > Weijie
> > > >
> > > >
> > > >
> > > > Shammon FY <zj...@gmail.com> wrote on Mon, Feb 6, 2023, 18:06:
> > > >
> > > > > Hi David
> > > > >
> > > > > Thanks for initiating this discussion. I think declaring job
> resource
> > > > > requirements by REST API is very valuable. I just left some
> comments
> > as
> > > > > followed
> > > > >
> > > > > 1) How about adding some more information such as vertex type
> > > > > (SOURCE/MAP/JOIN and .etc) in the response of `get jobs
> > > > > resource-requirements`? For users, only vertex-id may be difficult
> to
> > > > > understand.
> > > > >
> > > > > 2) For sql jobs, we always use a unified parallelism for most
> > vertices.
> > > > Can
> > > > > we provide them with a more convenient setting method instead of
> each
> > > > one?
> > > > >
> > > > >
> > > > > Best,
> > > > > Shammon
> > > > >
> > > > >
> > > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <
> matthias.pohl@aiven.io
> > > > > .invalid>
> > > > > wrote:
> > > > >
> > > > > > Thanks David for creating this FLIP. It sounds promising and
> useful
> > > to
> > > > > > have. Here are some thoughts from my side (some of them might be
> > > > rather a
> > > > > > follow-up and not necessarily part of this FLIP):
> > > > > > - I'm wondering whether it makes sense to add some kind of
> resource
> > > ID
> > > > to
> > > > > > the REST API. This would give Flink a tool to verify the PATCH
> > > request
> > > > of
> > > > > > the external system in a compare-and-set kind of manner. AFAIU,
> the
> > > > > process
> > > > > > requires the external system to retrieve the resource
> requirements
> > > > first
> > > > > > (to retrieve the vertex IDs). A resource ID <ABC> would be sent
> > along
> > > > as
> > > > > a
> > > > > > unique identifier for the provided setup. It's essentially the
> > > version
> > > > ID
> > > > > > of the currently deployed resource requirement configuration.
> Flink
> > > > > doesn't
> > > > > > know whether the external system would use the provided
> information
> > > in
> > > > > some
> > > > > > way to derive a new set of resource requirements for this job.
> The
> > > > > > subsequent PATCH request with updated resource requirements would
> > > > include
> > > > > > the previously retrieved resource ID <ABC>. The PATCH call would
> > fail
> > > > if
> > > > > > there was a concurrent PATCH call in between indicating to the
> > > external
> > > > > > system that the resource requirements were concurrently updated.
> > > > > > - How often do we allow resource requirements to be changed? That
> > > > > question
> > > > > > might make my previous comment on the resource ID obsolete
> because
> > we
> > > > > could
> > > > > > just make any PATCH call fail if there was a resource requirement
> > > > update
> > > > > > within a certain time frame before the request. But such a time
> > > period
> > > > is
> > > > > > something we might want to make configurable then, I guess.
> > > > > > - Versioning the JobGraph in the JobGraphStore rather than
> > > overwriting
> > > > it
> > > > > > might be an idea. This would enable us to provide resource
> > > requirement
> > > > > > changes in the UI or through the REST API. It is related to a
> > problem
> > > > > > around keeping track of the exception history within the
> > > > > AdaptiveScheduler
> > > > > > and also having to consider multiple versions of a JobGraph. But
> > for
> > > > that
> > > > > > one, we use the ExecutionGraphInfoStore right now.
> > > > > > - Updating the JobGraph in the JobGraphStore makes sense. I'm
> just
> > > > > > wondering whether we bundle two things together that are actually
> > > > > separate:
> > > > > > The business logic and the execution configuration (the resource
> > > > > > requirements). I'm aware that this is not a flaw of the current
> > FLIP
> > > > but
> > > > > > rather something that was not necessary to address in the past
> > > because
> > > > > the
> > > > > > JobGraph was kind of static. I don't remember whether that was
> > > already
> > > > > > discussed while working on the AdaptiveScheduler for FLIP-160
> [1].
> > > > Maybe,
> > > > > > I'm missing some functionality here that requires us to have
> > > everything
> > > > > in
> > > > > > one place. But it feels like updating the entire JobGraph which
> > could
> > > > be
> > > > > > actually a "config change" is not reasonable. ...also considering
> > the
> > > > > > amount of data that can be stored in a ConfigMap/ZooKeeper node
> if
> > > > > > versioning the resource requirement change as proposed in my
> > previous
> > > > > item
> > > > > > is an option for us.
> > > > > > - Updating the JobGraphStore means adding more requests to the HA
> > > > backend
> > > > > > API. There were some concerns shared in the discussion thread [2]
> > for
> > > > > > FLIP-270 [3] on pressuring the k8s API server in the past with
> too
> > > many
> > > > > > calls. Eventhough, it's more likely to be caused by
> checkpointing,
> > I
> > > > > still
> > > > > > wanted to bring it up. We're working on a standardized
> performance
> > > test
> > > > > to
> > > > > > prepare going forward with FLIP-270 [3] right now.
> > > > > >
> > > > > > Best,
> > > > > > Matthias
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> > > > > > [2]
> > https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> > > > > > [3]
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> > > > > >
> > > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <ja...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Hi David:
> > > > > > >
> > > > > > > Thank you for drive this flip, which helps less flink shutdown
> > time
> > > > > > >
> > > > > > > for this flip, I would like to make a few idea on share
> > > > > > >
> > > > > > >
> > > > > > >    - when the number of "slots" is insufficient, can we can
> stop
> > > > users
> > > > > > >    rescaling or throw something to tell user "less avaliable
> > slots
> > > to
> > > > > > > upgrade,
> > > > > > >    please checkout your alivalbe slots" ? Or we could have a
> > > request
> > > > > > >    switch(true/false) to allow this behavior
> > > > > > >
> > > > > > >
> > > > > > >    - when user upgrade job-vertx-parallelism . I want to have
> an
> > > > > > interface
> > > > > > >    to query the current update parallel execution status, so
> that
> > > the
> > > > > > user
> > > > > > > or
> > > > > > >    program can understand the current status
> > > > > > >    - I want to have an interface to query the current update
> > > > > parallelism
> > > > > > >    execution status. This also helps similar to *[1] Flink K8S
> > > > > Operator*
> > > > > > >    management
> > > > > > >
> > > > > > >
> > > > > > > {
> > > > > > >   status: Failed
> > > > > > >   reason: "less avaliable slots to upgrade, please checkout
> your
> > > > > alivalbe
> > > > > > > slots"
> > > > > > > }
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >    - *Pending*: this job now is join the upgrade queue,it will
> be
> > > > > update
> > > > > > >    later
> > > > > > >    - *Rescaling*: job now is rescaling,wait it finish
> > > > > > >    - *Finished*: finish do it
> > > > > > >    - *Failed* : something have wrong,so this job is not
> alivable
> > > > > upgrade
> > > > > > >
> > > > > > > I want to supplement my above content in flip, what do you
> think
> > ?
> > > > > > >
> > > > > > >
> > > > > > >    1.
> > > > > > >
> > > > > >
> > > >
> > https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> > > > > > >
> > > > > > >
> > > > > > > David Morávek <dm...@apache.org> wrote on Fri, Feb 3, 2023, 16:42:
> > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > This FLIP [1] introduces a new REST API for declaring
> resource
> > > > > > > requirements
> > > > > > > > for the Adaptive Scheduler. There seems to be a clear need
> for
> > > this
> > > > > API
> > > > > > > > based on the discussion on the "Reworking the Rescale API"
> [2]
> > > > > thread.
> > > > > > > >
> > > > > > > > Before we get started, this work is heavily based on the
> > > prototype
> > > > > [3]
> > > > > > > > created by Till Rohrmann, and the FLIP is being published
> with
> > > his
> > > > > > > consent.
> > > > > > > > Big shoutout to him!
> > > > > > > >
> > > > > > > > Last and not least, thanks to Chesnay and Roman for the
> initial
> > > > > reviews
> > > > > > > and
> > > > > > > > discussions.
> > > > > > > >
> > > > > > > > The best start would be watching a short demo [4] that I've
> > > > recorded,
> > > > > > > which
> > > > > > > > illustrates newly added capabilities (rescaling the running
> > job,
> > > > > > handing
> > > > > > > > back resources to the RM, and session cluster support).
> > > > > > > >
> > > > > > > > The intuition behind the FLIP is being able to define
> resource
> > > > > > > requirements
> > > > > > > > ("resource boundaries") externally that the AdaptiveScheduler
> > can
> > > > > > > navigate
> > > > > > > > within. This is a building block for higher-level efforts
> such
> > as
> > > > an
> > > > > > > > external Autoscaler. The natural extension of this work would
> > be
> > > to
> > > > > > allow
> > > > > > > > to specify per-vertex ResourceProfiles.
> > > > > > > >
> > > > > > > > Looking forward to your thoughts; any feedback is
> appreciated!
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > > > > > [2]
> > > > https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > > > > > > > [3] https://github.com/tillrohrmann/flink/tree/autoscaling
> > > > > > > > [4]
> > > > > > >
> > > >
> https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > D.
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best
> > > > > > >
> > > > > > > ConradJam
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by feng xiangyu <xi...@gmail.com>.
Hi David,

Thanks for creating this FLIP. I think this work is very useful,
especially in autoscaling scenarios. I would like to share some questions
from my point of view.

1. Is it possible for this REST API to declare TM resources in the future?
I'm asking because we are building an autoscaling feature for Flink OLAP
session clusters at ByteDance. We need to rescale the cluster's resources at
the TM level instead of the job level. It would be very helpful to have a
REST API for our external autoscaling service to use.
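For context, the kind of call our external service would make might look
roughly like the sketch below (the endpoint path, HTTP method, and payload
shape are my assumptions based on the FLIP draft, not a confirmed API):

```python
import json

# Hypothetical payload for the proposed per-job endpoint, e.g.
#   PATCH /jobs/<job-id>/resource-requirements
# The vertex id and field names below are illustrative assumptions.
def build_requirements(vertex_bounds):
    """Map each vertex id to a (lowerBound, upperBound) parallelism range."""
    return {
        vertex_id: {"parallelism": {"lowerBound": lo, "upperBound": hi}}
        for vertex_id, (lo, hi) in vertex_bounds.items()
    }

payload = build_requirements({"0a448493b4782967b150582570326227": (1, 8)})
print(json.dumps(payload, sort_keys=True))
```

A TM-level variant of the same declarative idea is what we would need for
the OLAP use case above.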

2. And for streaming jobs, I'm wondering whether there is any situation
where we would need to rescale the TM resources of a Flink cluster first and
then have the adaptive scheduler rescale the per-vertex ResourceProfiles
accordingly.

best.
Xiangyu

Shammon FY <zj...@gmail.com> wrote on Thu, Feb 9, 2023 at 11:31:

> Hi David
>
> Thanks for your answer.
>
> > Can you elaborate more about how you'd intend to use the endpoint? I
> think we can ultimately introduce a way of re-declaring "per-vertex
> defaults," but I'd like to understand the use case bit more first.
>
> For this issue, I mainly consider the consistency of user configuration and
> job runtime. For sql jobs, users usually set specific parallelism for
> source and sink, and set a global parallelism for other operators. These
> config items are stored in a config file. For some high-priority jobs,
> users may want to manage them manually.
> 1. When users need to scale the parallelism, they should update the config
> file and restart flink job, which may take a long time.
> 2. After providing the REST API, users can just send a request to the job
> via REST API quickly after updating the config file.
> The configuration in the running job and config file should be the same.
> What do you think of this?
>
> best.
> Shammon
>
>
>
> On Tue, Feb 7, 2023 at 4:51 PM David Morávek <da...@gmail.com>
> wrote:
>
> > Hi everyone,
> >
> > Let's try to answer the questions one by one.
> >
> > *@ConradJam*
> >
> > when the number of "slots" is insufficient, can we can stop users
> rescaling
> > > or throw something to tell user "less avaliable slots to upgrade,
> please
> > > checkout your alivalbe slots" ?
> > >
> >
> > The main property of AdaptiveScheduler is that it can adapt to "available
> > resources," which means you're still able to make progress even though
> you
> > didn't get all the slots you've asked for. Let's break down the pros and
> > cons of this property.
> >
> > - (plus) If you lose a TM for some reason, you can still recover even if
> it
> > doesn't come back. We still need to give it some time to eliminate
> > unnecessary rescaling, which can be controlled by setting
> > "resource-stabilization-timeout."
> > - (plus) The resources can arrive with a significant delay. For example,
> > you're unable to spawn enough TMs on time because you've run out of
> > resources in your k8s cluster, and you need to wait for the cluster auto
> > scaler to kick in and add new nodes to the cluster. In this scenario,
> > you'll be able to start making progress faster, at the cost of multiple
> > rescalings (once the remaining resources arrive).
> > - (plus) This plays well with the declarative manner of today's
> > infrastructure. For example, you tell k8s that you need 10 TMs, and
> you'll
> > eventually get them.
> > - (minus) In the case of large state jobs, the cost of multiple
> rescalings
> > might outweigh the above.
> >
> > We've already touched on the solution to this problem on the FLIP. Please
> > notice the parallelism knob being a range with a lower and upper bound.
> > Setting both the lower and upper bound to the same value could give the
> > behavior you're describing at the cost of giving up some properties that
> AS
> > gives you (you'd be falling back to the DefaultScheduler's behavior).
> >
> > when user upgrade job-vertx-parallelism . I want to have an interface to
> > > query the current update parallel execution status, so that the user or
> > > program can understand the current status
> > >
> >
> > This is a misunderstanding. We're not introducing the RESCALE endpoint.
> > This endpoint allows you to re-declare the resources needed to run the
> job.
> > Once you reach the desired resources (you get more resources than the
> lower
> > bound defines), your job will run.
> >
> > We can expose a similar endpoint to "resource requirements" to give you
> an
> > overview of the resources the vertices already have. You can already get
> > this from the REST API, so exposing this in yet another way should be
> > considered carefully.
> >
> > *@Matthias*
> >
> > I'm wondering whether it makes sense to add some kind of resource ID to
> the
> > > REST API.
> >
> >
> > That's a good question. I want to think about that and get back to the
> > question later. My main struggle when thinking about this is, "if this
> > would be an idempotent POST endpoint," would it be any different?
> >
> > How often do we allow resource requirements to be changed?
> >
> >
> > There shall be no rate limiting on the FLINK side. If this is something
> > your environment needs, you can achieve it on a different layer ("we
> can't
> > have FLINK to do everything").
> >
> > Versioning the JobGraph in the JobGraphStore rather than overwriting it
> > > might be an idea.
> > >
> >
> > This sounds interesting since it would be closer to the JobGraph being
> > immutable. The main problem I see here is that this would introduce a
> > BW-incompatible change so it might be a topic for follow-up FLIP.
> >
> > I'm just wondering whether we bundle two things together that are
> actually
> > > separate
> > >
> >
> > Yup, this is how we think about it as well. The main question is, "who
> > should be responsible for bookkeeping 1) the JobGraph and 2) the
> > JobResourceRequirements". The JobMaster would be the right place for
> both,
> > but it's currently not the case, and we're tightly coupling the
> dispatcher
> > with the JobMaster.
> >
> > Initially, we tried to introduce a separate HA component in JobMaster for
> > bookkeeping the JobResourceRequirements, but that proved to be a more
> > significant effort adding additional mess to the already messy HA
> > ecosystem. Another approach we've discussed was mutating the JobGraph and
> > setting JRR into the JobGraph structure itself.
> >
> > The middle ground for keeping this effort reasonably sized and not
> > violating "we want to keep JG immutable" too much is keeping the
> > JobResourceRequirements separate as an internal config option in
> JobGraph's
> > configuration.
> >
> > We ultimately need to rethink the tight coupling of Dispatcher and
> > JobMaster, but it needs to be a separate effort.
> >
> > ...also considering the amount of data that can be stored in a
> > > ConfigMap/ZooKeeper node if versioning the resource requirement change
> as
> > > proposed in my previous item is an option for us.
> > >
> >
> > AFAIK we're only storing pointers to the S3 objects in HA metadata, so we
> > should be okay with having larger structures for now.
> >
> > Updating the JobGraphStore means adding more requests to the HA backend
> > API.
> > >
> >
> > It's fine unless you intend to override the resource requirements a few
> > times per second.
> >
> > *@Shammon*
> >
> > How about adding some more information such as vertex type
> > >
> >
> > Since it was intended as a "debug" endpoint, it makes complete sense!
> >
> >  For sql jobs, we always use a unified parallelism for most vertices. Can
> > > we provide them with a more convenient setting method instead of each
> > one?
> >
> >
> > I completely feel with this. The main thoughts when designing the API
> were:
> > - We want to keep it clean and easy to understand.
> > - Global parallelism can be modeled using per-vertex parallelism but not
> > the other way around.
> > - The API will be used by external tooling (operator, auto scaler).
> >
> > Can you elaborate more about how you'd intend to use the endpoint? I
> think
> > we can ultimately introduce a way of re-declaring "per-vertex defaults,"
> > but I'd like to understand the use case bit more first.
> >
> > *@Weijie*
> >
> > What is the default value here (based on what configuration), or just
> > > infinite?
> > >
> >
> > Currently, for the lower bound, it's always one, and for the upper bound,
> > it's either parallelism (if defined) or the maxParallelism of the vertex
> in
> > JobGraph. This question might be another signal for making the defaults
> > explicit (see the answer to Shammon's question above).
> >
> >
> > Thanks, everyone, for your initial thoughts!
> >
> > Best,
> > D.
> >
> > On Tue, Feb 7, 2023 at 4:39 AM weijie guo <gu...@gmail.com>
> > wrote:
> >
> > > Thanks David for driving this. This is a very valuable work, especially
> > for
> > > cloud native environment.
> > >
> > > >> How about adding some more information such as vertex type
> > > (SOURCE/MAP/JOIN and .etc) in the response of `get jobs
> > > resource-requirements`? For users, only vertex-id may be difficult to
> > > understand.
> > >
> > > +1 for this suggestion, including jobvertex's name in the response body
> > is
> > > more
> > > user-friendly.
> > >
> > >
> > > I saw this sentence in FLIP: "Setting the upper bound to -1 will reset
> > the
> > > value to the default setting."  What is the default value here (based
> on
> > > what configuration), or just infinite?
> > >
> > >
> > > Best regards,
> > >
> > > Weijie
> > >
> > >
> > >
> > > Shammon FY <zj...@gmail.com> wrote on Mon, Feb 6, 2023 at 18:06:
> > >
> > > > Hi David
> > > >
> > > > Thanks for initiating this discussion. I think declaring job resource
> > > > requirements by REST API is very valuable. I just left some comments
> as
> > > > followed
> > > >
> > > > 1) How about adding some more information such as vertex type
> > > > (SOURCE/MAP/JOIN and .etc) in the response of `get jobs
> > > > resource-requirements`? For users, only vertex-id may be difficult to
> > > > understand.
> > > >
> > > > 2) For sql jobs, we always use a unified parallelism for most
> vertices.
> > > Can
> > > > we provide them with a more convenient setting method instead of each
> > > one?
> > > >
> > > >
> > > > Best,
> > > > Shammon
> > > >
> > > >
> > > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <matthias.pohl@aiven.io
> > > > .invalid>
> > > > wrote:
> > > >
> > > > > Thanks David for creating this FLIP. It sounds promising and useful
> > to
> > > > > have. Here are some thoughts from my side (some of them might be
> > > rather a
> > > > > follow-up and not necessarily part of this FLIP):
> > > > > - I'm wondering whether it makes sense to add some kind of resource
> > ID
> > > to
> > > > > the REST API. This would give Flink a tool to verify the PATCH
> > request
> > > of
> > > > > the external system in a compare-and-set kind of manner. AFAIU, the
> > > > process
> > > > > requires the external system to retrieve the resource requirements
> > > first
> > > > > (to retrieve the vertex IDs). A resource ID <ABC> would be sent
> along
> > > as
> > > > a
> > > > > unique identifier for the provided setup. It's essentially the
> > version
> > > ID
> > > > > of the currently deployed resource requirement configuration. Flink
> > > > doesn't
> > > > > know whether the external system would use the provided information
> > in
> > > > some
> > > > > way to derive a new set of resource requirements for this job. The
> > > > > subsequent PATCH request with updated resource requirements would
> > > include
> > > > > the previously retrieved resource ID <ABC>. The PATCH call would
> fail
> > > if
> > > > > there was a concurrent PATCH call in between indicating to the
> > external
> > > > > system that the resource requirements were concurrently updated.
> > > > > - How often do we allow resource requirements to be changed? That
> > > > question
> > > > > might make my previous comment on the resource ID obsolete because
> we
> > > > could
> > > > > just make any PATCH call fail if there was a resource requirement
> > > update
> > > > > within a certain time frame before the request. But such a time
> > period
> > > is
> > > > > something we might want to make configurable then, I guess.
> > > > > - Versioning the JobGraph in the JobGraphStore rather than
> > overwriting
> > > it
> > > > > might be an idea. This would enable us to provide resource
> > requirement
> > > > > changes in the UI or through the REST API. It is related to a
> problem
> > > > > around keeping track of the exception history within the
> > > > AdaptiveScheduler
> > > > > and also having to consider multiple versions of a JobGraph. But
> for
> > > that
> > > > > one, we use the ExecutionGraphInfoStore right now.
> > > > > - Updating the JobGraph in the JobGraphStore makes sense. I'm just
> > > > > wondering whether we bundle two things together that are actually
> > > > separate:
> > > > > The business logic and the execution configuration (the resource
> > > > > requirements). I'm aware that this is not a flaw of the current
> FLIP
> > > but
> > > > > rather something that was not necessary to address in the past
> > because
> > > > the
> > > > > JobGraph was kind of static. I don't remember whether that was
> > already
> > > > > discussed while working on the AdaptiveScheduler for FLIP-160 [1].
> > > Maybe,
> > > > > I'm missing some functionality here that requires us to have
> > everything
> > > > in
> > > > > one place. But it feels like updating the entire JobGraph which
> could
> > > be
> > > > > actually a "config change" is not reasonable. ...also considering
> the
> > > > > amount of data that can be stored in a ConfigMap/ZooKeeper node if
> > > > > versioning the resource requirement change as proposed in my
> previous
> > > > item
> > > > > is an option for us.
> > > > > - Updating the JobGraphStore means adding more requests to the HA
> > > backend
> > > > > API. There were some concerns shared in the discussion thread [2]
> for
> > > > > FLIP-270 [3] on pressuring the k8s API server in the past with too
> > many
> > > > > calls. Even though it's more likely to be caused by checkpointing,
> I
> > > > still
> > > > > wanted to bring it up. We're working on a standardized performance
> > test
> > > > to
> > > > > prepare going forward with FLIP-270 [3] right now.
> > > > >
> > > > > Best,
> > > > > Matthias
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> > > > > [2]
> https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> > > > > [3]
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> > > > >
> > > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <ja...@gmail.com>
> > wrote:
> > > > >
> > > > > > Hi David:
> > > > > >
> > > > > > Thank you for drive this flip, which helps less flink shutdown
> time
> > > > > >
> > > > > > for this flip, I would like to make a few idea on share
> > > > > >
> > > > > >
> > > > > >    - when the number of "slots" is insufficient, can we can stop
> > > users
> > > > > >    rescaling or throw something to tell user "less avaliable
> slots
> > to
> > > > > > upgrade,
> > > > > >    please checkout your alivalbe slots" ? Or we could have a
> > request
> > > > > >    switch(true/false) to allow this behavior
> > > > > >
> > > > > >
> > > > > >    - when user upgrade job-vertx-parallelism . I want to have an
> > > > > interface
> > > > > >    to query the current update parallel execution status, so that
> > the
> > > > > user
> > > > > > or
> > > > > >    program can understand the current status
> > > > > >    - I want to have an interface to query the current update
> > > > parallelism
> > > > > >    execution status. This also helps similar to *[1] Flink K8S
> > > > Operator*
> > > > > >    management
> > > > > >
> > > > > >
> > > > > > {
> > > > > >   status: Failed
> > > > > >   reason: "less avaliable slots to upgrade, please checkout your
> > > > alivalbe
> > > > > > slots"
> > > > > > }
> > > > > >
> > > > > >
> > > > > >
> > > > > >    - *Pending*: this job now is join the upgrade queue,it will be
> > > > update
> > > > > >    later
> > > > > >    - *Rescaling*: job now is rescaling,wait it finish
> > > > > >    - *Finished*: finish do it
> > > > > >    - *Failed* : something have wrong,so this job is not alivable
> > > > upgrade
> > > > > >
> > > > > > I want to supplement my above content in flip, what do you think
> ?
> > > > > >
> > > > > >
> > > > > >    1.
> > > > > >
> > > > >
> > >
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> > > > > >
> > > > > >
> > > > > > David Morávek <dm...@apache.org> wrote on Fri, Feb 3, 2023 at 16:42:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > This FLIP [1] introduces a new REST API for declaring resource
> > > > > > requirements
> > > > > > > for the Adaptive Scheduler. There seems to be a clear need for
> > this
> > > > API
> > > > > > > based on the discussion on the "Reworking the Rescale API" [2]
> > > > thread.
> > > > > > >
> > > > > > > Before we get started, this work is heavily based on the
> > prototype
> > > > [3]
> > > > > > > created by Till Rohrmann, and the FLIP is being published with
> > his
> > > > > > consent.
> > > > > > > Big shoutout to him!
> > > > > > >
> > > > > > > Last and not least, thanks to Chesnay and Roman for the initial
> > > > reviews
> > > > > > and
> > > > > > > discussions.
> > > > > > >
> > > > > > > The best start would be watching a short demo [4] that I've
> > > recorded,
> > > > > > which
> > > > > > > illustrates newly added capabilities (rescaling the running
> job,
> > > > > handing
> > > > > > > back resources to the RM, and session cluster support).
> > > > > > >
> > > > > > > The intuition behind the FLIP is being able to define resource
> > > > > > requirements
> > > > > > > ("resource boundaries") externally that the AdaptiveScheduler
> can
> > > > > > navigate
> > > > > > > within. This is a building block for higher-level efforts such
> as
> > > an
> > > > > > > external Autoscaler. The natural extension of this work would
> be
> > to
> > > > > allow
> > > > > > > to specify per-vertex ResourceProfiles.
> > > > > > >
> > > > > > > Looking forward to your thoughts; any feedback is appreciated!
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > > > > [2]
> > > https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > > > > > > [3] https://github.com/tillrohrmann/flink/tree/autoscaling
> > > > > > > [4]
> > > > > >
> > > https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
> > > > > > >
> > > > > > > Best,
> > > > > > > D.
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best
> > > > > >
> > > > > > ConradJam
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by Shammon FY <zj...@gmail.com>.
Hi David

Thanks for your answer.

> Can you elaborate more about how you'd intend to use the endpoint? I
think we can ultimately introduce a way of re-declaring "per-vertex
defaults," but I'd like to understand the use case bit more first.

For this issue, my main concern is the consistency between the user
configuration and the job runtime. For SQL jobs, users usually set a
specific parallelism for sources and sinks, and a global parallelism for the
other operators. These config items are stored in a config file. For some
high-priority jobs, users may want to manage them manually.
1. When users need to scale the parallelism, they currently have to update
the config file and restart the Flink job, which may take a long time.
2. With the proposed REST API, users can instead send a request to the job
right after updating the config file.
The configuration of the running job and the config file should stay
consistent. What do you think of this?
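To make the intended flow concrete, here is a rough sketch of step 2. The
endpoint path, HTTP method, and payload field names are assumptions for
illustration only; the actual API is whatever the FLIP ends up defining.

```python
import json
import urllib.request

def requirements_body(vertex_bounds):
    """Serialize vertex id -> (lower, upper) bounds into a request body."""
    return json.dumps({
        v: {"parallelism": {"lowerBound": lo, "upperBound": hi}}
        for v, (lo, hi) in vertex_bounds.items()
    }).encode("utf-8")

def sync_job_parallelism(jobmanager_url, job_id, vertex_bounds):
    # After the config file is updated, push the new per-vertex parallelism
    # to the running job instead of restarting it. Path and method assumed.
    req = urllib.request.Request(
        f"{jobmanager_url}/jobs/{job_id}/resource-requirements",
        data=requirements_body(vertex_bounds),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
    # The scheduler then rescales declaratively once resources allow.
    return urllib.request.urlopen(req)
```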

best.
Shammon



On Tue, Feb 7, 2023 at 4:51 PM David Morávek <da...@gmail.com>
wrote:

> Hi everyone,
>
> Let's try to answer the questions one by one.
>
> *@ConradJam*
>
> when the number of "slots" is insufficient, can we stop users from rescaling
> > or throw something to tell user "less avaliable slots to upgrade, please
> > checkout your alivalbe slots" ?
> >
>
> The main property of AdaptiveScheduler is that it can adapt to "available
> resources," which means you're still able to make progress even though you
> didn't get all the slots you've asked for. Let's break down the pros and
> cons of this property.
>
> - (plus) If you lose a TM for some reason, you can still recover even if it
> doesn't come back. We still need to give it some time to eliminate
> unnecessary rescaling, which can be controlled by setting
> "resource-stabilization-timeout."
> - (plus) The resources can arrive with a significant delay. For example,
> you're unable to spawn enough TMs on time because you've run out of
> resources in your k8s cluster, and you need to wait for the cluster auto
> scaler to kick in and add new nodes to the cluster. In this scenario,
> you'll be able to start making progress faster, at the cost of multiple
> rescalings (once the remaining resources arrive).
> - (plus) This plays well with the declarative manner of today's
> infrastructure. For example, you tell k8s that you need 10 TMs, and you'll
> eventually get them.
> - (minus) In the case of large state jobs, the cost of multiple rescalings
> might outweigh the above.
>
> We've already touched on the solution to this problem on the FLIP. Please
> notice the parallelism knob being a range with a lower and upper bound.
> Setting both the lower and upper bound to the same value could give the
> behavior you're describing at the cost of giving up some properties that AS
> gives you (you'd be falling back to the DefaultScheduler's behavior).
>
> when user upgrade job-vertx-parallelism . I want to have an interface to
> > query the current update parallel execution status, so that the user or
> > program can understand the current status
> >
>
> This is a misunderstanding. We're not introducing the RESCALE endpoint.
> This endpoint allows you to re-declare the resources needed to run the job.
> Once you reach the desired resources (you get more resources than the lower
> bound defines), your job will run.
>
> We can expose a similar endpoint to "resource requirements" to give you an
> overview of the resources the vertices already have. You can already get
> this from the REST API, so exposing this in yet another way should be
> considered carefully.
>
> *@Matthias*
>
> I'm wondering whether it makes sense to add some kind of resource ID to the
> > REST API.
>
>
> That's a good question. I want to think about that and get back to the
> question later. My main struggle when thinking about this is, "if this
> would be an idempotent POST endpoint," would it be any different?
>
> How often do we allow resource requirements to be changed?
>
>
> There shall be no rate limiting on the FLINK side. If this is something
> your environment needs, you can achieve it on a different layer ("we can't
> have FLINK to do everything").
>
> Versioning the JobGraph in the JobGraphStore rather than overwriting it
> > might be an idea.
> >
>
> This sounds interesting since it would be closer to the JobGraph being
> immutable. The main problem I see here is that this would introduce a
> BW-incompatible change so it might be a topic for follow-up FLIP.
>
> I'm just wondering whether we bundle two things together that are actually
> > separate
> >
>
> Yup, this is how we think about it as well. The main question is, "who
> should be responsible for bookkeeping 1) the JobGraph and 2) the
> JobResourceRequirements". The JobMaster would be the right place for both,
> but it's currently not the case, and we're tightly coupling the dispatcher
> with the JobMaster.
>
> Initially, we tried to introduce a separate HA component in JobMaster for
> bookkeeping the JobResourceRequirements, but that proved to be a more
> significant effort adding additional mess to the already messy HA
> ecosystem. Another approach we've discussed was mutating the JobGraph and
> setting JRR into the JobGraph structure itself.
>
> The middle ground for keeping this effort reasonably sized and not
> violating "we want to keep JG immutable" too much is keeping the
> JobResourceRequirements separate as an internal config option in JobGraph's
> configuration.
>
> We ultimately need to rethink the tight coupling of Dispatcher and
> JobMaster, but it needs to be a separate effort.
>
> ...also considering the amount of data that can be stored in a
> > ConfigMap/ZooKeeper node if versioning the resource requirement change as
> > proposed in my previous item is an option for us.
> >
>
> AFAIK we're only storing pointers to the S3 objects in HA metadata, so we
> should be okay with having larger structures for now.
>
> Updating the JobGraphStore means adding more requests to the HA backend
> API.
> >
>
> It's fine unless you intend to override the resource requirements a few
> times per second.
>
> *@Shammon*
>
> How about adding some more information such as vertex type
> >
>
> Since it was intended as a "debug" endpoint, it makes complete sense!
>
>  For sql jobs, we always use a unified parallelism for most vertices. Can
> > we provide them with a more convenient setting method instead of each
> one?
>
>
> I completely sympathize with this. The main thoughts when designing the API were:
> - We want to keep it clean and easy to understand.
> - Global parallelism can be modeled using per-vertex parallelism but not
> the other way around.
> - The API will be used by external tooling (operator, auto scaler).
>
> Can you elaborate more about how you'd intend to use the endpoint? I think
> we can ultimately introduce a way of re-declaring "per-vertex defaults,"
> but I'd like to understand the use case bit more first.
>
> *@Weijie*
>
> What is the default value here (based on what configuration), or just
> > infinite?
> >
>
> Currently, for the lower bound, it's always one, and for the upper bound,
> it's either parallelism (if defined) or the maxParallelism of the vertex in
> JobGraph. This question might be another signal for making the defaults
> explicit (see the answer to Shammon's question above).
>
>
> Thanks, everyone, for your initial thoughts!
>
> Best,
> D.
>
> On Tue, Feb 7, 2023 at 4:39 AM weijie guo <gu...@gmail.com>
> wrote:
>
> > Thanks David for driving this. This is a very valuable work, especially
> for
> > cloud native environment.
> >
> > >> How about adding some more information such as vertex type
> > (SOURCE/MAP/JOIN and .etc) in the response of `get jobs
> > resource-requirements`? For users, only vertex-id may be difficult to
> > understand.
> >
> > +1 for this suggestion, including jobvertex's name in the response body
> is
> > more
> > user-friendly.
> >
> >
> > I saw this sentence in FLIP: "Setting the upper bound to -1 will reset
> the
> > value to the default setting."  What is the default value here (based on
> > what configuration), or just infinite?
> >
> >
> > Best regards,
> >
> > Weijie
> >
> >
> >
> > Shammon FY <zj...@gmail.com> wrote on Mon, Feb 6, 2023 at 18:06:
> >
> > > Hi David
> > >
> > > Thanks for initiating this discussion. I think declaring job resource
> > > requirements by REST API is very valuable. I just left some comments as
> > > followed
> > >
> > > 1) How about adding some more information such as vertex type
> > > (SOURCE/MAP/JOIN and .etc) in the response of `get jobs
> > > resource-requirements`? For users, only vertex-id may be difficult to
> > > understand.
> > >
> > > 2) For sql jobs, we always use a unified parallelism for most vertices.
> > Can
> > > we provide them with a more convenient setting method instead of each
> > one?
> > >
> > >
> > > Best,
> > > Shammon
> > >
> > >
> > > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <matthias.pohl@aiven.io
> > > .invalid>
> > > wrote:
> > >
> > > > Thanks David for creating this FLIP. It sounds promising and useful
> to
> > > > have. Here are some thoughts from my side (some of them might be
> > rather a
> > > > follow-up and not necessarily part of this FLIP):
> > > > - I'm wondering whether it makes sense to add some kind of resource
> ID
> > to
> > > > the REST API. This would give Flink a tool to verify the PATCH
> request
> > of
> > > > the external system in a compare-and-set kind of manner. AFAIU, the
> > > process
> > > > requires the external system to retrieve the resource requirements
> > first
> > > > (to retrieve the vertex IDs). A resource ID <ABC> would be sent along
> > as
> > > a
> > > > unique identifier for the provided setup. It's essentially the
> version
> > ID
> > > > of the currently deployed resource requirement configuration. Flink
> > > doesn't
> > > > know whether the external system would use the provided information
> in
> > > some
> > > > way to derive a new set of resource requirements for this job. The
> > > > subsequent PATCH request with updated resource requirements would
> > include
> > > > the previously retrieved resource ID <ABC>. The PATCH call would fail
> > if
> > > > there was a concurrent PATCH call in between indicating to the
> external
> > > > system that the resource requirements were concurrently updated.
> > > > - How often do we allow resource requirements to be changed? That
> > > question
> > > > might make my previous comment on the resource ID obsolete because we
> > > could
> > > > just make any PATCH call fail if there was a resource requirement
> > update
> > > > within a certain time frame before the request. But such a time
> period
> > is
> > > > something we might want to make configurable then, I guess.
> > > > - Versioning the JobGraph in the JobGraphStore rather than
> overwriting
> > it
> > > > might be an idea. This would enable us to provide resource
> requirement
> > > > changes in the UI or through the REST API. It is related to a problem
> > > > around keeping track of the exception history within the
> > > AdaptiveScheduler
> > > > and also having to consider multiple versions of a JobGraph. But for
> > that
> > > > one, we use the ExecutionGraphInfoStore right now.
> > > > - Updating the JobGraph in the JobGraphStore makes sense. I'm just
> > > > wondering whether we bundle two things together that are actually
> > > separate:
> > > > The business logic and the execution configuration (the resource
> > > > requirements). I'm aware that this is not a flaw of the current FLIP
> > but
> > > > rather something that was not necessary to address in the past
> because
> > > the
> > > > JobGraph was kind of static. I don't remember whether that was
> already
> > > > discussed while working on the AdaptiveScheduler for FLIP-160 [1].
> > Maybe,
> > > > I'm missing some functionality here that requires us to have
> everything
> > > in
> > > > one place. But it feels like updating the entire JobGraph which could
> > be
> > > > actually a "config change" is not reasonable. ...also considering the
> > > > amount of data that can be stored in a ConfigMap/ZooKeeper node if
> > > > versioning the resource requirement change as proposed in my previous
> > > item
> > > > is an option for us.
> > > > - Updating the JobGraphStore means adding more requests to the HA
> > backend
> > > > API. There were some concerns shared in the discussion thread [2] for
> > > > FLIP-270 [3] on pressuring the k8s API server in the past with too
> many
> > > > calls. Even though it's more likely to be caused by checkpointing, I
> > > still
> > > > wanted to bring it up. We're working on a standardized performance
> test
> > > to
> > > > prepare going forward with FLIP-270 [3] right now.
> > > >
> > > > Best,
> > > > Matthias
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> > > > [2] https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> > > > [3]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> > > >
> > > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <ja...@gmail.com>
> wrote:
> > > >
> > > > > Hi David:
> > > > >
> > > > > Thank you for drive this flip, which helps less flink shutdown time
> > > > >
> > > > > for this flip, I would like to make a few idea on share
> > > > >
> > > > >
> > > > >    - when the number of "slots" is insufficient, can we can stop
> > users
> > > > >    rescaling or throw something to tell user "less avaliable slots
> to
> > > > > upgrade,
> > > > >    please checkout your alivalbe slots" ? Or we could have a
> request
> > > > >    switch(true/false) to allow this behavior
> > > > >
> > > > >
> > > > >    - when user upgrade job-vertx-parallelism . I want to have an
> > > > interface
> > > > >    to query the current update parallel execution status, so that
> the
> > > > user
> > > > > or
> > > > >    program can understand the current status
> > > > >    - I want to have an interface to query the current update
> > > parallelism
> > > > >    execution status. This also helps similar to *[1] Flink K8S
> > > Operator*
> > > > >    management
> > > > >
> > > > >
> > > > > {
> > > > >   status: Failed
> > > > >   reason: "less avaliable slots to upgrade, please checkout your
> > > alivalbe
> > > > > slots"
> > > > > }
> > > > >
> > > > >
> > > > >
> > > > >    - *Pending*: this job now is join the upgrade queue,it will be
> > > update
> > > > >    later
> > > > >    - *Rescaling*: job now is rescaling,wait it finish
> > > > >    - *Finished*: finish do it
> > > > >    - *Failed* : something have wrong,so this job is not alivable
> > > upgrade
> > > > >
> > > > > I want to supplement my above content in flip, what do you think ?
> > > > >
> > > > >
> > > > >    1.
> > > > >
> > > >
> > https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> > > > >
> > > > >
> > > > > David Morávek <dm...@apache.org> 于2023年2月3日周五 16:42写道:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > This FLIP [1] introduces a new REST API for declaring resource
> > > > > requirements
> > > > > > for the Adaptive Scheduler. There seems to be a clear need for
> this
> > > API
> > > > > > based on the discussion on the "Reworking the Rescale API" [2]
> > > thread.
> > > > > >
> > > > > > Before we get started, this work is heavily based on the
> prototype
> > > [3]
> > > > > > created by Till Rohrmann, and the FLIP is being published with
> his
> > > > > consent.
> > > > > > Big shoutout to him!
> > > > > >
> > > > > > Last and not least, thanks to Chesnay and Roman for the initial
> > > reviews
> > > > > and
> > > > > > discussions.
> > > > > >
> > > > > > The best start would be watching a short demo [4] that I've
> > recorded,
> > > > > which
> > > > > > illustrates newly added capabilities (rescaling the running job,
> > > > handing
> > > > > > back resources to the RM, and session cluster support).
> > > > > >
> > > > > > The intuition behind the FLIP is being able to define resource
> > > > > requirements
> > > > > > ("resource boundaries") externally that the AdaptiveScheduler can
> > > > > navigate
> > > > > > within. This is a building block for higher-level efforts such as
> > an
> > > > > > external Autoscaler. The natural extension of this work would be
> to
> > > > allow
> > > > > > to specify per-vertex ResourceProfiles.
> > > > > >
> > > > > > Looking forward to your thoughts; any feedback is appreciated!
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > > > [2]
> > https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > > > > > [3] https://github.com/tillrohrmann/flink/tree/autoscaling
> > > > > > [4]
> > > > >
> > https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
> > > > > >
> > > > > > Best,
> > > > > > D.
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best
> > > > >
> > > > > ConradJam
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by David Morávek <da...@gmail.com>.
Hi everyone,

Let's try to answer the questions one by one.

*@ConradJam*

when the number of "slots" is insufficient, can we can stop users rescaling
> or throw something to tell user "less avaliable slots to upgrade, please
> checkout your alivalbe slots" ?
>

The main property of AdaptiveScheduler is that it can adapt to "available
resources," which means you're still able to make progress even though you
didn't get all the slots you've asked for. Let's break down the pros and
cons of this property.

- (plus) If you lose a TM for some reason, you can still recover even if it
doesn't come back. We still need to give it some time to eliminate
unnecessary rescaling, which can be controlled by setting
"resource-stabilization-timeout."
- (plus) The resources can arrive with a significant delay. For example,
you're unable to spawn enough TMs on time because you've run out of
resources in your k8s cluster, and you need to wait for the cluster auto
scaler to kick in and add new nodes to the cluster. In this scenario,
you'll be able to start making progress faster, at the cost of multiple
rescalings (once the remaining resources arrive).
- (plus) This plays well with the declarative manner of today's
infrastructure. For example, you tell k8s that you need 10 TMs, and you'll
eventually get them.
- (minus) In the case of large-state jobs, the cost of multiple rescalings
might outweigh the benefits above.

We've already touched on the solution to this problem on the FLIP. Please
notice the parallelism knob being a range with a lower and upper bound.
Setting both the lower and upper bound to the same value could give the
behavior you're describing at the cost of giving up some properties that AS
gives you (you'd be falling back to the DefaultScheduler's behavior).
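
To make this concrete, here is a minimal sketch of what pinning a vertex to a
fixed parallelism could look like from an external client. The JSON field
names and the vertex ID are assumptions for illustration, not the final FLIP
API:

```python
import json

def parallelism_requirements(vertex_bounds):
    # Build a hypothetical PATCH body for the resource-requirements
    # endpoint; vertex_bounds maps vertex-id -> (lower, upper).
    return {
        vertex_id: {"parallelism": {"lowerBound": lo, "upperBound": up}}
        for vertex_id, (lo, up) in vertex_bounds.items()
    }

# Setting lower == upper pins the vertex to exactly 4 slots, trading away
# the adaptive behavior described above for that vertex.
body = parallelism_requirements({"bc764cd8ddf7a0cff126f51c16239658": (4, 4)})
print(json.dumps(body, indent=2, sort_keys=True))
```

Keeping the range open instead (e.g. `(1, 4)`) lets the scheduler start with
whatever slots arrive first and grow later.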

when user upgrade job-vertx-parallelism . I want to have an interface to
> query the current update parallel execution status, so that the user or
> program can understand the current status
>

This is a misunderstanding. We're not introducing the RESCALE endpoint.
This endpoint allows you to re-declare the resources needed to run the job.
Once you reach the desired resources (you get more resources than the lower
bound defines), your job will run.

We can expose a similar endpoint to "resource requirements" to give you an
overview of the resources the vertices already have. You can already get
this from the REST API, so exposing this in yet another way should be
considered carefully.

*@Matthias*

I'm wondering whether it makes sense to add some kind of resource ID to the
> REST API.


That's a good question. I want to think about that and get back to the
question later. My main struggle when thinking about this is, "if this
would be an idempotent POST endpoint," would it be any different?
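
For context, the compare-and-set semantics Matthias proposes could be modeled
like this; the class, error shape, and token handling are purely illustrative
and not part of the FLIP:

```python
import uuid

class RequirementsStore:
    # Toy in-memory model of compare-and-set resource requirements;
    # names and structure are illustrative, not Flink API.
    def __init__(self, requirements):
        self.requirements = requirements
        self.resource_id = str(uuid.uuid4())  # version of the current config

    def get(self):
        # GET returns the requirements together with their version token.
        return self.resource_id, self.requirements

    def patch(self, expected_resource_id, new_requirements):
        # PATCH fails if another update happened since the client's GET.
        if expected_resource_id != self.resource_id:
            raise RuntimeError("409 Conflict: requirements changed concurrently")
        self.requirements = new_requirements
        self.resource_id = str(uuid.uuid4())
        return self.resource_id

store = RequirementsStore({"v1": {"lowerBound": 1, "upperBound": 4}})
rid, current = store.get()
store.patch(rid, {"v1": {"lowerBound": 1, "upperBound": 8}})
```

An idempotent POST of the full desired state would behave differently: replays
are harmless, but a stale writer still silently wins, which is exactly what
the version token prevents.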

How often do we allow resource requirements to be changed?


There shall be no rate limiting on the Flink side. If this is something
your environment needs, you can achieve it on a different layer ("we can't
have Flink do everything").

Versioning the JobGraph in the JobGraphStore rather than overwriting it
> might be an idea.
>

This sounds interesting since it would be closer to the JobGraph being
immutable. The main problem I see here is that this would introduce a
backward-incompatible change, so it might be a topic for a follow-up FLIP.

I'm just wondering whether we bundle two things together that are actually
> separate
>

Yup, this is how we think about it as well. The main question is, "who
should be responsible for bookkeeping 1) the JobGraph and 2) the
JobResourceRequirements". The JobMaster would be the right place for both,
but it's currently not the case, and we're tightly coupling the dispatcher
with the JobMaster.

Initially, we tried to introduce a separate HA component in JobMaster for
bookkeeping the JobResourceRequirements, but that proved to be a more
significant effort adding additional mess to the already messy HA
ecosystem. Another approach we've discussed was mutating the JobGraph and
setting JRR into the JobGraph structure itself.

The middle ground for keeping this effort reasonably sized and not
violating "we want to keep JG immutable" too much is keeping the
JobResourceRequirements separate as an internal config option in JobGraph's
configuration.

We ultimately need to rethink the tight coupling of Dispatcher and
JobMaster, but it needs to be a separate effort.

...also considering the amount of data that can be stored in a
> ConfigMap/ZooKeeper node if versioning the resource requirement change as
> proposed in my previous item is an option for us.
>

AFAIK we're only storing pointers to the S3 objects in HA metadata, so we
should be okay with having larger structures for now.

Updating the JobGraphStore means adding more requests to the HA backend API.
>

It's fine unless you intend to override the resource requirements a few
times per second.
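
If a caller does want to be gentle to the HA backend, a client-side debounce
is straightforward; this sketch (all names hypothetical) coalesces rapid
re-declarations and forwards at most one per interval:

```python
import time

class DebouncedDeclarer:
    # Forward re-declarations at most once per min_interval seconds and
    # coalesce anything arriving in between (illustrative only).
    def __init__(self, send, min_interval=5.0, clock=time.monotonic):
        self.send = send              # e.g. a function issuing the PATCH call
        self.min_interval = min_interval
        self.clock = clock
        self.last_sent = float("-inf")
        self.pending = None

    def declare(self, requirements):
        if self.clock() - self.last_sent >= self.min_interval:
            self.send(requirements)
            self.last_sent = self.clock()
            self.pending = None
        else:
            self.pending = requirements   # keep only the latest declaration

    def flush(self):
        # Call periodically to push a coalesced update once allowed.
        if self.pending is not None and \
                self.clock() - self.last_sent >= self.min_interval:
            self.declare(self.pending)
```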

*@Shammon*

How about adding some more information such as vertex type
>

Since it was intended as a "debug" endpoint, it makes complete sense!

 For sql jobs, we always use a unified parallelism for most vertices. Can
> we provide them with a more convenient setting method instead of each one?


I completely feel with this. The main thoughts when designing the API were:
- We want to keep it clean and easy to understand.
- Global parallelism can be modeled using per-vertex parallelism but not
the other way around.
- The API will be used by external tooling (operator, auto scaler).

Can you elaborate more about how you'd intend to use the endpoint? I think
we can ultimately introduce a way of re-declaring "per-vertex defaults,"
but I'd like to understand the use case a bit more first.
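
Until such a default exists, the fan-out could also live entirely on the
client side: take the map returned by the GET endpoint and apply one global
value to every vertex. A sketch with assumed field names:

```python
def apply_global_parallelism(current_requirements, parallelism):
    # Fan a single parallelism out to every vertex, capped at each
    # vertex's maxParallelism (field names are assumptions).
    return {
        vertex_id: {
            "parallelism": {
                "lowerBound": 1,
                "upperBound": min(parallelism,
                                  spec.get("maxParallelism", parallelism)),
            }
        }
        for vertex_id, spec in current_requirements.items()
    }
```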

*@Weijie*

What is the default value here (based on what configuration), or just
> infinite?
>

Currently, for the lower bound, it's always one, and for the upper bound,
it's either parallelism (if defined) or the maxParallelism of the vertex in
JobGraph. This question might be another signal for making the defaults
explicit (see the answer to Shammon's question above).
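
Written down, the defaulting rule described here amounts to the following
(with `None` modeling an undefined parallelism):

```python
def default_bounds(parallelism, max_parallelism):
    # Lower bound is always 1; the upper bound is the explicit vertex
    # parallelism if set, otherwise the vertex's maxParallelism.
    upper = parallelism if parallelism is not None else max_parallelism
    return 1, upper

assert default_bounds(4, 128) == (1, 4)       # parallelism defined
assert default_bounds(None, 128) == (1, 128)  # falls back to maxParallelism
```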


Thanks, everyone, for your initial thoughts!

Best,
D.

On Tue, Feb 7, 2023 at 4:39 AM weijie guo <gu...@gmail.com> wrote:

> Thanks David for driving this. This is a very valuable work, especially for
> cloud native environment.
>
> >> How about adding some more information such as vertex type
> (SOURCE/MAP/JOIN and .etc) in the response of `get jobs
> resource-requirements`? For users, only vertex-id may be difficult to
> understand.
>
> +1 for this suggestion, including jobvertex's name in the response body is
> more
> user-friendly.
>
>
> I saw this sentence in FLIP: "Setting the upper bound to -1 will reset the
> value to the default setting."  What is the default value here (based on
> what configuration), or just infinite?
>
>
> Best regards,
>
> Weijie
>
>
>
> Shammon FY <zj...@gmail.com> 于2023年2月6日周一 18:06写道:
>
> > Hi David
> >
> > Thanks for initiating this discussion. I think declaring job resource
> > requirements by REST API is very valuable. I just left some comments as
> > followed
> >
> > 1) How about adding some more information such as vertex type
> > (SOURCE/MAP/JOIN and .etc) in the response of `get jobs
> > resource-requirements`? For users, only vertex-id may be difficult to
> > understand.
> >
> > 2) For sql jobs, we always use a unified parallelism for most vertices.
> Can
> > we provide them with a more convenient setting method instead of each
> one?
> >
> >
> > Best,
> > Shammon
> >
> >
> > On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <matthias.pohl@aiven.io
> > .invalid>
> > wrote:
> >
> > > Thanks David for creating this FLIP. It sounds promising and useful to
> > > have. Here are some thoughts from my side (some of them might be
> rather a
> > > follow-up and not necessarily part of this FLIP):
> > > - I'm wondering whether it makes sense to add some kind of resource ID
> to
> > > the REST API. This would give Flink a tool to verify the PATCH request
> of
> > > the external system in a compare-and-set kind of manner. AFAIU, the
> > process
> > > requires the external system to retrieve the resource requirements
> first
> > > (to retrieve the vertex IDs). A resource ID <ABC> would be sent along
> as
> > a
> > > unique identifier for the provided setup. It's essentially the version
> ID
> > > of the currently deployed resource requirement configuration. Flink
> > doesn't
> > > know whether the external system would use the provided information in
> > some
> > > way to derive a new set of resource requirements for this job. The
> > > subsequent PATCH request with updated resource requirements would
> include
> > > the previously retrieved resource ID <ABC>. The PATCH call would fail
> if
> > > there was a concurrent PATCH call in between indicating to the external
> > > system that the resource requirements were concurrently updated.
> > > - How often do we allow resource requirements to be changed? That
> > question
> > > might make my previous comment on the resource ID obsolete because we
> > could
> > > just make any PATCH call fail if there was a resource requirement
> update
> > > within a certain time frame before the request. But such a time period
> is
> > > something we might want to make configurable then, I guess.
> > > - Versioning the JobGraph in the JobGraphStore rather than overwriting
> it
> > > might be an idea. This would enable us to provide resource requirement
> > > changes in the UI or through the REST API. It is related to a problem
> > > around keeping track of the exception history within the
> > AdaptiveScheduler
> > > and also having to consider multiple versions of a JobGraph. But for
> that
> > > one, we use the ExecutionGraphInfoStore right now.
> > > - Updating the JobGraph in the JobGraphStore makes sense. I'm just
> > > wondering whether we bundle two things together that are actually
> > separate:
> > > The business logic and the execution configuration (the resource
> > > requirements). I'm aware that this is not a flaw of the current FLIP
> but
> > > rather something that was not necessary to address in the past because
> > the
> > > JobGraph was kind of static. I don't remember whether that was already
> > > discussed while working on the AdaptiveScheduler for FLIP-160 [1].
> Maybe,
> > > I'm missing some functionality here that requires us to have everything
> > in
> > > one place. But it feels like updating the entire JobGraph which could
> be
> > > actually a "config change" is not reasonable. ...also considering the
> > > amount of data that can be stored in a ConfigMap/ZooKeeper node if
> > > versioning the resource requirement change as proposed in my previous
> > item
> > > is an option for us.
> > > - Updating the JobGraphStore means adding more requests to the HA
> backend
> > > API. There were some concerns shared in the discussion thread [2] for
> > > FLIP-270 [3] on pressuring the k8s API server in the past with too many
> > > calls. Eventhough, it's more likely to be caused by checkpointing, I
> > still
> > > wanted to bring it up. We're working on a standardized performance test
> > to
> > > prepare going forward with FLIP-270 [3] right now.
> > >
> > > Best,
> > > Matthias
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> > > [2] https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> > > [3]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> > >
> > > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <ja...@gmail.com> wrote:
> > >
> > > > Hi David:
> > > >
> > > > Thank you for drive this flip, which helps less flink shutdown time
> > > >
> > > > for this flip, I would like to make a few idea on share
> > > >
> > > >
> > > >    - when the number of "slots" is insufficient, can we can stop
> users
> > > >    rescaling or throw something to tell user "less avaliable slots to
> > > > upgrade,
> > > >    please checkout your alivalbe slots" ? Or we could have a request
> > > >    switch(true/false) to allow this behavior
> > > >
> > > >
> > > >    - when user upgrade job-vertx-parallelism . I want to have an
> > > interface
> > > >    to query the current update parallel execution status, so that the
> > > user
> > > > or
> > > >    program can understand the current status
> > > >    - I want to have an interface to query the current update
> > parallelism
> > > >    execution status. This also helps similar to *[1] Flink K8S
> > Operator*
> > > >    management
> > > >
> > > >
> > > > {
> > > >   status: Failed
> > > >   reason: "less avaliable slots to upgrade, please checkout your
> > alivalbe
> > > > slots"
> > > > }
> > > >
> > > >
> > > >
> > > >    - *Pending*: this job now is join the upgrade queue,it will be
> > update
> > > >    later
> > > >    - *Rescaling*: job now is rescaling,wait it finish
> > > >    - *Finished*: finish do it
> > > >    - *Failed* : something have wrong,so this job is not alivable
> > upgrade
> > > >
> > > > I want to supplement my above content in flip, what do you think ?
> > > >
> > > >
> > > >    1.
> > > >
> > >
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> > > >
> > > >
> > > > David Morávek <dm...@apache.org> 于2023年2月3日周五 16:42写道:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > This FLIP [1] introduces a new REST API for declaring resource
> > > > requirements
> > > > > for the Adaptive Scheduler. There seems to be a clear need for this
> > API
> > > > > based on the discussion on the "Reworking the Rescale API" [2]
> > thread.
> > > > >
> > > > > Before we get started, this work is heavily based on the prototype
> > [3]
> > > > > created by Till Rohrmann, and the FLIP is being published with his
> > > > consent.
> > > > > Big shoutout to him!
> > > > >
> > > > > Last and not least, thanks to Chesnay and Roman for the initial
> > reviews
> > > > and
> > > > > discussions.
> > > > >
> > > > > The best start would be watching a short demo [4] that I've
> recorded,
> > > > which
> > > > > illustrates newly added capabilities (rescaling the running job,
> > > handing
> > > > > back resources to the RM, and session cluster support).
> > > > >
> > > > > The intuition behind the FLIP is being able to define resource
> > > > requirements
> > > > > ("resource boundaries") externally that the AdaptiveScheduler can
> > > > navigate
> > > > > within. This is a building block for higher-level efforts such as
> an
> > > > > external Autoscaler. The natural extension of this work would be to
> > > allow
> > > > > to specify per-vertex ResourceProfiles.
> > > > >
> > > > > Looking forward to your thoughts; any feedback is appreciated!
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > > [2]
> https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > > > > [3] https://github.com/tillrohrmann/flink/tree/autoscaling
> > > > > [4]
> > > >
> https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
> > > > >
> > > > > Best,
> > > > > D.
> > > > >
> > > >
> > > >
> > > > --
> > > > Best
> > > >
> > > > ConradJam
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by weijie guo <gu...@gmail.com>.
Thanks, David, for driving this. This is very valuable work, especially for
cloud-native environments.

>> How about adding some more information such as vertex type
(SOURCE/MAP/JOIN and .etc) in the response of `get jobs
resource-requirements`? For users, only vertex-id may be difficult to
understand.

+1 for this suggestion; including the JobVertex's name in the response body is
more user-friendly.


I saw this sentence in FLIP: "Setting the upper bound to -1 will reset the
value to the default setting."  What is the default value here (based on
what configuration), or just infinite?


Best regards,

Weijie



Shammon FY <zj...@gmail.com> 于2023年2月6日周一 18:06写道:

> Hi David
>
> Thanks for initiating this discussion. I think declaring job resource
> requirements by REST API is very valuable. I just left some comments as
> followed
>
> 1) How about adding some more information such as vertex type
> (SOURCE/MAP/JOIN and .etc) in the response of `get jobs
> resource-requirements`? For users, only vertex-id may be difficult to
> understand.
>
> 2) For sql jobs, we always use a unified parallelism for most vertices. Can
> we provide them with a more convenient setting method instead of each one?
>
>
> Best,
> Shammon
>
>
> On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <matthias.pohl@aiven.io
> .invalid>
> wrote:
>
> > Thanks David for creating this FLIP. It sounds promising and useful to
> > have. Here are some thoughts from my side (some of them might be rather a
> > follow-up and not necessarily part of this FLIP):
> > - I'm wondering whether it makes sense to add some kind of resource ID to
> > the REST API. This would give Flink a tool to verify the PATCH request of
> > the external system in a compare-and-set kind of manner. AFAIU, the
> process
> > requires the external system to retrieve the resource requirements first
> > (to retrieve the vertex IDs). A resource ID <ABC> would be sent along as
> a
> > unique identifier for the provided setup. It's essentially the version ID
> > of the currently deployed resource requirement configuration. Flink
> doesn't
> > know whether the external system would use the provided information in
> some
> > way to derive a new set of resource requirements for this job. The
> > subsequent PATCH request with updated resource requirements would include
> > the previously retrieved resource ID <ABC>. The PATCH call would fail if
> > there was a concurrent PATCH call in between indicating to the external
> > system that the resource requirements were concurrently updated.
> > - How often do we allow resource requirements to be changed? That
> question
> > might make my previous comment on the resource ID obsolete because we
> could
> > just make any PATCH call fail if there was a resource requirement update
> > within a certain time frame before the request. But such a time period is
> > something we might want to make configurable then, I guess.
> > - Versioning the JobGraph in the JobGraphStore rather than overwriting it
> > might be an idea. This would enable us to provide resource requirement
> > changes in the UI or through the REST API. It is related to a problem
> > around keeping track of the exception history within the
> AdaptiveScheduler
> > and also having to consider multiple versions of a JobGraph. But for that
> > one, we use the ExecutionGraphInfoStore right now.
> > - Updating the JobGraph in the JobGraphStore makes sense. I'm just
> > wondering whether we bundle two things together that are actually
> separate:
> > The business logic and the execution configuration (the resource
> > requirements). I'm aware that this is not a flaw of the current FLIP but
> > rather something that was not necessary to address in the past because
> the
> > JobGraph was kind of static. I don't remember whether that was already
> > discussed while working on the AdaptiveScheduler for FLIP-160 [1]. Maybe,
> > I'm missing some functionality here that requires us to have everything
> in
> > one place. But it feels like updating the entire JobGraph which could be
> > actually a "config change" is not reasonable. ...also considering the
> > amount of data that can be stored in a ConfigMap/ZooKeeper node if
> > versioning the resource requirement change as proposed in my previous
> item
> > is an option for us.
> > - Updating the JobGraphStore means adding more requests to the HA backend
> > API. There were some concerns shared in the discussion thread [2] for
> > FLIP-270 [3] on pressuring the k8s API server in the past with too many
> > calls. Eventhough, it's more likely to be caused by checkpointing, I
> still
> > wanted to bring it up. We're working on a standardized performance test
> to
> > prepare going forward with FLIP-270 [3] right now.
> >
> > Best,
> > Matthias
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> > [2] https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> > [3]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> >
> > On Fri, Feb 3, 2023 at 10:31 AM ConradJam <ja...@gmail.com> wrote:
> >
> > > Hi David:
> > >
> > > Thank you for drive this flip, which helps less flink shutdown time
> > >
> > > for this flip, I would like to make a few idea on share
> > >
> > >
> > >    - when the number of "slots" is insufficient, can we can stop users
> > >    rescaling or throw something to tell user "less avaliable slots to
> > > upgrade,
> > >    please checkout your alivalbe slots" ? Or we could have a request
> > >    switch(true/false) to allow this behavior
> > >
> > >
> > >    - when user upgrade job-vertx-parallelism . I want to have an
> > interface
> > >    to query the current update parallel execution status, so that the
> > user
> > > or
> > >    program can understand the current status
> > >    - I want to have an interface to query the current update
> parallelism
> > >    execution status. This also helps similar to *[1] Flink K8S
> Operator*
> > >    management
> > >
> > >
> > > {
> > >   status: Failed
> > >   reason: "less avaliable slots to upgrade, please checkout your
> alivalbe
> > > slots"
> > > }
> > >
> > >
> > >
> > >    - *Pending*: this job now is join the upgrade queue,it will be
> update
> > >    later
> > >    - *Rescaling*: job now is rescaling,wait it finish
> > >    - *Finished*: finish do it
> > >    - *Failed* : something have wrong,so this job is not alivable
> upgrade
> > >
> > > I want to supplement my above content in flip, what do you think ?
> > >
> > >
> > >    1.
> > >
> > https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> > >
> > >
> > > David Morávek <dm...@apache.org> 于2023年2月3日周五 16:42写道:
> > >
> > > > Hi everyone,
> > > >
> > > > This FLIP [1] introduces a new REST API for declaring resource
> > > requirements
> > > > for the Adaptive Scheduler. There seems to be a clear need for this
> API
> > > > based on the discussion on the "Reworking the Rescale API" [2]
> thread.
> > > >
> > > > Before we get started, this work is heavily based on the prototype
> [3]
> > > > created by Till Rohrmann, and the FLIP is being published with his
> > > consent.
> > > > Big shoutout to him!
> > > >
> > > > Last and not least, thanks to Chesnay and Roman for the initial
> reviews
> > > and
> > > > discussions.
> > > >
> > > > The best start would be watching a short demo [4] that I've recorded,
> > > which
> > > > illustrates newly added capabilities (rescaling the running job,
> > handing
> > > > back resources to the RM, and session cluster support).
> > > >
> > > > The intuition behind the FLIP is being able to define resource
> > > requirements
> > > > ("resource boundaries") externally that the AdaptiveScheduler can
> > > navigate
> > > > within. This is a building block for higher-level efforts such as an
> > > > external Autoscaler. The natural extension of this work would be to
> > allow
> > > > to specify per-vertex ResourceProfiles.
> > > >
> > > > Looking forward to your thoughts; any feedback is appreciated!
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > [2] https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > > > [3] https://github.com/tillrohrmann/flink/tree/autoscaling
> > > > [4]
> > > https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
> > > >
> > > > Best,
> > > > D.
> > > >
> > >
> > >
> > > --
> > > Best
> > >
> > > ConradJam
> > >
> >
>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by Shammon FY <zj...@gmail.com>.
Hi David

Thanks for initiating this discussion. I think declaring job resource
requirements via the REST API is very valuable. I have left some comments
below.

1) How about adding some more information, such as the vertex type
(SOURCE/MAP/JOIN, etc.), in the response of `get jobs
resource-requirements`? For users, the vertex-id alone may be difficult to
understand.

2) For SQL jobs, we usually use a unified parallelism for most vertices. Can
we provide a more convenient way to set this, instead of configuring each
vertex individually?


Best,
Shammon


On Fri, Feb 3, 2023 at 8:18 PM Matthias Pohl <ma...@aiven.io.invalid>
wrote:

> Thanks David for creating this FLIP. It sounds promising and useful to
> have. Here are some thoughts from my side (some of them might be rather a
> follow-up and not necessarily part of this FLIP):
> - I'm wondering whether it makes sense to add some kind of resource ID to
> the REST API. This would give Flink a tool to verify the PATCH request of
> the external system in a compare-and-set kind of manner. AFAIU, the process
> requires the external system to retrieve the resource requirements first
> (to retrieve the vertex IDs). A resource ID <ABC> would be sent along as a
> unique identifier for the provided setup. It's essentially the version ID
> of the currently deployed resource requirement configuration. Flink doesn't
> know whether the external system would use the provided information in some
> way to derive a new set of resource requirements for this job. The
> subsequent PATCH request with updated resource requirements would include
> the previously retrieved resource ID <ABC>. The PATCH call would fail if
> there was a concurrent PATCH call in between indicating to the external
> system that the resource requirements were concurrently updated.
> - How often do we allow resource requirements to be changed? That question
> might make my previous comment on the resource ID obsolete because we could
> just make any PATCH call fail if there was a resource requirement update
> within a certain time frame before the request. But such a time period is
> something we might want to make configurable then, I guess.
> - Versioning the JobGraph in the JobGraphStore rather than overwriting it
> might be an idea. This would enable us to provide resource requirement
> changes in the UI or through the REST API. It is related to a problem
> around keeping track of the exception history within the AdaptiveScheduler
> and also having to consider multiple versions of a JobGraph. But for that
> one, we use the ExecutionGraphInfoStore right now.
> - Updating the JobGraph in the JobGraphStore makes sense. I'm just
> wondering whether we bundle two things together that are actually separate:
> The business logic and the execution configuration (the resource
> requirements). I'm aware that this is not a flaw of the current FLIP but
> rather something that was not necessary to address in the past because the
> JobGraph was kind of static. I don't remember whether that was already
> discussed while working on the AdaptiveScheduler for FLIP-160 [1]. Maybe,
> I'm missing some functionality here that requires us to have everything in
> one place. But it feels like updating the entire JobGraph which could be
> actually a "config change" is not reasonable. ...also considering the
> amount of data that can be stored in a ConfigMap/ZooKeeper node if
> versioning the resource requirement change as proposed in my previous item
> is an option for us.
> - Updating the JobGraphStore means adding more requests to the HA backend
> API. There were some concerns shared in the discussion thread [2] for
> FLIP-270 [3] on pressuring the k8s API server in the past with too many
> calls. Even though it's more likely to be caused by checkpointing, I still
> wanted to bring it up. We're working on a standardized performance test to
> prepare going forward with FLIP-270 [3] right now.
>
> Best,
> Matthias
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
> [2] https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
> [3]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
>
> On Fri, Feb 3, 2023 at 10:31 AM ConradJam <ja...@gmail.com> wrote:
>
> > Hi David:
> >
> > Thank you for driving this FLIP, which helps reduce Flink downtime
> >
> > For this FLIP, I would like to share a few ideas:
> >
> >
> >    - when the number of "slots" is insufficient, can we can stop users
> >    rescaling or throw something to tell user "less avaliable slots to
> > upgrade,
> >    please checkout your alivalbe slots" ? Or we could have a request
> >    switch(true/false) to allow this behavior
> >
> >
> >    - when a user updates the job-vertex parallelism, I want an interface
> >    to query the status of the in-progress parallelism update, so that the
> >    user or program can understand the current state
> >    - I want an interface to query the status of the current parallelism
> >    update. This also helps with *[1] Flink K8S Operator*
> >    management
> >
> >
> > {
> >   status: Failed
> >   reason: "less avaliable slots to upgrade, please checkout your alivalbe
> > slots"
> > }
> >
> >
> >
> >    - *Pending*: the job has joined the upgrade queue and will be updated
> >    later
> >    - *Rescaling*: the job is rescaling; wait for it to finish
> >    - *Finished*: the update has completed
> >    - *Failed*: something went wrong, so the job could not be upgraded
> >
> > I would like to add the above content to the FLIP; what do you think?
> >
> >
> >    1.
> >
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
> >
> >
> > David Morávek <dm...@apache.org> wrote on Fri, Feb 3, 2023 at 16:42:
> >
> > > Hi everyone,
> > >
> > > This FLIP [1] introduces a new REST API for declaring resource
> > requirements
> > > for the Adaptive Scheduler. There seems to be a clear need for this API
> > > based on the discussion on the "Reworking the Rescale API" [2] thread.
> > >
> > > Before we get started, this work is heavily based on the prototype [3]
> > > created by Till Rohrmann, and the FLIP is being published with his
> > consent.
> > > Big shoutout to him!
> > >
> > > Last but not least, thanks to Chesnay and Roman for the initial reviews
> > and
> > > discussions.
> > >
> > > The best start would be watching a short demo [4] that I've recorded,
> > which
> > > illustrates newly added capabilities (rescaling the running job,
> handing
> > > back resources to the RM, and session cluster support).
> > >
> > > The intuition behind the FLIP is being able to define resource
> > requirements
> > > ("resource boundaries") externally that the AdaptiveScheduler can
> > navigate
> > > within. This is a building block for higher-level efforts such as an
> > > external Autoscaler. The natural extension of this work would be to
> allow
> > > to specify per-vertex ResourceProfiles.
> > >
> > > Looking forward to your thoughts; any feedback is appreciated!
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > [2] https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > > [3] https://github.com/tillrohrmann/flink/tree/autoscaling
> > > [4]
> > https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
> > >
> > > Best,
> > > D.
> > >
> >
> >
> > --
> > Best
> >
> > ConradJam
> >
>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by Matthias Pohl <ma...@aiven.io.INVALID>.
Thanks David for creating this FLIP. It sounds promising and useful to
have. Here are some thoughts from my side (some of them might rather be
follow-ups and not necessarily part of this FLIP):
- I'm wondering whether it makes sense to add some kind of resource ID to
the REST API. This would give Flink a tool to verify the PATCH request of
the external system in a compare-and-set kind of manner. AFAIU, the process
requires the external system to retrieve the resource requirements first
(to retrieve the vertex IDs). A resource ID <ABC> would be sent along as a
unique identifier for the provided setup. It's essentially the version ID
of the currently deployed resource requirement configuration. Flink doesn't
know whether the external system would use the provided information in some
way to derive a new set of resource requirements for this job. The
subsequent PATCH request with updated resource requirements would include
the previously retrieved resource ID <ABC>. The PATCH call would fail if
there was a concurrent PATCH call in between indicating to the external
system that the resource requirements were concurrently updated.
- How often do we allow resource requirements to be changed? That question
might make my previous comment on the resource ID obsolete because we could
just make any PATCH call fail if there was a resource requirement update
within a certain time frame before the request. But such a time period is
something we might want to make configurable then, I guess.
- Versioning the JobGraph in the JobGraphStore rather than overwriting it
might be an idea. This would enable us to provide resource requirement
changes in the UI or through the REST API. It is related to a problem
around keeping track of the exception history within the AdaptiveScheduler
and also having to consider multiple versions of a JobGraph. But for that
one, we use the ExecutionGraphInfoStore right now.
- Updating the JobGraph in the JobGraphStore makes sense. I'm just
wondering whether we bundle two things together that are actually separate:
The business logic and the execution configuration (the resource
requirements). I'm aware that this is not a flaw of the current FLIP but
rather something that was not necessary to address in the past because the
JobGraph was kind of static. I don't remember whether that was already
discussed while working on the AdaptiveScheduler for FLIP-160 [1]. Maybe,
I'm missing some functionality here that requires us to have everything in
one place. But it feels like updating the entire JobGraph which could be
actually a "config change" is not reasonable. ...also considering the
amount of data that can be stored in a ConfigMap/ZooKeeper node if
versioning the resource requirement change as proposed in my previous item
is an option for us.
- Updating the JobGraphStore means adding more requests to the HA backend
API. There were some concerns shared in the discussion thread [2] for
FLIP-270 [3] on pressuring the k8s API server in the past with too many
calls. Even though it's more likely to be caused by checkpointing, I still
wanted to bring it up. We're working on a standardized performance test to
prepare going forward with FLIP-270 [3] right now.
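
The compare-and-set idea from the first item could be sketched roughly like
this (illustrative only; the transport callables, the resource-ID semantics,
and the payload shape are assumptions for the sketch, not part of the FLIP):

```python
from typing import Callable, Dict, Tuple

# Stand-in transport callables so the compare-and-set logic can be shown
# without a running cluster: `fetch` returns (resource_id, requirements),
# `patch` returns True on success or False on a version conflict (HTTP 409).
Fetch = Callable[[], Tuple[str, Dict[str, dict]]]
Patch = Callable[[str, Dict[str, dict]], bool]

def rescale(fetch: Fetch, patch: Patch, new_upper_bound: int,
            max_retries: int = 3) -> bool:
    """Try to raise every vertex's upper parallelism bound, retrying the
    read-modify-write cycle whenever a concurrent update wins the race."""
    for _ in range(max_retries):
        resource_id, requirements = fetch()
        updated = {
            vertex_id: {**spec, "upperBound": new_upper_bound}
            for vertex_id, spec in requirements.items()
        }
        # The PATCH carries the resource_id we read; the server is expected
        # to reject it if the version changed in the meantime.
        if patch(resource_id, updated):
            return True
    return False
```

The failed PATCH tells the external system its view was stale, and it can
re-derive requirements from the fresh state before retrying.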

Best,
Matthias

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-160%3A+Adaptive+Scheduler
[2] https://lists.apache.org/thread/bm6rmxxk6fbrqfsgz71gvso58950d4mj
[3]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints

On Fri, Feb 3, 2023 at 10:31 AM ConradJam <ja...@gmail.com> wrote:

> Hi David:
>
> Thank you for driving this FLIP, which helps reduce Flink downtime
>
> For this FLIP, I would like to share a few ideas:
>
>
>    - when the number of "slots" is insufficient, can we can stop users
>    rescaling or throw something to tell user "less avaliable slots to
> upgrade,
>    please checkout your alivalbe slots" ? Or we could have a request
>    switch(true/false) to allow this behavior
>
>
>    - when a user updates the job-vertex parallelism, I want an interface
>    to query the status of the in-progress parallelism update, so that the
>    user or program can understand the current state
>    - I want an interface to query the status of the current parallelism
>    update. This also helps with *[1] Flink K8S Operator*
>    management
>
>
> {
>   status: Failed
>   reason: "less avaliable slots to upgrade, please checkout your alivalbe
> slots"
> }
>
>
>
>    - *Pending*: the job has joined the upgrade queue and will be updated
>    later
>    - *Rescaling*: the job is rescaling; wait for it to finish
>    - *Finished*: the update has completed
>    - *Failed*: something went wrong, so the job could not be upgraded
>
> I would like to add the above content to the FLIP; what do you think?
>
>
>    1.
>    https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/
>
>
David Morávek <dm...@apache.org> wrote on Fri, Feb 3, 2023 at 16:42:
>
> > Hi everyone,
> >
> > This FLIP [1] introduces a new REST API for declaring resource
> requirements
> > for the Adaptive Scheduler. There seems to be a clear need for this API
> > based on the discussion on the "Reworking the Rescale API" [2] thread.
> >
> > Before we get started, this work is heavily based on the prototype [3]
> > created by Till Rohrmann, and the FLIP is being published with his
> consent.
> > Big shoutout to him!
> >
> > Last but not least, thanks to Chesnay and Roman for the initial reviews
> and
> > discussions.
> >
> > The best start would be watching a short demo [4] that I've recorded,
> which
> > illustrates newly added capabilities (rescaling the running job, handing
> > back resources to the RM, and session cluster support).
> >
> > The intuition behind the FLIP is being able to define resource
> requirements
> > ("resource boundaries") externally that the AdaptiveScheduler can
> navigate
> > within. This is a building block for higher-level efforts such as an
> > external Autoscaler. The natural extension of this work would be to allow
> > to specify per-vertex ResourceProfiles.
> >
> > Looking forward to your thoughts; any feedback is appreciated!
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > [2] https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> > [3] https://github.com/tillrohrmann/flink/tree/autoscaling
> > [4]
> https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
> >
> > Best,
> > D.
> >
>
>
> --
> Best
>
> ConradJam
>

Re: [DISCUSS] FLIP-291: Externalized Declarative Resource Management

Posted by ConradJam <ja...@gmail.com>.
Hi David:

Thank you for driving this FLIP, which helps reduce Flink downtime

For this FLIP, I would like to share a few ideas:


   - when the number of "slots" is insufficient, can we can stop users
   rescaling or throw something to tell user "less avaliable slots to upgrade,
   please checkout your alivalbe slots" ? Or we could have a request
   switch(true/false) to allow this behavior


   - when a user updates the job-vertex parallelism, I want an interface
   to query the status of the in-progress parallelism update, so that the user
   or program can understand the current state
   - I want an interface to query the status of the current parallelism
   update. This also helps with *[1] Flink K8S Operator*
   management


{
  status: Failed
  reason: "less avaliable slots to upgrade, please checkout your alivalbe slots"
}



   - *Pending*: the job has joined the upgrade queue and will be updated
   later
   - *Rescaling*: the job is rescaling; wait for it to finish
   - *Finished*: the update has completed
   - *Failed*: something went wrong, so the job could not be upgraded
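
A client-side polling loop over such a status endpoint could look roughly
like this (a sketch only; the states follow the list above, and the
`get_status` callable stands in for whatever REST call the FLIP would
eventually expose):

```python
import time
from typing import Callable

# Terminal states from the proposed list above; Pending/Rescaling mean the
# update is still in flight.
TERMINAL_STATES = {"Finished", "Failed"}

def wait_for_rescale(get_status: Callable[[], dict],
                     poll_interval: float = 1.0,
                     timeout: float = 60.0) -> dict:
    """Poll the (hypothetical) rescale-status endpoint until the update
    reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # e.g. {"status": "Rescaling"} or {"status": "Failed", "reason": "..."}
        status = get_status()
        if status["status"] in TERMINAL_STATES:
            return status
        time.sleep(poll_interval)
    raise TimeoutError("rescale did not reach a terminal state in time")
```

An operator (such as the Flink K8S Operator referenced above) could use the
returned status/reason pair to decide whether to retry or surface the error.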

I would like to add the above content to the FLIP; what do you think?


   1.
   https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/


David Morávek <dm...@apache.org> wrote on Fri, Feb 3, 2023 at 16:42:

> Hi everyone,
>
> This FLIP [1] introduces a new REST API for declaring resource requirements
> for the Adaptive Scheduler. There seems to be a clear need for this API
> based on the discussion on the "Reworking the Rescale API" [2] thread.
>
> Before we get started, this work is heavily based on the prototype [3]
> created by Till Rohrmann, and the FLIP is being published with his consent.
> Big shoutout to him!
>
> > Last but not least, thanks to Chesnay and Roman for the initial reviews and
> discussions.
>
> The best start would be watching a short demo [4] that I've recorded, which
> illustrates newly added capabilities (rescaling the running job, handing
> back resources to the RM, and session cluster support).
>
> The intuition behind the FLIP is being able to define resource requirements
> ("resource boundaries") externally that the AdaptiveScheduler can navigate
> within. This is a building block for higher-level efforts such as an
> external Autoscaler. The natural extension of this work would be to allow
> to specify per-vertex ResourceProfiles.
>
> Looking forward to your thoughts; any feedback is appreciated!
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> [2] https://lists.apache.org/thread/2f7dgr88xtbmsohtr0f6wmsvw8sw04f5
> [3] https://github.com/tillrohrmann/flink/tree/autoscaling
> [4] https://drive.google.com/file/d/1Vp8W-7Zk_iKXPTAiBT-eLPmCMd_I57Ty/view
>
> Best,
> D.
>


-- 
Best

ConradJam