You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@flink.apache.org by Samrat Deb <de...@gmail.com> on 2023/02/16 09:26:44 UTC

[DISCUSS] Extract core autoscaling algorithm as new SubModule in flink-kubernetes-operator

Hi ,

*Context:*
Auto Scaling was introduced in Flink as part of FLIP-271[1].
It discusses one of the important aspects to provide a robust default
scaling algorithm.
      a. Ensure scaling yields effective usage of assigned task slots.
      b. Ramp up in case of any backlog to ensure it gets processed in a
timely manner
      c. Minimize the number of scaling decisions to prevent costly rescale
operation
The flip intends to add an auto scaling framework based on 6 major metrics
and contains different types of threshold to trigger the scaling.

Thread[2] discusses a different problem: why autoscaler is part of the
operator instead of jobmanager at runtime.
The Community decided to keep the autoscaling logic in the
flink-kubernetes-operator.

*Proposal: *
In this discussion, I want to put forward a thought of extracting out the
auto scaling logic into a new submodule in flink-kubernetes-operator
repository[3],
which will be independent of any resource manager/Operator.
Currently the Autoscaling algorithm is very tightly coupled with the
kubernetes API.
This makes the autoscaling core algorithm not so easily extensible for
different available resource managers like YARN, Mesos etc.
A Separate autoscaling module inside the flink kubernetes operator will
help other resource managers to leverage the autoscaling logic.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
[2] https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
[3] https://github.com/apache/flink-kubernetes-operator


Bests,
Samrat

Re: [DISCUSS] Extract core autoscaling algorithm as new SubModule in flink-kubernetes-operator

Posted by Matt Wang <wa...@163.com>.

Hi,
Thank you gays for bringing this up, we're very interested in that as well.

We are currently migrating from yarn to kubernetes, but this will last for a long time, so the support of yarn is also more important. We have now started to promote Autoscaling in our internal business. The model we use is the DS2 model similar to flip-271. In the near future, we will also communicate with you about the problems we encounter online.

Best,
Matt Wang

My team is also looking forward to autoscaler is compatible with yarn.

Currently, all of our flink jobs are running on yarn. And autoscaler is
a great feature for flink users, it can greatly simplify the process of
tuning parallelism.

If the autoscaler supports yarn, I propose to divide it into two stages:
1. It only collects and evaluates scaling related performance metrics
but does not trigger any job upgrades.
2. Support for automatic upgrades of yarn jobs.

Also, I also hope to join it, and improve it together.

And very happy Gyula can help with the review.

Best,
Rui Fan

On Mon, Feb 20, 2023 at 8:56 AM Shammon FY <zj...@gmail.com> wrote:

Hi Samrat

My team is also looking at this piece. After you give your proposal, we
also hope to join it with you if possible. I hope we can improve this
together for use in our production too, thanks :)

Best,
Shammon

On Fri, Feb 17, 2023 at 9:27 PM Samrat Deb <de...@gmail.com> wrote:

@Gyula
Thank you
We will work on this and try to come up with an approach.

On Fri, Feb 17, 2023 at 6:12 PM Gyula Fóra <gy...@gmail.com> wrote:

In case you guys feel strongly about this I suggest you try to fork the
autoscaler implementation and make a version that works with both the
Kubernetes operator and YARN.
If your solution is generic and works well, we can discuss the way
forward.

Unfortunately me or my team don't really have the resources to assist
you
with the YARN effort as we are mostly invested in Kubernetes but of
course
we are happy to review your work.

Gyula

On Fri, Feb 17, 2023 at 1:09 PM Prabhu Joseph <
prabhujose.gates@gmail.com>
wrote:

@Gyula

It is easier to make the operator work with jobs running in
different
types of clusters than to take the
autoscaler module itself and plug that in somewhere else.

Our (part of Samrat's team) main problem is to leverage the
AutoScaler
Recommendation Engine part of Flink-Kubernetes-Operator for our Flink
jobs
running on YARN.
Currently, it is not feasible as the autoscaler module is tightly
coupled
with the operator. We agree that the operator serves the two core
requirements, but the operator itself
cannot be used for Flink jobs running on YARN. Those core
requirements
are
solved through other mechanisms in the case of YARN. But the main
problem
for us is *how to*
*use the AutoScaler Recommendation Engine for Flink Jobs on YARN.*

On Fri, Feb 17, 2023 at 6:34 AM Shammon FY <zj...@gmail.com>
wrote:

Hi Gyula, Samrat

Thanks for your input and I totally agree with you that it's really
big
work. As @Samrat mentioned above, I think it's not a short way to
make
the
autoscaler completely independent too. But I still find some
valuable
points for the `completely independent autoscaler`, and I think
this
may
be
the goal we need to achieve in the future.

1. A large k8s cluster may manage thousands of machines, and users
may
run
tens of thousands flink jobs in one k8s cluster. If the autoscaler
manages
all these jobs, the autoscaler should be horizontal expansion.

2. As you mentioned, "execute the job stateful upgrades safely" is
indeed a
complexity work, but I think we should decouple it from k8s
operator

a) In addition to k8s, there may be some other resource management

b) Flink may support more scaler operations by REST API, such as
FLIP-291
[1]

c) In our production environment, there's a 'Job Submission
Gateway'
which
stores job info and config, monitors the status of running jobs.
After
the
autoscaler upgrades the job, it must update the config in Gateway
and
users
can restart his job with the updated config to avoid resource
conflict.
Under these circumstances, the autoscaler sending upgrade requests
to
the
gateway may be a good choice.

[1]

https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management

Best,
Shammon

On Thu, Feb 16, 2023 at 11:03 PM Gyula Fóra <gy...@gmail.com>
wrote:

@Shammon , Samrat:

I appreciate the enthusiasm and I wish this was only a matter of
intention
but making the autoscaler work without the operator may be a
pretty
big
task.
You must not forget 2 core requirements here.

1. The autoscaler logic itself has to run somewhere (in this case
on
k8s
within the operator)S
2. Something has to execute the job stateful upgrades safely
based
on
the
scaling decisions (in this case the operator does that).

1. Can be solved almost anywhere easily however you need
resiliency
etc
for
this to be a prod application, 2. is the really tricky part. The
operator
was actually built to execute job upgrades, if you look at the
code
you
will appreciate the complexity of the task.

As I said in the earlier thread. It is easier to make the
operator
work
with jobs running in different types of clusters than to take the
autoscaler module itself and plug that in somewhere else.

Gyula

On Thu, Feb 16, 2023 at 3:12 PM Samrat Deb <
decordeapex@gmail.com>
wrote:

Hi Shammon,

Thank you for your input, completely aligned with you.

We are fine with either of the options ,

but IMO, to start with it will be easy to have it in the
flink-kubernetes-operator as a module instead of a separate
repo
which
requires additional effort.

Given that we would be incrementally working on making an
autoscaling
recommendation framework generic enough,

Once it reaches a point where the community feels it needs to
be
moved
to a
separate repo we can take a call.

Bests,

Samrat

On Thu, Feb 16, 2023 at 7:37 PM Samrat Deb <
decordeapex@gmail.com>
wrote:

Hi Max ,
If you are fine and aligned with the same thought , since
this
is
going
to
be very useful to us, we are ready to help / contribute
additional
work
required.

Bests,
Samrat

On Thu, 16 Feb 2023 at 5:28 PM, Shammon FY <
zjureel@gmail.com>
wrote:

Hi Samrat

Do you mean to create an independent module for flink
scaling
in
flink-k8s-operator? How about creating a project such as
`flink-auto-scaling` which is completely independent?
Besides
resource
managers such as k8s and yarn, we can do more things in the
project,
for
example, updating config in the user's `job submission
system`
after
scaling flink jobs. WDYT?

Best,
Shammon

On Thu, Feb 16, 2023 at 7:38 PM Maximilian Michels <
mxm@apache.org>
wrote:

Hi Samrat,

The autoscaling module is now pluggable but it is still
tightly
coupled with Kubernetes. It will take additional work for
the
logic
to
work independently of the cluster manager.

-Max

On Thu, Feb 16, 2023 at 11:14 AM Samrat Deb <
decordeapex@gmail.com>
wrote:

Oh! yesterday it got merged.
Apologies , I missed the recent commit @Gyula.

Thanks for the update

On Thu, Feb 16, 2023 at 3:17 PM Gyula Fóra <
gyula.fora@gmail.com>
wrote:

Max recently moved the autoscaler logic in a separate
submodule,
did
you
see that?

https://github.com/apache/flink-kubernetes-operator/commit/5bb8e9dc4dd29e10f3ba7c8ce7cefcdffbf92da4

Gyula

On Thu, Feb 16, 2023 at 10:27 AM Samrat Deb <
decordeapex@gmail.com>
wrote:

Hi ,

*Context:*
Auto Scaling was introduced in Flink as part of
FLIP-271[1].
It discusses one of the important aspects to
provide a
robust
default
scaling algorithm.
a. Ensure scaling yields effective usage of
assigned
task
slots.
b. Ramp up in case of any backlog to ensure it
gets
processed
in a
timely manner
c. Minimize the number of scaling decisions to
prevent
costly
rescale
operation
The flip intends to add an auto scaling framework
based
on 6
major
metrics
and contains different types of threshold to trigger
the
scaling.

Thread[2] discusses a different problem: why
autoscaler
is
part
of
the
operator instead of jobmanager at runtime.
The Community decided to keep the autoscaling logic
in
the
flink-kubernetes-operator.

*Proposal: *
In this discussion, I want to put forward a thought
of
extracting
out the
auto scaling logic into a new submodule in
flink-kubernetes-operator
repository[3],
which will be independent of any resource
manager/Operator.
Currently the Autoscaling algorithm is very tightly
coupled
with
the
kubernetes API.
This makes the autoscaling core algorithm not so
easily
extensible
for
different available resource managers like YARN,
Mesos
etc.
A Separate autoscaling module inside the flink
kubernetes
operator
will
help other resource managers to leverage the
autoscaling
logic.

[1]

https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
[2]

https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
[3]
https://github.com/apache/flink-kubernetes-operator

Bests,
Samrat

Re: [DISCUSS] Extract core autoscaling algorithm as new SubModule in flink-kubernetes-operator

Posted by Rui Fan <19...@gmail.com>.

Hi Gyula, Samrat and Shammon,

My team is also looking forward to autoscaler is compatible with yarn.

Currently, all of our flink jobs are running on yarn. And autoscaler is
a great feature for flink users, it can greatly simplify the process of
tuning parallelism.

If the autoscaler supports yarn, I propose to divide it into two stages:
1. It only collects and evaluates scaling related performance metrics
 but does not trigger any job upgrades.
2. Support for automatic upgrades of yarn jobs.

Also, I also hope to join it, and improve it together.

And very happy Gyula can help with the review.

Best,
Rui Fan

On Mon, Feb 20, 2023 at 8:56 AM Shammon FY <zj...@gmail.com> wrote:

> Hi Samrat
>
> My team is also looking at this piece. After you give your proposal, we
> also hope to join it with you if possible. I hope we can improve this
> together for use in our production too, thanks :)
>
> Best,
> Shammon
>
> On Fri, Feb 17, 2023 at 9:27 PM Samrat Deb <de...@gmail.com> wrote:
>
> > @Gyula
> > Thank you
> > We will work on this and try to come up with an approach.
> >
> >
> >
> >
> > On Fri, Feb 17, 2023 at 6:12 PM Gyula Fóra <gy...@gmail.com> wrote:
> >
> > > In case you guys feel strongly about this I suggest you try to fork the
> > > autoscaler implementation and make a version that works with both the
> > > Kubernetes operator and YARN.
> > > If your solution is generic and works well, we can discuss the way
> > forward.
> > >
> > > Unfortunately me or my team don't really have the resources to assist
> you
> > > with the YARN effort as we are mostly invested in Kubernetes but of
> > course
> > > we are happy to review your work.
> > >
> > > Gyula
> > >
> > >
> > > On Fri, Feb 17, 2023 at 1:09 PM Prabhu Joseph <
> > prabhujose.gates@gmail.com>
> > > wrote:
> > >
> > > > @Gyula
> > > >
> > > > >> It is easier to make the operator work with jobs running in
> > different
> > > > types of clusters than to take the
> > > > autoscaler module itself and plug that in somewhere else.
> > > >
> > > > Our (part of Samrat's team) main problem is to leverage the
> AutoScaler
> > > > Recommendation Engine part of Flink-Kubernetes-Operator for our Flink
> > > jobs
> > > > running on YARN.
> > > > Currently, it is not feasible as the autoscaler module is tightly
> > coupled
> > > > with the operator. We agree that the operator serves the two core
> > > > requirements, but the operator itself
> > > > cannot be used for Flink jobs running on YARN. Those core
> requirements
> > > are
> > > > solved through other mechanisms in the case of YARN. But the main
> > problem
> > > > for us is *how to*
> > > > *use the AutoScaler Recommendation Engine for Flink Jobs on YARN.*
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Feb 17, 2023 at 6:34 AM Shammon FY <zj...@gmail.com>
> wrote:
> > > >
> > > > > Hi Gyula, Samrat
> > > > >
> > > > > Thanks for your input and I totally agree with you that it's really
> > big
> > > > > work. As @Samrat mentioned above, I think it's not a short way to
> > make
> > > > the
> > > > > autoscaler completely independent too. But I still find some
> valuable
> > > > > points for the `completely independent autoscaler`, and I think
> this
> > > may
> > > > be
> > > > > the goal we need to achieve in the future.
> > > > >
> > > > > 1. A large k8s cluster may manage thousands of machines, and users
> > may
> > > > run
> > > > > tens of thousands flink jobs in one k8s cluster. If the autoscaler
> > > > manages
> > > > > all these jobs, the autoscaler should be horizontal expansion.
> > > > >
> > > > > 2. As you mentioned, "execute the job stateful upgrades safely" is
> > > > indeed a
> > > > > complexity work, but I think we should decouple it from k8s
> operator
> > > > >
> > > > > a) In addition to k8s, there may be some other resource management
> > > > >
> > > > > b) Flink may support more scaler operations by REST API, such as
> > > FLIP-291
> > > > > [1]
> > > > >
> > > > > c) In our production environment, there's a 'Job Submission
> Gateway'
> > > > which
> > > > > stores job info and config, monitors the status of running jobs.
> > After
> > > > the
> > > > > autoscaler upgrades the job, it must update the config in Gateway
> and
> > > > users
> > > > > can restart his job with the updated config to avoid resource
> > conflict.
> > > > > Under these circumstances, the autoscaler sending upgrade requests
> to
> > > the
> > > > > gateway may be a good choice.
> > > > >
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > >
> > > > >
> > > > > Best,
> > > > > Shammon
> > > > >
> > > > >
> > > > > On Thu, Feb 16, 2023 at 11:03 PM Gyula Fóra <gy...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > @Shammon , Samrat:
> > > > > >
> > > > > > I appreciate the enthusiasm and I wish this was only a matter of
> > > > > intention
> > > > > > but making the autoscaler work without the operator may be a
> pretty
> > > big
> > > > > > task.
> > > > > > You must not forget 2 core requirements here.
> > > > > >
> > > > > > 1. The autoscaler logic itself has to run somewhere (in this case
> > on
> > > > k8s
> > > > > > within the operator)S
> > > > > > 2. Something has to execute the job stateful upgrades safely
> based
> > on
> > > > the
> > > > > > scaling decisions (in this case the operator does that).
> > > > > >
> > > > > > 1. Can be solved almost anywhere easily however you need
> resiliency
> > > etc
> > > > > for
> > > > > > this to be a prod application, 2. is the really tricky part. The
> > > > operator
> > > > > > was actually built to execute job upgrades, if you look at the
> code
> > > you
> > > > > > will appreciate the complexity of the task.
> > > > > >
> > > > > > As I said in the earlier thread. It is easier to make the
> operator
> > > work
> > > > > > with jobs running in different types of clusters than to take the
> > > > > > autoscaler module itself and plug that in somewhere else.
> > > > > >
> > > > > > Gyula
> > > > > >
> > > > > >
> > > > > > On Thu, Feb 16, 2023 at 3:12 PM Samrat Deb <
> decordeapex@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hi Shammon,
> > > > > > >
> > > > > > > Thank you for your input, completely aligned with you.
> > > > > > >
> > > > > > > We are fine with either of the options ,
> > > > > > >
> > > > > > > but IMO, to start with it will be easy to have it in the
> > > > > > > flink-kubernetes-operator as a module instead of a separate
> repo
> > > > which
> > > > > > > requires additional effort.
> > > > > > >
> > > > > > > Given that we would be incrementally working on making an
> > > autoscaling
> > > > > > > recommendation framework generic enough,
> > > > > > >
> > > > > > > Once it reaches a point where the community feels it needs to
> be
> > > > moved
> > > > > > to a
> > > > > > > separate repo we can take a call.
> > > > > > >
> > > > > > > Bests,
> > > > > > >
> > > > > > > Samrat
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Feb 16, 2023 at 7:37 PM Samrat Deb <
> > decordeapex@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Max ,
> > > > > > > > If you are fine and aligned with the same thought , since
> this
> > is
> > > > > going
> > > > > > > to
> > > > > > > > be very useful to us, we are ready to help / contribute
> > > additional
> > > > > work
> > > > > > > > required.
> > > > > > > >
> > > > > > > > Bests,
> > > > > > > > Samrat
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, 16 Feb 2023 at 5:28 PM, Shammon FY <
> zjureel@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > >> Hi Samrat
> > > > > > > >>
> > > > > > > >> Do you mean to create an independent module for flink
> scaling
> > in
> > > > > > > >> flink-k8s-operator? How about creating a project such as
> > > > > > > >> `flink-auto-scaling` which is completely independent?
> Besides
> > > > > resource
> > > > > > > >> managers such as k8s and yarn, we can do more things in the
> > > > project,
> > > > > > for
> > > > > > > >> example, updating config in the user's `job submission
> system`
> > > > after
> > > > > > > >> scaling flink jobs. WDYT?
> > > > > > > >>
> > > > > > > >> Best,
> > > > > > > >> Shammon
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Thu, Feb 16, 2023 at 7:38 PM Maximilian Michels <
> > > > mxm@apache.org>
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >> > Hi Samrat,
> > > > > > > >> >
> > > > > > > >> > The autoscaling module is now pluggable but it is still
> > > tightly
> > > > > > > >> > coupled with Kubernetes. It will take additional work for
> > the
> > > > > logic
> > > > > > to
> > > > > > > >> > work independently of the cluster manager.
> > > > > > > >> >
> > > > > > > >> > -Max
> > > > > > > >> >
> > > > > > > >> > On Thu, Feb 16, 2023 at 11:14 AM Samrat Deb <
> > > > > decordeapex@gmail.com>
> > > > > > > >> wrote:
> > > > > > > >> > >
> > > > > > > >> > > Oh! yesterday it got merged.
> > > > > > > >> > > Apologies , I missed the recent commit @Gyula.
> > > > > > > >> > >
> > > > > > > >> > > Thanks for the update
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > On Thu, Feb 16, 2023 at 3:17 PM Gyula Fóra <
> > > > > gyula.fora@gmail.com>
> > > > > > > >> wrote:
> > > > > > > >> > >
> > > > > > > >> > > > Max recently moved the autoscaler logic in a separate
> > > > > submodule,
> > > > > > > did
> > > > > > > >> > you
> > > > > > > >> > > > see that?
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/flink-kubernetes-operator/commit/5bb8e9dc4dd29e10f3ba7c8ce7cefcdffbf92da4
> > > > > > > >> > > >
> > > > > > > >> > > > Gyula
> > > > > > > >> > > >
> > > > > > > >> > > > On Thu, Feb 16, 2023 at 10:27 AM Samrat Deb <
> > > > > > > decordeapex@gmail.com>
> > > > > > > >> > wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > > Hi ,
> > > > > > > >> > > > >
> > > > > > > >> > > > > *Context:*
> > > > > > > >> > > > > Auto Scaling was introduced in Flink as part of
> > > > FLIP-271[1].
> > > > > > > >> > > > > It discusses one of the important aspects to
> provide a
> > > > > robust
> > > > > > > >> default
> > > > > > > >> > > > > scaling algorithm.
> > > > > > > >> > > > >       a. Ensure scaling yields effective usage of
> > > assigned
> > > > > > task
> > > > > > > >> > slots.
> > > > > > > >> > > > >       b. Ramp up in case of any backlog to ensure it
> > > gets
> > > > > > > >> processed
> > > > > > > >> > in a
> > > > > > > >> > > > > timely manner
> > > > > > > >> > > > >       c. Minimize the number of scaling decisions to
> > > > prevent
> > > > > > > >> costly
> > > > > > > >> > > > rescale
> > > > > > > >> > > > > operation
> > > > > > > >> > > > > The flip intends to add an auto scaling framework
> > based
> > > > on 6
> > > > > > > major
> > > > > > > >> > > > metrics
> > > > > > > >> > > > > and contains different types of threshold to trigger
> > the
> > > > > > > scaling.
> > > > > > > >> > > > >
> > > > > > > >> > > > > Thread[2] discusses a different problem: why
> > autoscaler
> > > is
> > > > > > part
> > > > > > > of
> > > > > > > >> > the
> > > > > > > >> > > > > operator instead of jobmanager at runtime.
> > > > > > > >> > > > > The Community decided to keep the autoscaling logic
> in
> > > the
> > > > > > > >> > > > > flink-kubernetes-operator.
> > > > > > > >> > > > >
> > > > > > > >> > > > > *Proposal: *
> > > > > > > >> > > > > In this discussion, I want to put forward a thought
> of
> > > > > > > extracting
> > > > > > > >> > out the
> > > > > > > >> > > > > auto scaling logic into a new submodule in
> > > > > > > >> flink-kubernetes-operator
> > > > > > > >> > > > > repository[3],
> > > > > > > >> > > > > which will be independent of any resource
> > > > manager/Operator.
> > > > > > > >> > > > > Currently the Autoscaling algorithm is very tightly
> > > > coupled
> > > > > > with
> > > > > > > >> the
> > > > > > > >> > > > > kubernetes API.
> > > > > > > >> > > > > This makes the autoscaling core algorithm not so
> > easily
> > > > > > > extensible
> > > > > > > >> > for
> > > > > > > >> > > > > different available resource managers like YARN,
> Mesos
> > > > etc.
> > > > > > > >> > > > > A Separate autoscaling module inside the flink
> > > kubernetes
> > > > > > > operator
> > > > > > > >> > will
> > > > > > > >> > > > > help other resource managers to leverage the
> > autoscaling
> > > > > > logic.
> > > > > > > >> > > > >
> > > > > > > >> > > > > [1]
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
> > > > > > > >> > > > > [2]
> > > > > > > >>
> > > https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
> > > > > > > >> > > > > [3]
> > https://github.com/apache/flink-kubernetes-operator
> > > > > > > >> > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > > > Bests,
> > > > > > > >> > > > > Samrat
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Extract core autoscaling algorithm as new SubModule in flink-kubernetes-operator

Posted by Shammon FY <zj...@gmail.com>.

Hi Samrat

My team is also looking at this piece. After you give your proposal, we
also hope to join it with you if possible. I hope we can improve this
together for use in our production too, thanks :)

Best,
Shammon

On Fri, Feb 17, 2023 at 9:27 PM Samrat Deb <de...@gmail.com> wrote:

> @Gyula
> Thank you
> We will work on this and try to come up with an approach.
>
>
>
>
> On Fri, Feb 17, 2023 at 6:12 PM Gyula Fóra <gy...@gmail.com> wrote:
>
> > In case you guys feel strongly about this I suggest you try to fork the
> > autoscaler implementation and make a version that works with both the
> > Kubernetes operator and YARN.
> > If your solution is generic and works well, we can discuss the way
> forward.
> >
> > Unfortunately me or my team don't really have the resources to assist you
> > with the YARN effort as we are mostly invested in Kubernetes but of
> course
> > we are happy to review your work.
> >
> > Gyula
> >
> >
> > On Fri, Feb 17, 2023 at 1:09 PM Prabhu Joseph <
> prabhujose.gates@gmail.com>
> > wrote:
> >
> > > @Gyula
> > >
> > > >> It is easier to make the operator work with jobs running in
> different
> > > types of clusters than to take the
> > > autoscaler module itself and plug that in somewhere else.
> > >
> > > Our (part of Samrat's team) main problem is to leverage the AutoScaler
> > > Recommendation Engine part of Flink-Kubernetes-Operator for our Flink
> > jobs
> > > running on YARN.
> > > Currently, it is not feasible as the autoscaler module is tightly
> coupled
> > > with the operator. We agree that the operator serves the two core
> > > requirements, but the operator itself
> > > cannot be used for Flink jobs running on YARN. Those core requirements
> > are
> > > solved through other mechanisms in the case of YARN. But the main
> problem
> > > for us is *how to*
> > > *use the AutoScaler Recommendation Engine for Flink Jobs on YARN.*
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Feb 17, 2023 at 6:34 AM Shammon FY <zj...@gmail.com> wrote:
> > >
> > > > Hi Gyula, Samrat
> > > >
> > > > Thanks for your input and I totally agree with you that it's really
> big
> > > > work. As @Samrat mentioned above, I think it's not a short way to
> make
> > > the
> > > > autoscaler completely independent too. But I still find some valuable
> > > > points for the `completely independent autoscaler`, and I think this
> > may
> > > be
> > > > the goal we need to achieve in the future.
> > > >
> > > > 1. A large k8s cluster may manage thousands of machines, and users
> may
> > > run
> > > > tens of thousands flink jobs in one k8s cluster. If the autoscaler
> > > manages
> > > > all these jobs, the autoscaler should be horizontal expansion.
> > > >
> > > > 2. As you mentioned, "execute the job stateful upgrades safely" is
> > > indeed a
> > > > complexity work, but I think we should decouple it from k8s operator
> > > >
> > > > a) In addition to k8s, there may be some other resource management
> > > >
> > > > b) Flink may support more scaler operations by REST API, such as
> > FLIP-291
> > > > [1]
> > > >
> > > > c) In our production environment, there's a 'Job Submission Gateway'
> > > which
> > > > stores job info and config, monitors the status of running jobs.
> After
> > > the
> > > > autoscaler upgrades the job, it must update the config in Gateway and
> > > users
> > > > can restart his job with the updated config to avoid resource
> conflict.
> > > > Under these circumstances, the autoscaler sending upgrade requests to
> > the
> > > > gateway may be a good choice.
> > > >
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > >
> > > >
> > > > Best,
> > > > Shammon
> > > >
> > > >
> > > > On Thu, Feb 16, 2023 at 11:03 PM Gyula Fóra <gy...@gmail.com>
> > > wrote:
> > > >
> > > > > @Shammon , Samrat:
> > > > >
> > > > > I appreciate the enthusiasm and I wish this was only a matter of
> > > > intention
> > > > > but making the autoscaler work without the operator may be a pretty
> > big
> > > > > task.
> > > > > You must not forget 2 core requirements here.
> > > > >
> > > > > 1. The autoscaler logic itself has to run somewhere (in this case
> on
> > > k8s
> > > > > within the operator)S
> > > > > 2. Something has to execute the job stateful upgrades safely based
> on
> > > the
> > > > > scaling decisions (in this case the operator does that).
> > > > >
> > > > > 1. Can be solved almost anywhere easily however you need resiliency
> > etc
> > > > for
> > > > > this to be a prod application, 2. is the really tricky part. The
> > > operator
> > > > > was actually built to execute job upgrades, if you look at the code
> > you
> > > > > will appreciate the complexity of the task.
> > > > >
> > > > > As I said in the earlier thread. It is easier to make the operator
> > work
> > > > > with jobs running in different types of clusters than to take the
> > > > > autoscaler module itself and plug that in somewhere else.
> > > > >
> > > > > Gyula
> > > > >
> > > > >
> > > > > On Thu, Feb 16, 2023 at 3:12 PM Samrat Deb <de...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi Shammon,
> > > > > >
> > > > > > Thank you for your input, completely aligned with you.
> > > > > >
> > > > > > We are fine with either of the options ,
> > > > > >
> > > > > > but IMO, to start with it will be easy to have it in the
> > > > > > flink-kubernetes-operator as a module instead of a separate repo
> > > which
> > > > > > requires additional effort.
> > > > > >
> > > > > > Given that we would be incrementally working on making an
> > autoscaling
> > > > > > recommendation framework generic enough,
> > > > > >
> > > > > > Once it reaches a point where the community feels it needs to be
> > > moved
> > > > > to a
> > > > > > separate repo we can take a call.
> > > > > >
> > > > > > Bests,
> > > > > >
> > > > > > Samrat
> > > > > >
> > > > > >
> > > > > > On Thu, Feb 16, 2023 at 7:37 PM Samrat Deb <
> decordeapex@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hi Max ,
> > > > > > > If you are fine and aligned with the same thought , since this
> is
> > > > going
> > > > > > to
> > > > > > > be very useful to us, we are ready to help / contribute
> > additional
> > > > work
> > > > > > > required.
> > > > > > >
> > > > > > > Bests,
> > > > > > > Samrat
> > > > > > >
> > > > > > >
> > > > > > > On Thu, 16 Feb 2023 at 5:28 PM, Shammon FY <zj...@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > >> Hi Samrat
> > > > > > >>
> > > > > > >> Do you mean to create an independent module for flink scaling
> in
> > > > > > >> flink-k8s-operator? How about creating a project such as
> > > > > > >> `flink-auto-scaling` which is completely independent? Besides
> > > > resource
> > > > > > >> managers such as k8s and yarn, we can do more things in the
> > > project,
> > > > > for
> > > > > > >> example, updating config in the user's `job submission system`
> > > after
> > > > > > >> scaling flink jobs. WDYT?
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Shammon
> > > > > > >>
> > > > > > >>
> > > > > > >> On Thu, Feb 16, 2023 at 7:38 PM Maximilian Michels <
> > > mxm@apache.org>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > Hi Samrat,
> > > > > > >> >
> > > > > > >> > The autoscaling module is now pluggable but it is still
> > tightly
> > > > > > >> > coupled with Kubernetes. It will take additional work for
> the
> > > > logic
> > > > > to
> > > > > > >> > work independently of the cluster manager.
> > > > > > >> >
> > > > > > >> > -Max
> > > > > > >> >
> > > > > > >> > On Thu, Feb 16, 2023 at 11:14 AM Samrat Deb <
> > > > decordeapex@gmail.com>
> > > > > > >> wrote:
> > > > > > >> > >
> > > > > > >> > > Oh! yesterday it got merged.
> > > > > > >> > > Apologies , I missed the recent commit @Gyula.
> > > > > > >> > >
> > > > > > >> > > Thanks for the update
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > On Thu, Feb 16, 2023 at 3:17 PM Gyula Fóra <
> > > > gyula.fora@gmail.com>
> > > > > > >> wrote:
> > > > > > >> > >
> > > > > > >> > > > Max recently moved the autoscaler logic in a separate
> > > > submodule,
> > > > > > did
> > > > > > >> > you
> > > > > > >> > > > see that?
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/flink-kubernetes-operator/commit/5bb8e9dc4dd29e10f3ba7c8ce7cefcdffbf92da4
> > > > > > >> > > >
> > > > > > >> > > > Gyula
> > > > > > >> > > >
> > > > > > >> > > > On Thu, Feb 16, 2023 at 10:27 AM Samrat Deb <
> > > > > > decordeapex@gmail.com>
> > > > > > >> > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Hi ,
> > > > > > >> > > > >
> > > > > > >> > > > > *Context:*
> > > > > > >> > > > > Auto Scaling was introduced in Flink as part of
> > > FLIP-271[1].
> > > > > > >> > > > > It discusses one of the important aspects to provide a
> > > > robust
> > > > > > >> default
> > > > > > >> > > > > scaling algorithm.
> > > > > > >> > > > >       a. Ensure scaling yields effective usage of
> > assigned
> > > > > task
> > > > > > >> > slots.
> > > > > > >> > > > >       b. Ramp up in case of any backlog to ensure it
> > gets
> > > > > > >> processed
> > > > > > >> > in a
> > > > > > >> > > > > timely manner
> > > > > > >> > > > >       c. Minimize the number of scaling decisions to
> > > prevent
> > > > > > >> costly
> > > > > > >> > > > rescale
> > > > > > >> > > > > operation
> > > > > > >> > > > > The flip intends to add an auto scaling framework
> based
> > > on 6
> > > > > > major
> > > > > > >> > > > metrics
> > > > > > >> > > > > and contains different types of threshold to trigger
> the
> > > > > > scaling.
> > > > > > >> > > > >
> > > > > > >> > > > > Thread[2] discusses a different problem: why
> autoscaler
> > is
> > > > > part
> > > > > > of
> > > > > > >> > the
> > > > > > >> > > > > operator instead of jobmanager at runtime.
> > > > > > >> > > > > The Community decided to keep the autoscaling logic in
> > the
> > > > > > >> > > > > flink-kubernetes-operator.
> > > > > > >> > > > >
> > > > > > >> > > > > *Proposal: *
> > > > > > >> > > > > In this discussion, I want to put forward a thought of
> > > > > > extracting
> > > > > > >> > out the
> > > > > > >> > > > > auto scaling logic into a new submodule in
> > > > > > >> flink-kubernetes-operator
> > > > > > >> > > > > repository[3],
> > > > > > >> > > > > which will be independent of any resource
> > > manager/Operator.
> > > > > > >> > > > > Currently the Autoscaling algorithm is very tightly
> > > coupled
> > > > > with
> > > > > > >> the
> > > > > > >> > > > > kubernetes API.
> > > > > > >> > > > > This makes the autoscaling core algorithm not so
> easily
> > > > > > extensible
> > > > > > >> > for
> > > > > > >> > > > > different available resource managers like YARN, Mesos
> > > etc.
> > > > > > >> > > > > A Separate autoscaling module inside the flink
> > kubernetes
> > > > > > operator
> > > > > > >> > will
> > > > > > >> > > > > help other resource managers to leverage the
> autoscaling
> > > > > logic.
> > > > > > >> > > > >
> > > > > > >> > > > > [1]
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
> > > > > > >> > > > > [2]
> > > > > > >>
> > https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
> > > > > > >> > > > > [3]
> https://github.com/apache/flink-kubernetes-operator
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > > Bests,
> > > > > > >> > > > > Samrat
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Extract core autoscaling algorithm as new SubModule in flink-kubernetes-operator

Posted by Samrat Deb <de...@gmail.com>.

@Gyula
Thank you
We will work on this and try to come up with an approach.




On Fri, Feb 17, 2023 at 6:12 PM Gyula Fóra <gy...@gmail.com> wrote:

> In case you guys feel strongly about this I suggest you try to fork the
> autoscaler implementation and make a version that works with both the
> Kubernetes operator and YARN.
> If your solution is generic and works well, we can discuss the way forward.
>
> Unfortunately me or my team don't really have the resources to assist you
> with the YARN effort as we are mostly invested in Kubernetes but of course
> we are happy to review your work.
>
> Gyula
>
>
> On Fri, Feb 17, 2023 at 1:09 PM Prabhu Joseph <pr...@gmail.com>
> wrote:
>
> > @Gyula
> >
> > >> It is easier to make the operator work with jobs running in different
> > types of clusters than to take the
> > autoscaler module itself and plug that in somewhere else.
> >
> > Our (part of Samrat's team) main problem is to leverage the AutoScaler
> > Recommendation Engine part of Flink-Kubernetes-Operator for our Flink
> jobs
> > running on YARN.
> > Currently, it is not feasible as the autoscaler module is tightly coupled
> > with the operator. We agree that the operator serves the two core
> > requirements, but the operator itself
> > cannot be used for Flink jobs running on YARN. Those core requirements
> are
> > solved through other mechanisms in the case of YARN. But the main problem
> > for us is *how to*
> > *use the AutoScaler Recommendation Engine for Flink Jobs on YARN.*
> >
> >
> >
> >
> >
> >
> >
> >
> > On Fri, Feb 17, 2023 at 6:34 AM Shammon FY <zj...@gmail.com> wrote:
> >
> > > Hi Gyula, Samrat
> > >
> > > Thanks for your input and I totally agree with you that it's really big
> > > work. As @Samrat mentioned above, I think it's not a short way to make
> > the
> > > autoscaler completely independent too. But I still find some valuable
> > > points for the `completely independent autoscaler`, and I think this
> may
> > be
> > > the goal we need to achieve in the future.
> > >
> > > 1. A large k8s cluster may manage thousands of machines, and users may
> > run
> > > tens of thousands flink jobs in one k8s cluster. If the autoscaler
> > manages
> > > all these jobs, the autoscaler should be horizontal expansion.
> > >
> > > 2. As you mentioned, "execute the job stateful upgrades safely" is
> > indeed a
> > > complexity work, but I think we should decouple it from k8s operator
> > >
> > > a) In addition to k8s, there may be some other resource management
> > >
> > > b) Flink may support more scaler operations by REST API, such as
> FLIP-291
> > > [1]
> > >
> > > c) In our production environment, there's a 'Job Submission Gateway'
> > which
> > > stores job info and config, monitors the status of running jobs. After
> > the
> > > autoscaler upgrades the job, it must update the config in Gateway and
> > users
> > > can restart his job with the updated config to avoid resource conflict.
> > > Under these circumstances, the autoscaler sending upgrade requests to
> the
> > > gateway may be a good choice.
> > >
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > >
> > >
> > > Best,
> > > Shammon
> > >
> > >
> > > On Thu, Feb 16, 2023 at 11:03 PM Gyula Fóra <gy...@gmail.com>
> > wrote:
> > >
> > > > @Shammon , Samrat:
> > > >
> > > > I appreciate the enthusiasm and I wish this was only a matter of
> > > intention
> > > > but making the autoscaler work without the operator may be a pretty
> big
> > > > task.
> > > > You must not forget 2 core requirements here.
> > > >
> > > > 1. The autoscaler logic itself has to run somewhere (in this case on
> > k8s
> > > > within the operator)S
> > > > 2. Something has to execute the job stateful upgrades safely based on
> > the
> > > > scaling decisions (in this case the operator does that).
> > > >
> > > > 1. Can be solved almost anywhere easily however you need resiliency
> etc
> > > for
> > > > this to be a prod application, 2. is the really tricky part. The
> > operator
> > > > was actually built to execute job upgrades, if you look at the code
> you
> > > > will appreciate the complexity of the task.
> > > >
> > > > As I said in the earlier thread. It is easier to make the operator
> work
> > > > with jobs running in different types of clusters than to take the
> > > > autoscaler module itself and plug that in somewhere else.
> > > >
> > > > Gyula
> > > >
> > > >
> > > > On Thu, Feb 16, 2023 at 3:12 PM Samrat Deb <de...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi Shammon,
> > > > >
> > > > > Thank you for your input, completely aligned with you.
> > > > >
> > > > > We are fine with either of the options ,
> > > > >
> > > > > but IMO, to start with it will be easy to have it in the
> > > > > flink-kubernetes-operator as a module instead of a separate repo
> > which
> > > > > requires additional effort.
> > > > >
> > > > > Given that we would be incrementally working on making an
> autoscaling
> > > > > recommendation framework generic enough,
> > > > >
> > > > > Once it reaches a point where the community feels it needs to be
> > moved
> > > > to a
> > > > > separate repo we can take a call.
> > > > >
> > > > > Bests,
> > > > >
> > > > > Samrat
> > > > >
> > > > >
> > > > > On Thu, Feb 16, 2023 at 7:37 PM Samrat Deb <de...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi Max ,
> > > > > > If you are fine and aligned with the same thought , since this is
> > > going
> > > > > to
> > > > > > be very useful to us, we are ready to help / contribute
> additional
> > > work
> > > > > > required.
> > > > > >
> > > > > > Bests,
> > > > > > Samrat
> > > > > >
> > > > > >
> > > > > > On Thu, 16 Feb 2023 at 5:28 PM, Shammon FY <zj...@gmail.com>
> > > wrote:
> > > > > >
> > > > > >> Hi Samrat
> > > > > >>
> > > > > >> Do you mean to create an independent module for flink scaling in
> > > > > >> flink-k8s-operator? How about creating a project such as
> > > > > >> `flink-auto-scaling` which is completely independent? Besides
> > > resource
> > > > > >> managers such as k8s and yarn, we can do more things in the
> > project,
> > > > for
> > > > > >> example, updating config in the user's `job submission system`
> > after
> > > > > >> scaling flink jobs. WDYT?
> > > > > >>
> > > > > >> Best,
> > > > > >> Shammon
> > > > > >>
> > > > > >>
> > > > > >> On Thu, Feb 16, 2023 at 7:38 PM Maximilian Michels <
> > mxm@apache.org>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hi Samrat,
> > > > > >> >
> > > > > >> > The autoscaling module is now pluggable but it is still
> tightly
> > > > > >> > coupled with Kubernetes. It will take additional work for the
> > > logic
> > > > to
> > > > > >> > work independently of the cluster manager.
> > > > > >> >
> > > > > >> > -Max
> > > > > >> >
> > > > > >> > On Thu, Feb 16, 2023 at 11:14 AM Samrat Deb <
> > > decordeapex@gmail.com>
> > > > > >> wrote:
> > > > > >> > >
> > > > > >> > > Oh! yesterday it got merged.
> > > > > >> > > Apologies , I missed the recent commit @Gyula.
> > > > > >> > >
> > > > > >> > > Thanks for the update
> > > > > >> > >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > On Thu, Feb 16, 2023 at 3:17 PM Gyula Fóra <
> > > gyula.fora@gmail.com>
> > > > > >> wrote:
> > > > > >> > >
> > > > > >> > > > Max recently moved the autoscaler logic in a separate
> > > submodule,
> > > > > did
> > > > > >> > you
> > > > > >> > > > see that?
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://github.com/apache/flink-kubernetes-operator/commit/5bb8e9dc4dd29e10f3ba7c8ce7cefcdffbf92da4
> > > > > >> > > >
> > > > > >> > > > Gyula
> > > > > >> > > >
> > > > > >> > > > On Thu, Feb 16, 2023 at 10:27 AM Samrat Deb <
> > > > > decordeapex@gmail.com>
> > > > > >> > wrote:
> > > > > >> > > >
> > > > > >> > > > > Hi ,
> > > > > >> > > > >
> > > > > >> > > > > *Context:*
> > > > > >> > > > > Auto Scaling was introduced in Flink as part of
> > FLIP-271[1].
> > > > > >> > > > > It discusses one of the important aspects to provide a
> > > robust
> > > > > >> default
> > > > > >> > > > > scaling algorithm.
> > > > > >> > > > >       a. Ensure scaling yields effective usage of
> assigned
> > > > task
> > > > > >> > slots.
> > > > > >> > > > >       b. Ramp up in case of any backlog to ensure it
> gets
> > > > > >> processed
> > > > > >> > in a
> > > > > >> > > > > timely manner
> > > > > >> > > > >       c. Minimize the number of scaling decisions to
> > prevent
> > > > > >> costly
> > > > > >> > > > rescale
> > > > > >> > > > > operation
> > > > > >> > > > > The flip intends to add an auto scaling framework based
> > on 6
> > > > > major
> > > > > >> > > > metrics
> > > > > >> > > > > and contains different types of threshold to trigger the
> > > > > scaling.
> > > > > >> > > > >
> > > > > >> > > > > Thread[2] discusses a different problem: why autoscaler
> is
> > > > part
> > > > > of
> > > > > >> > the
> > > > > >> > > > > operator instead of jobmanager at runtime.
> > > > > >> > > > > The Community decided to keep the autoscaling logic in
> the
> > > > > >> > > > > flink-kubernetes-operator.
> > > > > >> > > > >
> > > > > >> > > > > *Proposal: *
> > > > > >> > > > > In this discussion, I want to put forward a thought of
> > > > > extracting
> > > > > >> > out the
> > > > > >> > > > > auto scaling logic into a new submodule in
> > > > > >> flink-kubernetes-operator
> > > > > >> > > > > repository[3],
> > > > > >> > > > > which will be independent of any resource
> > manager/Operator.
> > > > > >> > > > > Currently the Autoscaling algorithm is very tightly
> > coupled
> > > > with
> > > > > >> the
> > > > > >> > > > > kubernetes API.
> > > > > >> > > > > This makes the autoscaling core algorithm not so easily
> > > > > extensible
> > > > > >> > for
> > > > > >> > > > > different available resource managers like YARN, Mesos
> > etc.
> > > > > >> > > > > A Separate autoscaling module inside the flink
> kubernetes
> > > > > operator
> > > > > >> > will
> > > > > >> > > > > help other resource managers to leverage the autoscaling
> > > > logic.
> > > > > >> > > > >
> > > > > >> > > > > [1]
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
> > > > > >> > > > > [2]
> > > > > >>
> https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
> > > > > >> > > > > [3] https://github.com/apache/flink-kubernetes-operator
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > > Bests,
> > > > > >> > > > > Samrat
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Extract core autoscaling algorithm as new SubModule in flink-kubernetes-operator

Posted by Gyula Fóra <gy...@gmail.com>.

In case you guys feel strongly about this I suggest you try to fork the
autoscaler implementation and make a version that works with both the
Kubernetes operator and YARN.
If your solution is generic and works well, we can discuss the way forward.

Unfortunately me or my team don't really have the resources to assist you
with the YARN effort as we are mostly invested in Kubernetes but of course
we are happy to review your work.

Gyula


On Fri, Feb 17, 2023 at 1:09 PM Prabhu Joseph <pr...@gmail.com>
wrote:

> @Gyula
>
> >> It is easier to make the operator work with jobs running in different
> types of clusters than to take the
> autoscaler module itself and plug that in somewhere else.
>
> Our (part of Samrat's team) main problem is to leverage the AutoScaler
> Recommendation Engine part of Flink-Kubernetes-Operator for our Flink jobs
> running on YARN.
> Currently, it is not feasible as the autoscaler module is tightly coupled
> with the operator. We agree that the operator serves the two core
> requirements, but the operator itself
> cannot be used for Flink jobs running on YARN. Those core requirements are
> solved through other mechanisms in the case of YARN. But the main problem
> for us is *how to*
> *use the AutoScaler Recommendation Engine for Flink Jobs on YARN.*
>
>
>
>
>
>
>
>
> On Fri, Feb 17, 2023 at 6:34 AM Shammon FY <zj...@gmail.com> wrote:
>
> > Hi Gyula, Samrat
> >
> > Thanks for your input and I totally agree with you that it's really big
> > work. As @Samrat mentioned above, I think it's not a short way to make
> the
> > autoscaler completely independent too. But I still find some valuable
> > points for the `completely independent autoscaler`, and I think this may
> be
> > the goal we need to achieve in the future.
> >
> > 1. A large k8s cluster may manage thousands of machines, and users may
> run
> > tens of thousands flink jobs in one k8s cluster. If the autoscaler
> manages
> > all these jobs, the autoscaler should be horizontal expansion.
> >
> > 2. As you mentioned, "execute the job stateful upgrades safely" is
> indeed a
> > complexity work, but I think we should decouple it from k8s operator
> >
> > a) In addition to k8s, there may be some other resource management
> >
> > b) Flink may support more scaler operations by REST API, such as FLIP-291
> > [1]
> >
> > c) In our production environment, there's a 'Job Submission Gateway'
> which
> > stores job info and config, monitors the status of running jobs. After
> the
> > autoscaler upgrades the job, it must update the config in Gateway and
> users
> > can restart his job with the updated config to avoid resource conflict.
> > Under these circumstances, the autoscaler sending upgrade requests to the
> > gateway may be a good choice.
> >
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> >
> >
> > Best,
> > Shammon
> >
> >
> > On Thu, Feb 16, 2023 at 11:03 PM Gyula Fóra <gy...@gmail.com>
> wrote:
> >
> > > @Shammon , Samrat:
> > >
> > > I appreciate the enthusiasm and I wish this was only a matter of
> > intention
> > > but making the autoscaler work without the operator may be a pretty big
> > > task.
> > > You must not forget 2 core requirements here.
> > >
> > > 1. The autoscaler logic itself has to run somewhere (in this case on
> k8s
> > > within the operator)S
> > > 2. Something has to execute the job stateful upgrades safely based on
> the
> > > scaling decisions (in this case the operator does that).
> > >
> > > 1. Can be solved almost anywhere easily however you need resiliency etc
> > for
> > > this to be a prod application, 2. is the really tricky part. The
> operator
> > > was actually built to execute job upgrades, if you look at the code you
> > > will appreciate the complexity of the task.
> > >
> > > As I said in the earlier thread. It is easier to make the operator work
> > > with jobs running in different types of clusters than to take the
> > > autoscaler module itself and plug that in somewhere else.
> > >
> > > Gyula
> > >
> > >
> > > On Thu, Feb 16, 2023 at 3:12 PM Samrat Deb <de...@gmail.com>
> > wrote:
> > >
> > > > Hi Shammon,
> > > >
> > > > Thank you for your input, completely aligned with you.
> > > >
> > > > We are fine with either of the options ,
> > > >
> > > > but IMO, to start with it will be easy to have it in the
> > > > flink-kubernetes-operator as a module instead of a separate repo
> which
> > > > requires additional effort.
> > > >
> > > > Given that we would be incrementally working on making an autoscaling
> > > > recommendation framework generic enough,
> > > >
> > > > Once it reaches a point where the community feels it needs to be
> moved
> > > to a
> > > > separate repo we can take a call.
> > > >
> > > > Bests,
> > > >
> > > > Samrat
> > > >
> > > >
> > > > On Thu, Feb 16, 2023 at 7:37 PM Samrat Deb <de...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi Max ,
> > > > > If you are fine and aligned with the same thought , since this is
> > going
> > > > to
> > > > > be very useful to us, we are ready to help / contribute additional
> > work
> > > > > required.
> > > > >
> > > > > Bests,
> > > > > Samrat
> > > > >
> > > > >
> > > > > On Thu, 16 Feb 2023 at 5:28 PM, Shammon FY <zj...@gmail.com>
> > wrote:
> > > > >
> > > > >> Hi Samrat
> > > > >>
> > > > >> Do you mean to create an independent module for flink scaling in
> > > > >> flink-k8s-operator? How about creating a project such as
> > > > >> `flink-auto-scaling` which is completely independent? Besides
> > resource
> > > > >> managers such as k8s and yarn, we can do more things in the
> project,
> > > for
> > > > >> example, updating config in the user's `job submission system`
> after
> > > > >> scaling flink jobs. WDYT?
> > > > >>
> > > > >> Best,
> > > > >> Shammon
> > > > >>
> > > > >>
> > > > >> On Thu, Feb 16, 2023 at 7:38 PM Maximilian Michels <
> mxm@apache.org>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi Samrat,
> > > > >> >
> > > > >> > The autoscaling module is now pluggable but it is still tightly
> > > > >> > coupled with Kubernetes. It will take additional work for the
> > logic
> > > to
> > > > >> > work independently of the cluster manager.
> > > > >> >
> > > > >> > -Max
> > > > >> >
> > > > >> > On Thu, Feb 16, 2023 at 11:14 AM Samrat Deb <
> > decordeapex@gmail.com>
> > > > >> wrote:
> > > > >> > >
> > > > >> > > Oh! yesterday it got merged.
> > > > >> > > Apologies , I missed the recent commit @Gyula.
> > > > >> > >
> > > > >> > > Thanks for the update
> > > > >> > >
> > > > >> > >
> > > > >> > >
> > > > >> > > On Thu, Feb 16, 2023 at 3:17 PM Gyula Fóra <
> > gyula.fora@gmail.com>
> > > > >> wrote:
> > > > >> > >
> > > > >> > > > Max recently moved the autoscaler logic in a separate
> > submodule,
> > > > did
> > > > >> > you
> > > > >> > > > see that?
> > > > >> > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/flink-kubernetes-operator/commit/5bb8e9dc4dd29e10f3ba7c8ce7cefcdffbf92da4
> > > > >> > > >
> > > > >> > > > Gyula
> > > > >> > > >
> > > > >> > > > On Thu, Feb 16, 2023 at 10:27 AM Samrat Deb <
> > > > decordeapex@gmail.com>
> > > > >> > wrote:
> > > > >> > > >
> > > > >> > > > > Hi ,
> > > > >> > > > >
> > > > >> > > > > *Context:*
> > > > >> > > > > Auto Scaling was introduced in Flink as part of
> FLIP-271[1].
> > > > >> > > > > It discusses one of the important aspects to provide a
> > robust
> > > > >> default
> > > > >> > > > > scaling algorithm.
> > > > >> > > > >       a. Ensure scaling yields effective usage of assigned
> > > task
> > > > >> > slots.
> > > > >> > > > >       b. Ramp up in case of any backlog to ensure it gets
> > > > >> processed
> > > > >> > in a
> > > > >> > > > > timely manner
> > > > >> > > > >       c. Minimize the number of scaling decisions to
> prevent
> > > > >> costly
> > > > >> > > > rescale
> > > > >> > > > > operation
> > > > >> > > > > The flip intends to add an auto scaling framework based
> on 6
> > > > major
> > > > >> > > > metrics
> > > > >> > > > > and contains different types of threshold to trigger the
> > > > scaling.
> > > > >> > > > >
> > > > >> > > > > Thread[2] discusses a different problem: why autoscaler is
> > > part
> > > > of
> > > > >> > the
> > > > >> > > > > operator instead of jobmanager at runtime.
> > > > >> > > > > The Community decided to keep the autoscaling logic in the
> > > > >> > > > > flink-kubernetes-operator.
> > > > >> > > > >
> > > > >> > > > > *Proposal: *
> > > > >> > > > > In this discussion, I want to put forward a thought of
> > > > extracting
> > > > >> > out the
> > > > >> > > > > auto scaling logic into a new submodule in
> > > > >> flink-kubernetes-operator
> > > > >> > > > > repository[3],
> > > > >> > > > > which will be independent of any resource
> manager/Operator.
> > > > >> > > > > Currently the Autoscaling algorithm is very tightly
> coupled
> > > with
> > > > >> the
> > > > >> > > > > kubernetes API.
> > > > >> > > > > This makes the autoscaling core algorithm not so easily
> > > > extensible
> > > > >> > for
> > > > >> > > > > different available resource managers like YARN, Mesos
> etc.
> > > > >> > > > > A Separate autoscaling module inside the flink kubernetes
> > > > operator
> > > > >> > will
> > > > >> > > > > help other resource managers to leverage the autoscaling
> > > logic.
> > > > >> > > > >
> > > > >> > > > > [1]
> > > > >> > > > >
> > > > >> > > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
> > > > >> > > > > [2]
> > > > >> https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
> > > > >> > > > > [3] https://github.com/apache/flink-kubernetes-operator
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > > Bests,
> > > > >> > > > > Samrat
> > > > >> > > > >
> > > > >> > > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Extract core autoscaling algorithm as new SubModule in flink-kubernetes-operator

Posted by Prabhu Joseph <pr...@gmail.com>.

@Gyula

>> It is easier to make the operator work with jobs running in different
types of clusters than to take the
autoscaler module itself and plug that in somewhere else.

Our (part of Samrat's team) main problem is to leverage the AutoScaler
Recommendation Engine part of Flink-Kubernetes-Operator for our Flink jobs
running on YARN.
Currently, it is not feasible as the autoscaler module is tightly coupled
with the operator. We agree that the operator serves the two core
requirements, but the operator itself
cannot be used for Flink jobs running on YARN. Those core requirements are
solved through other mechanisms in the case of YARN. But the main problem
for us is *how to*
*use the AutoScaler Recommendation Engine for Flink Jobs on YARN.*








On Fri, Feb 17, 2023 at 6:34 AM Shammon FY <zj...@gmail.com> wrote:

> Hi Gyula, Samrat
>
> Thanks for your input and I totally agree with you that it's really big
> work. As @Samrat mentioned above, I think it's not a short way to make the
> autoscaler completely independent too. But I still find some valuable
> points for the `completely independent autoscaler`, and I think this may be
> the goal we need to achieve in the future.
>
> 1. A large k8s cluster may manage thousands of machines, and users may run
> tens of thousands flink jobs in one k8s cluster. If the autoscaler manages
> all these jobs, the autoscaler should be horizontal expansion.
>
> 2. As you mentioned, "execute the job stateful upgrades safely" is indeed a
> complexity work, but I think we should decouple it from k8s operator
>
> a) In addition to k8s, there may be some other resource management
>
> b) Flink may support more scaler operations by REST API, such as FLIP-291
> [1]
>
> c) In our production environment, there's a 'Job Submission Gateway' which
> stores job info and config, monitors the status of running jobs. After the
> autoscaler upgrades the job, it must update the config in Gateway and users
> can restart his job with the updated config to avoid resource conflict.
> Under these circumstances, the autoscaler sending upgrade requests to the
> gateway may be a good choice.
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
>
>
> Best,
> Shammon
>
>
> On Thu, Feb 16, 2023 at 11:03 PM Gyula Fóra <gy...@gmail.com> wrote:
>
> > @Shammon , Samrat:
> >
> > I appreciate the enthusiasm and I wish this was only a matter of
> intention
> > but making the autoscaler work without the operator may be a pretty big
> > task.
> > You must not forget 2 core requirements here.
> >
> > 1. The autoscaler logic itself has to run somewhere (in this case on k8s
> > within the operator)S
> > 2. Something has to execute the job stateful upgrades safely based on the
> > scaling decisions (in this case the operator does that).
> >
> > 1. Can be solved almost anywhere easily however you need resiliency etc
> for
> > this to be a prod application, 2. is the really tricky part. The operator
> > was actually built to execute job upgrades, if you look at the code you
> > will appreciate the complexity of the task.
> >
> > As I said in the earlier thread. It is easier to make the operator work
> > with jobs running in different types of clusters than to take the
> > autoscaler module itself and plug that in somewhere else.
> >
> > Gyula
> >
> >
> > On Thu, Feb 16, 2023 at 3:12 PM Samrat Deb <de...@gmail.com>
> wrote:
> >
> > > Hi Shammon,
> > >
> > > Thank you for your input, completely aligned with you.
> > >
> > > We are fine with either of the options ,
> > >
> > > but IMO, to start with it will be easy to have it in the
> > > flink-kubernetes-operator as a module instead of a separate repo which
> > > requires additional effort.
> > >
> > > Given that we would be incrementally working on making an autoscaling
> > > recommendation framework generic enough,
> > >
> > > Once it reaches a point where the community feels it needs to be moved
> > to a
> > > separate repo we can take a call.
> > >
> > > Bests,
> > >
> > > Samrat
> > >
> > >
> > > On Thu, Feb 16, 2023 at 7:37 PM Samrat Deb <de...@gmail.com>
> > wrote:
> > >
> > > > Hi Max ,
> > > > If you are fine and aligned with the same thought , since this is
> going
> > > to
> > > > be very useful to us, we are ready to help / contribute additional
> work
> > > > required.
> > > >
> > > > Bests,
> > > > Samrat
> > > >
> > > >
> > > > On Thu, 16 Feb 2023 at 5:28 PM, Shammon FY <zj...@gmail.com>
> wrote:
> > > >
> > > >> Hi Samrat
> > > >>
> > > >> Do you mean to create an independent module for flink scaling in
> > > >> flink-k8s-operator? How about creating a project such as
> > > >> `flink-auto-scaling` which is completely independent? Besides
> resource
> > > >> managers such as k8s and yarn, we can do more things in the project,
> > for
> > > >> example, updating config in the user's `job submission system` after
> > > >> scaling flink jobs. WDYT?
> > > >>
> > > >> Best,
> > > >> Shammon
> > > >>
> > > >>
> > > >> On Thu, Feb 16, 2023 at 7:38 PM Maximilian Michels <mx...@apache.org>
> > > >> wrote:
> > > >>
> > > >> > Hi Samrat,
> > > >> >
> > > >> > The autoscaling module is now pluggable but it is still tightly
> > > >> > coupled with Kubernetes. It will take additional work for the
> logic
> > to
> > > >> > work independently of the cluster manager.
> > > >> >
> > > >> > -Max
> > > >> >
> > > >> > On Thu, Feb 16, 2023 at 11:14 AM Samrat Deb <
> decordeapex@gmail.com>
> > > >> wrote:
> > > >> > >
> > > >> > > Oh! yesterday it got merged.
> > > >> > > Apologies , I missed the recent commit @Gyula.
> > > >> > >
> > > >> > > Thanks for the update
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > On Thu, Feb 16, 2023 at 3:17 PM Gyula Fóra <
> gyula.fora@gmail.com>
> > > >> wrote:
> > > >> > >
> > > >> > > > Max recently moved the autoscaler logic in a separate
> submodule,
> > > did
> > > >> > you
> > > >> > > > see that?
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> >
> > > >>
> > >
> >
> https://github.com/apache/flink-kubernetes-operator/commit/5bb8e9dc4dd29e10f3ba7c8ce7cefcdffbf92da4
> > > >> > > >
> > > >> > > > Gyula
> > > >> > > >
> > > >> > > > On Thu, Feb 16, 2023 at 10:27 AM Samrat Deb <
> > > decordeapex@gmail.com>
> > > >> > wrote:
> > > >> > > >
> > > >> > > > > Hi ,
> > > >> > > > >
> > > >> > > > > *Context:*
> > > >> > > > > Auto Scaling was introduced in Flink as part of FLIP-271[1].
> > > >> > > > > It discusses one of the important aspects to provide a
> robust
> > > >> default
> > > >> > > > > scaling algorithm.
> > > >> > > > >       a. Ensure scaling yields effective usage of assigned
> > task
> > > >> > slots.
> > > >> > > > >       b. Ramp up in case of any backlog to ensure it gets
> > > >> processed
> > > >> > in a
> > > >> > > > > timely manner
> > > >> > > > >       c. Minimize the number of scaling decisions to prevent
> > > >> costly
> > > >> > > > rescale
> > > >> > > > > operation
> > > >> > > > > The flip intends to add an auto scaling framework based on 6
> > > major
> > > >> > > > metrics
> > > >> > > > > and contains different types of threshold to trigger the
> > > scaling.
> > > >> > > > >
> > > >> > > > > Thread[2] discusses a different problem: why autoscaler is
> > part
> > > of
> > > >> > the
> > > >> > > > > operator instead of jobmanager at runtime.
> > > >> > > > > The Community decided to keep the autoscaling logic in the
> > > >> > > > > flink-kubernetes-operator.
> > > >> > > > >
> > > >> > > > > *Proposal: *
> > > >> > > > > In this discussion, I want to put forward a thought of
> > > extracting
> > > >> > out the
> > > >> > > > > auto scaling logic into a new submodule in
> > > >> flink-kubernetes-operator
> > > >> > > > > repository[3],
> > > >> > > > > which will be independent of any resource manager/Operator.
> > > >> > > > > Currently the Autoscaling algorithm is very tightly coupled
> > with
> > > >> the
> > > >> > > > > kubernetes API.
> > > >> > > > > This makes the autoscaling core algorithm not so easily
> > > extensible
> > > >> > for
> > > >> > > > > different available resource managers like YARN, Mesos etc.
> > > >> > > > > A Separate autoscaling module inside the flink kubernetes
> > > operator
> > > >> > will
> > > >> > > > > help other resource managers to leverage the autoscaling
> > logic.
> > > >> > > > >
> > > >> > > > > [1]
> > > >> > > > >
> > > >> > > >
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
> > > >> > > > > [2]
> > > >> https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
> > > >> > > > > [3] https://github.com/apache/flink-kubernetes-operator
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > Bests,
> > > >> > > > > Samrat
> > > >> > > > >
> > > >> > > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] Extract core autoscaling algorithm as new SubModule in flink-kubernetes-operator

Posted by Shammon FY <zj...@gmail.com>.

Hi Gyula, Samrat

Thanks for your input and I totally agree with you that it's really big
work. As @Samrat mentioned above, I think it's not a short way to make the
autoscaler completely independent too. But I still find some valuable
points for the `completely independent autoscaler`, and I think this may be
the goal we need to achieve in the future.

1. A large k8s cluster may manage thousands of machines, and users may run
tens of thousands flink jobs in one k8s cluster. If the autoscaler manages
all these jobs, the autoscaler should be horizontal expansion.

2. As you mentioned, "execute the job stateful upgrades safely" is indeed a
complexity work, but I think we should decouple it from k8s operator

a) In addition to k8s, there may be some other resource management

b) Flink may support more scaler operations by REST API, such as FLIP-291
[1]

c) In our production environment, there's a 'Job Submission Gateway' which
stores job info and config, monitors the status of running jobs. After the
autoscaler upgrades the job, it must update the config in Gateway and users
can restart his job with the updated config to avoid resource conflict.
Under these circumstances, the autoscaler sending upgrade requests to the
gateway may be a good choice.


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management


Best,
Shammon


On Thu, Feb 16, 2023 at 11:03 PM Gyula Fóra <gy...@gmail.com> wrote:

> @Shammon , Samrat:
>
> I appreciate the enthusiasm and I wish this was only a matter of intention
> but making the autoscaler work without the operator may be a pretty big
> task.
> You must not forget 2 core requirements here.
>
> 1. The autoscaler logic itself has to run somewhere (in this case on k8s
> within the operator)S
> 2. Something has to execute the job stateful upgrades safely based on the
> scaling decisions (in this case the operator does that).
>
> 1. Can be solved almost anywhere easily however you need resiliency etc for
> this to be a prod application, 2. is the really tricky part. The operator
> was actually built to execute job upgrades, if you look at the code you
> will appreciate the complexity of the task.
>
> As I said in the earlier thread. It is easier to make the operator work
> with jobs running in different types of clusters than to take the
> autoscaler module itself and plug that in somewhere else.
>
> Gyula
>
>
> On Thu, Feb 16, 2023 at 3:12 PM Samrat Deb <de...@gmail.com> wrote:
>
> > Hi Shammon,
> >
> > Thank you for your input, completely aligned with you.
> >
> > We are fine with either of the options ,
> >
> > but IMO, to start with it will be easy to have it in the
> > flink-kubernetes-operator as a module instead of a separate repo which
> > requires additional effort.
> >
> > Given that we would be incrementally working on making an autoscaling
> > recommendation framework generic enough,
> >
> > Once it reaches a point where the community feels it needs to be moved
> to a
> > separate repo we can take a call.
> >
> > Bests,
> >
> > Samrat
> >
> >
> > On Thu, Feb 16, 2023 at 7:37 PM Samrat Deb <de...@gmail.com>
> wrote:
> >
> > > Hi Max ,
> > > If you are fine and aligned with the same thought , since this is going
> > to
> > > be very useful to us, we are ready to help / contribute additional work
> > > required.
> > >
> > > Bests,
> > > Samrat
> > >
> > >
> > > On Thu, 16 Feb 2023 at 5:28 PM, Shammon FY <zj...@gmail.com> wrote:
> > >
> > >> Hi Samrat
> > >>
> > >> Do you mean to create an independent module for flink scaling in
> > >> flink-k8s-operator? How about creating a project such as
> > >> `flink-auto-scaling` which is completely independent? Besides resource
> > >> managers such as k8s and yarn, we can do more things in the project,
> for
> > >> example, updating config in the user's `job submission system` after
> > >> scaling flink jobs. WDYT?
> > >>
> > >> Best,
> > >> Shammon
> > >>
> > >>
> > >> On Thu, Feb 16, 2023 at 7:38 PM Maximilian Michels <mx...@apache.org>
> > >> wrote:
> > >>
> > >> > Hi Samrat,
> > >> >
> > >> > The autoscaling module is now pluggable but it is still tightly
> > >> > coupled with Kubernetes. It will take additional work for the logic
> to
> > >> > work independently of the cluster manager.
> > >> >
> > >> > -Max
> > >> >
> > >> > On Thu, Feb 16, 2023 at 11:14 AM Samrat Deb <de...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > > Oh! yesterday it got merged.
> > >> > > Apologies , I missed the recent commit @Gyula.
> > >> > >
> > >> > > Thanks for the update
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Thu, Feb 16, 2023 at 3:17 PM Gyula Fóra <gy...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > > > Max recently moved the autoscaler logic in a separate submodule,
> > did
> > >> > you
> > >> > > > see that?
> > >> > > >
> > >> > > >
> > >> > > >
> > >> >
> > >>
> >
> https://github.com/apache/flink-kubernetes-operator/commit/5bb8e9dc4dd29e10f3ba7c8ce7cefcdffbf92da4
> > >> > > >
> > >> > > > Gyula
> > >> > > >
> > >> > > > On Thu, Feb 16, 2023 at 10:27 AM Samrat Deb <
> > decordeapex@gmail.com>
> > >> > wrote:
> > >> > > >
> > >> > > > > Hi ,
> > >> > > > >
> > >> > > > > *Context:*
> > >> > > > > Auto Scaling was introduced in Flink as part of FLIP-271[1].
> > >> > > > > It discusses one of the important aspects to provide a robust
> > >> default
> > >> > > > > scaling algorithm.
> > >> > > > >       a. Ensure scaling yields effective usage of assigned
> task
> > >> > slots.
> > >> > > > >       b. Ramp up in case of any backlog to ensure it gets
> > >> processed
> > >> > in a
> > >> > > > > timely manner
> > >> > > > >       c. Minimize the number of scaling decisions to prevent
> > >> costly
> > >> > > > rescale
> > >> > > > > operation
> > >> > > > > The flip intends to add an auto scaling framework based on 6
> > major
> > >> > > > metrics
> > >> > > > > and contains different types of threshold to trigger the
> > scaling.
> > >> > > > >
> > >> > > > > Thread[2] discusses a different problem: why autoscaler is
> part
> > of
> > >> > the
> > >> > > > > operator instead of jobmanager at runtime.
> > >> > > > > The Community decided to keep the autoscaling logic in the
> > >> > > > > flink-kubernetes-operator.
> > >> > > > >
> > >> > > > > *Proposal: *
> > >> > > > > In this discussion, I want to put forward a thought of
> > extracting
> > >> > out the
> > >> > > > > auto scaling logic into a new submodule in
> > >> flink-kubernetes-operator
> > >> > > > > repository[3],
> > >> > > > > which will be independent of any resource manager/Operator.
> > >> > > > > Currently the Autoscaling algorithm is very tightly coupled
> with
> > >> the
> > >> > > > > kubernetes API.
> > >> > > > > This makes the autoscaling core algorithm not so easily
> > extensible
> > >> > for
> > >> > > > > different available resource managers like YARN, Mesos etc.
> > >> > > > > A Separate autoscaling module inside the flink kubernetes
> > operator
> > >> > will
> > >> > > > > help other resource managers to leverage the autoscaling
> logic.
> > >> > > > >
> > >> > > > > [1]
> > >> > > > >
> > >> > > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
> > >> > > > > [2]
> > >> https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
> > >> > > > > [3] https://github.com/apache/flink-kubernetes-operator
> > >> > > > >
> > >> > > > >
> > >> > > > > Bests,
> > >> > > > > Samrat
> > >> > > > >
> > >> > > >
> > >> >
> > >>
> > >
> >
>

Re: [DISCUSS] Extract core autoscaling algorithm as new SubModule in flink-kubernetes-operator

Posted by Gyula Fóra <gy...@gmail.com>.

@Shammon , Samrat:

I appreciate the enthusiasm and I wish this was only a matter of intention
but making the autoscaler work without the operator may be a pretty big
task.
You must not forget 2 core requirements here.

1. The autoscaler logic itself has to run somewhere (in this case on k8s
within the operator)S
2. Something has to execute the job stateful upgrades safely based on the
scaling decisions (in this case the operator does that).

1. Can be solved almost anywhere easily however you need resiliency etc for
this to be a prod application, 2. is the really tricky part. The operator
was actually built to execute job upgrades, if you look at the code you
will appreciate the complexity of the task.

As I said in the earlier thread. It is easier to make the operator work
with jobs running in different types of clusters than to take the
autoscaler module itself and plug that in somewhere else.

Gyula


On Thu, Feb 16, 2023 at 3:12 PM Samrat Deb <de...@gmail.com> wrote:

> Hi Shammon,
>
> Thank you for your input, completely aligned with you.
>
> We are fine with either of the options ,
>
> but IMO, to start with it will be easy to have it in the
> flink-kubernetes-operator as a module instead of a separate repo which
> requires additional effort.
>
> Given that we would be incrementally working on making an autoscaling
> recommendation framework generic enough,
>
> Once it reaches a point where the community feels it needs to be moved to a
> separate repo we can take a call.
>
> Bests,
>
> Samrat
>
>
> On Thu, Feb 16, 2023 at 7:37 PM Samrat Deb <de...@gmail.com> wrote:
>
> > Hi Max ,
> > If you are fine and aligned with the same thought , since this is going
> to
> > be very useful to us, we are ready to help / contribute additional work
> > required.
> >
> > Bests,
> > Samrat
> >
> >
> > On Thu, 16 Feb 2023 at 5:28 PM, Shammon FY <zj...@gmail.com> wrote:
> >
> >> Hi Samrat
> >>
> >> Do you mean to create an independent module for flink scaling in
> >> flink-k8s-operator? How about creating a project such as
> >> `flink-auto-scaling` which is completely independent? Besides resource
> >> managers such as k8s and yarn, we can do more things in the project, for
> >> example, updating config in the user's `job submission system` after
> >> scaling flink jobs. WDYT?
> >>
> >> Best,
> >> Shammon
> >>
> >>
> >> On Thu, Feb 16, 2023 at 7:38 PM Maximilian Michels <mx...@apache.org>
> >> wrote:
> >>
> >> > Hi Samrat,
> >> >
> >> > The autoscaling module is now pluggable but it is still tightly
> >> > coupled with Kubernetes. It will take additional work for the logic to
> >> > work independently of the cluster manager.
> >> >
> >> > -Max
> >> >
> >> > On Thu, Feb 16, 2023 at 11:14 AM Samrat Deb <de...@gmail.com>
> >> wrote:
> >> > >
> >> > > Oh! yesterday it got merged.
> >> > > Apologies , I missed the recent commit @Gyula.
> >> > >
> >> > > Thanks for the update
> >> > >
> >> > >
> >> > >
> >> > > On Thu, Feb 16, 2023 at 3:17 PM Gyula Fóra <gy...@gmail.com>
> >> wrote:
> >> > >
> >> > > > Max recently moved the autoscaler logic in a separate submodule,
> did
> >> > you
> >> > > > see that?
> >> > > >
> >> > > >
> >> > > >
> >> >
> >>
> https://github.com/apache/flink-kubernetes-operator/commit/5bb8e9dc4dd29e10f3ba7c8ce7cefcdffbf92da4
> >> > > >
> >> > > > Gyula
> >> > > >
> >> > > > On Thu, Feb 16, 2023 at 10:27 AM Samrat Deb <
> decordeapex@gmail.com>
> >> > wrote:
> >> > > >
> >> > > > > Hi ,
> >> > > > >
> >> > > > > *Context:*
> >> > > > > Auto Scaling was introduced in Flink as part of FLIP-271[1].
> >> > > > > It discusses one of the important aspects to provide a robust
> >> default
> >> > > > > scaling algorithm.
> >> > > > >       a. Ensure scaling yields effective usage of assigned task
> >> > slots.
> >> > > > >       b. Ramp up in case of any backlog to ensure it gets
> >> processed
> >> > in a
> >> > > > > timely manner
> >> > > > >       c. Minimize the number of scaling decisions to prevent
> >> costly
> >> > > > rescale
> >> > > > > operation
> >> > > > > The flip intends to add an auto scaling framework based on 6
> major
> >> > > > metrics
> >> > > > > and contains different types of threshold to trigger the
> scaling.
> >> > > > >
> >> > > > > Thread[2] discusses a different problem: why autoscaler is part
> of
> >> > the
> >> > > > > operator instead of jobmanager at runtime.
> >> > > > > The Community decided to keep the autoscaling logic in the
> >> > > > > flink-kubernetes-operator.
> >> > > > >
> >> > > > > *Proposal: *
> >> > > > > In this discussion, I want to put forward a thought of
> extracting
> >> > out the
> >> > > > > auto scaling logic into a new submodule in
> >> flink-kubernetes-operator
> >> > > > > repository[3],
> >> > > > > which will be independent of any resource manager/Operator.
> >> > > > > Currently the Autoscaling algorithm is very tightly coupled with
> >> the
> >> > > > > kubernetes API.
> >> > > > > This makes the autoscaling core algorithm not so easily
> extensible
> >> > for
> >> > > > > different available resource managers like YARN, Mesos etc.
> >> > > > > A Separate autoscaling module inside the flink kubernetes
> operator
> >> > will
> >> > > > > help other resource managers to leverage the autoscaling logic.
> >> > > > >
> >> > > > > [1]
> >> > > > >
> >> > > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
> >> > > > > [2]
> >> https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
> >> > > > > [3] https://github.com/apache/flink-kubernetes-operator
> >> > > > >
> >> > > > >
> >> > > > > Bests,
> >> > > > > Samrat
> >> > > > >
> >> > > >
> >> >
> >>
> >
>

Re: [DISCUSS] Extract core autoscaling algorithm as new SubModule in flink-kubernetes-operator

Posted by Samrat Deb <de...@gmail.com>.

Hi Shammon,

Thank you for your input, completely aligned with you.

We are fine with either of the options ,

but IMO, to start with it will be easy to have it in the
flink-kubernetes-operator as a module instead of a separate repo which
requires additional effort.

Given that we would be incrementally working on making an autoscaling
recommendation framework generic enough,

Once it reaches a point where the community feels it needs to be moved to a
separate repo we can take a call.

Bests,

Samrat


On Thu, Feb 16, 2023 at 7:37 PM Samrat Deb <de...@gmail.com> wrote:

> Hi Max ,
> If you are fine and aligned with the same thought , since this is going to
> be very useful to us, we are ready to help / contribute additional work
> required.
>
> Bests,
> Samrat
>
>
> On Thu, 16 Feb 2023 at 5:28 PM, Shammon FY <zj...@gmail.com> wrote:
>
>> Hi Samrat
>>
>> Do you mean to create an independent module for flink scaling in
>> flink-k8s-operator? How about creating a project such as
>> `flink-auto-scaling` which is completely independent? Besides resource
>> managers such as k8s and yarn, we can do more things in the project, for
>> example, updating config in the user's `job submission system` after
>> scaling flink jobs. WDYT?
>>
>> Best,
>> Shammon
>>
>>
>> On Thu, Feb 16, 2023 at 7:38 PM Maximilian Michels <mx...@apache.org>
>> wrote:
>>
>> > Hi Samrat,
>> >
>> > The autoscaling module is now pluggable but it is still tightly
>> > coupled with Kubernetes. It will take additional work for the logic to
>> > work independently of the cluster manager.
>> >
>> > -Max
>> >
>> > On Thu, Feb 16, 2023 at 11:14 AM Samrat Deb <de...@gmail.com>
>> wrote:
>> > >
>> > > Oh! yesterday it got merged.
>> > > Apologies , I missed the recent commit @Gyula.
>> > >
>> > > Thanks for the update
>> > >
>> > >
>> > >
>> > > On Thu, Feb 16, 2023 at 3:17 PM Gyula Fóra <gy...@gmail.com>
>> wrote:
>> > >
>> > > > Max recently moved the autoscaler logic in a separate submodule, did
>> > you
>> > > > see that?
>> > > >
>> > > >
>> > > >
>> >
>> https://github.com/apache/flink-kubernetes-operator/commit/5bb8e9dc4dd29e10f3ba7c8ce7cefcdffbf92da4
>> > > >
>> > > > Gyula
>> > > >
>> > > > On Thu, Feb 16, 2023 at 10:27 AM Samrat Deb <de...@gmail.com>
>> > wrote:
>> > > >
>> > > > > Hi ,
>> > > > >
>> > > > > *Context:*
>> > > > > Auto Scaling was introduced in Flink as part of FLIP-271[1].
>> > > > > It discusses one of the important aspects to provide a robust
>> default
>> > > > > scaling algorithm.
>> > > > >       a. Ensure scaling yields effective usage of assigned task
>> > slots.
>> > > > >       b. Ramp up in case of any backlog to ensure it gets
>> processed
>> > in a
>> > > > > timely manner
>> > > > >       c. Minimize the number of scaling decisions to prevent
>> costly
>> > > > rescale
>> > > > > operation
>> > > > > The flip intends to add an auto scaling framework based on 6 major
>> > > > metrics
>> > > > > and contains different types of threshold to trigger the scaling.
>> > > > >
>> > > > > Thread[2] discusses a different problem: why autoscaler is part of
>> > the
>> > > > > operator instead of jobmanager at runtime.
>> > > > > The Community decided to keep the autoscaling logic in the
>> > > > > flink-kubernetes-operator.
>> > > > >
>> > > > > *Proposal: *
>> > > > > In this discussion, I want to put forward a thought of extracting
>> > out the
>> > > > > auto scaling logic into a new submodule in
>> flink-kubernetes-operator
>> > > > > repository[3],
>> > > > > which will be independent of any resource manager/Operator.
>> > > > > Currently the Autoscaling algorithm is very tightly coupled with
>> the
>> > > > > kubernetes API.
>> > > > > This makes the autoscaling core algorithm not so easily extensible
>> > for
>> > > > > different available resource managers like YARN, Mesos etc.
>> > > > > A Separate autoscaling module inside the flink kubernetes operator
>> > will
>> > > > > help other resource managers to leverage the autoscaling logic.
>> > > > >
>> > > > > [1]
>> > > > >
>> > > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
>> > > > > [2]
>> https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
>> > > > > [3] https://github.com/apache/flink-kubernetes-operator
>> > > > >
>> > > > >
>> > > > > Bests,
>> > > > > Samrat
>> > > > >
>> > > >
>> >
>>
>

Re: [DISCUSS] Extract core autoscaling algorithm as new SubModule in flink-kubernetes-operator

Posted by Samrat Deb <de...@gmail.com>.

Hi Max ,
If you are fine and aligned with the same thought , since this is going to
be very useful to us, we are ready to help / contribute additional work
required.

Bests,
Samrat


On Thu, 16 Feb 2023 at 5:28 PM, Shammon FY <zj...@gmail.com> wrote:

> Hi Samrat
>
> Do you mean to create an independent module for flink scaling in
> flink-k8s-operator? How about creating a project such as
> `flink-auto-scaling` which is completely independent? Besides resource
> managers such as k8s and yarn, we can do more things in the project, for
> example, updating config in the user's `job submission system` after
> scaling flink jobs. WDYT?
>
> Best,
> Shammon
>
>
> On Thu, Feb 16, 2023 at 7:38 PM Maximilian Michels <mx...@apache.org> wrote:
>
> > Hi Samrat,
> >
> > The autoscaling module is now pluggable but it is still tightly
> > coupled with Kubernetes. It will take additional work for the logic to
> > work independently of the cluster manager.
> >
> > -Max
> >
> > On Thu, Feb 16, 2023 at 11:14 AM Samrat Deb <de...@gmail.com>
> wrote:
> > >
> > > Oh! yesterday it got merged.
> > > Apologies , I missed the recent commit @Gyula.
> > >
> > > Thanks for the update
> > >
> > >
> > >
> > > On Thu, Feb 16, 2023 at 3:17 PM Gyula Fóra <gy...@gmail.com>
> wrote:
> > >
> > > > Max recently moved the autoscaler logic in a separate submodule, did
> > you
> > > > see that?
> > > >
> > > >
> > > >
> >
> https://github.com/apache/flink-kubernetes-operator/commit/5bb8e9dc4dd29e10f3ba7c8ce7cefcdffbf92da4
> > > >
> > > > Gyula
> > > >
> > > > On Thu, Feb 16, 2023 at 10:27 AM Samrat Deb <de...@gmail.com>
> > wrote:
> > > >
> > > > > Hi ,
> > > > >
> > > > > *Context:*
> > > > > Auto Scaling was introduced in Flink as part of FLIP-271[1].
> > > > > It discusses one of the important aspects to provide a robust
> default
> > > > > scaling algorithm.
> > > > >       a. Ensure scaling yields effective usage of assigned task
> > slots.
> > > > >       b. Ramp up in case of any backlog to ensure it gets processed
> > in a
> > > > > timely manner
> > > > >       c. Minimize the number of scaling decisions to prevent costly
> > > > rescale
> > > > > operation
> > > > > The flip intends to add an auto scaling framework based on 6 major
> > > > metrics
> > > > > and contains different types of threshold to trigger the scaling.
> > > > >
> > > > > Thread[2] discusses a different problem: why autoscaler is part of
> > the
> > > > > operator instead of jobmanager at runtime.
> > > > > The Community decided to keep the autoscaling logic in the
> > > > > flink-kubernetes-operator.
> > > > >
> > > > > *Proposal: *
> > > > > In this discussion, I want to put forward a thought of extracting
> > out the
> > > > > auto scaling logic into a new submodule in
> flink-kubernetes-operator
> > > > > repository[3],
> > > > > which will be independent of any resource manager/Operator.
> > > > > Currently the Autoscaling algorithm is very tightly coupled with
> the
> > > > > kubernetes API.
> > > > > This makes the autoscaling core algorithm not so easily extensible
> > for
> > > > > different available resource managers like YARN, Mesos etc.
> > > > > A Separate autoscaling module inside the flink kubernetes operator
> > will
> > > > > help other resource managers to leverage the autoscaling logic.
> > > > >
> > > > > [1]
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
> > > > > [2]
> https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
> > > > > [3] https://github.com/apache/flink-kubernetes-operator
> > > > >
> > > > >
> > > > > Bests,
> > > > > Samrat
> > > > >
> > > >
> >
>

Re: [DISCUSS] Extract core autoscaling algorithm as new SubModule in flink-kubernetes-operator

Posted by Shammon FY <zj...@gmail.com>.

Hi Samrat

Do you mean to create an independent module for flink scaling in
flink-k8s-operator? How about creating a project such as
`flink-auto-scaling` which is completely independent? Besides resource
managers such as k8s and yarn, we can do more things in the project, for
example, updating config in the user's `job submission system` after
scaling flink jobs. WDYT?

Best,
Shammon


On Thu, Feb 16, 2023 at 7:38 PM Maximilian Michels <mx...@apache.org> wrote:

> Hi Samrat,
>
> The autoscaling module is now pluggable but it is still tightly
> coupled with Kubernetes. It will take additional work for the logic to
> work independently of the cluster manager.
>
> -Max
>
> On Thu, Feb 16, 2023 at 11:14 AM Samrat Deb <de...@gmail.com> wrote:
> >
> > Oh! yesterday it got merged.
> > Apologies , I missed the recent commit @Gyula.
> >
> > Thanks for the update
> >
> >
> >
> > On Thu, Feb 16, 2023 at 3:17 PM Gyula Fóra <gy...@gmail.com> wrote:
> >
> > > Max recently moved the autoscaler logic in a separate submodule, did
> you
> > > see that?
> > >
> > >
> > >
> https://github.com/apache/flink-kubernetes-operator/commit/5bb8e9dc4dd29e10f3ba7c8ce7cefcdffbf92da4
> > >
> > > Gyula
> > >
> > > On Thu, Feb 16, 2023 at 10:27 AM Samrat Deb <de...@gmail.com>
> wrote:
> > >
> > > > Hi ,
> > > >
> > > > *Context:*
> > > > Auto Scaling was introduced in Flink as part of FLIP-271[1].
> > > > It discusses one of the important aspects to provide a robust default
> > > > scaling algorithm.
> > > >       a. Ensure scaling yields effective usage of assigned task
> slots.
> > > >       b. Ramp up in case of any backlog to ensure it gets processed
> in a
> > > > timely manner
> > > >       c. Minimize the number of scaling decisions to prevent costly
> > > rescale
> > > > operation
> > > > The flip intends to add an auto scaling framework based on 6 major
> > > metrics
> > > > and contains different types of threshold to trigger the scaling.
> > > >
> > > > Thread[2] discusses a different problem: why autoscaler is part of
> the
> > > > operator instead of jobmanager at runtime.
> > > > The Community decided to keep the autoscaling logic in the
> > > > flink-kubernetes-operator.
> > > >
> > > > *Proposal: *
> > > > In this discussion, I want to put forward a thought of extracting
> out the
> > > > auto scaling logic into a new submodule in flink-kubernetes-operator
> > > > repository[3],
> > > > which will be independent of any resource manager/Operator.
> > > > Currently the Autoscaling algorithm is very tightly coupled with the
> > > > kubernetes API.
> > > > This makes the autoscaling core algorithm not so easily extensible
> for
> > > > different available resource managers like YARN, Mesos etc.
> > > > A Separate autoscaling module inside the flink kubernetes operator
> will
> > > > help other resource managers to leverage the autoscaling logic.
> > > >
> > > > [1]
> > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
> > > > [2] https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
> > > > [3] https://github.com/apache/flink-kubernetes-operator
> > > >
> > > >
> > > > Bests,
> > > > Samrat
> > > >
> > >
>

Re: [DISCUSS] Extract core autoscaling algorithm as new SubModule in flink-kubernetes-operator

Posted by Maximilian Michels <mx...@apache.org>.

Hi Samrat,

The autoscaling module is now pluggable but it is still tightly
coupled with Kubernetes. It will take additional work for the logic to
work independently of the cluster manager.

-Max

On Thu, Feb 16, 2023 at 11:14 AM Samrat Deb <de...@gmail.com> wrote:
>
> Oh! yesterday it got merged.
> Apologies , I missed the recent commit @Gyula.
>
> Thanks for the update
>
>
>
> On Thu, Feb 16, 2023 at 3:17 PM Gyula Fóra <gy...@gmail.com> wrote:
>
> > Max recently moved the autoscaler logic in a separate submodule, did you
> > see that?
> >
> >
> > https://github.com/apache/flink-kubernetes-operator/commit/5bb8e9dc4dd29e10f3ba7c8ce7cefcdffbf92da4
> >
> > Gyula
> >
> > On Thu, Feb 16, 2023 at 10:27 AM Samrat Deb <de...@gmail.com> wrote:
> >
> > > Hi ,
> > >
> > > *Context:*
> > > Auto Scaling was introduced in Flink as part of FLIP-271[1].
> > > It discusses one of the important aspects to provide a robust default
> > > scaling algorithm.
> > >       a. Ensure scaling yields effective usage of assigned task slots.
> > >       b. Ramp up in case of any backlog to ensure it gets processed in a
> > > timely manner
> > >       c. Minimize the number of scaling decisions to prevent costly
> > rescale
> > > operation
> > > The flip intends to add an auto scaling framework based on 6 major
> > metrics
> > > and contains different types of threshold to trigger the scaling.
> > >
> > > Thread[2] discusses a different problem: why autoscaler is part of the
> > > operator instead of jobmanager at runtime.
> > > The Community decided to keep the autoscaling logic in the
> > > flink-kubernetes-operator.
> > >
> > > *Proposal: *
> > > In this discussion, I want to put forward a thought of extracting out the
> > > auto scaling logic into a new submodule in flink-kubernetes-operator
> > > repository[3],
> > > which will be independent of any resource manager/Operator.
> > > Currently the Autoscaling algorithm is very tightly coupled with the
> > > kubernetes API.
> > > This makes the autoscaling core algorithm not so easily extensible for
> > > different available resource managers like YARN, Mesos etc.
> > > A Separate autoscaling module inside the flink kubernetes operator will
> > > help other resource managers to leverage the autoscaling logic.
> > >
> > > [1]
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
> > > [2] https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
> > > [3] https://github.com/apache/flink-kubernetes-operator
> > >
> > >
> > > Bests,
> > > Samrat
> > >
> >

Re: [DISCUSS] Extract core autoscaling algorithm as new SubModule in flink-kubernetes-operator

Posted by Samrat Deb <de...@gmail.com>.

Oh! yesterday it got merged.
Apologies , I missed the recent commit @Gyula.

Thanks for the update



On Thu, Feb 16, 2023 at 3:17 PM Gyula Fóra <gy...@gmail.com> wrote:

> Max recently moved the autoscaler logic in a separate submodule, did you
> see that?
>
>
> https://github.com/apache/flink-kubernetes-operator/commit/5bb8e9dc4dd29e10f3ba7c8ce7cefcdffbf92da4
>
> Gyula
>
> On Thu, Feb 16, 2023 at 10:27 AM Samrat Deb <de...@gmail.com> wrote:
>
> > Hi ,
> >
> > *Context:*
> > Auto Scaling was introduced in Flink as part of FLIP-271[1].
> > It discusses one of the important aspects to provide a robust default
> > scaling algorithm.
> >       a. Ensure scaling yields effective usage of assigned task slots.
> >       b. Ramp up in case of any backlog to ensure it gets processed in a
> > timely manner
> >       c. Minimize the number of scaling decisions to prevent costly
> rescale
> > operation
> > The flip intends to add an auto scaling framework based on 6 major
> metrics
> > and contains different types of threshold to trigger the scaling.
> >
> > Thread[2] discusses a different problem: why autoscaler is part of the
> > operator instead of jobmanager at runtime.
> > The Community decided to keep the autoscaling logic in the
> > flink-kubernetes-operator.
> >
> > *Proposal: *
> > In this discussion, I want to put forward a thought of extracting out the
> > auto scaling logic into a new submodule in flink-kubernetes-operator
> > repository[3],
> > which will be independent of any resource manager/Operator.
> > Currently the Autoscaling algorithm is very tightly coupled with the
> > kubernetes API.
> > This makes the autoscaling core algorithm not so easily extensible for
> > different available resource managers like YARN, Mesos etc.
> > A Separate autoscaling module inside the flink kubernetes operator will
> > help other resource managers to leverage the autoscaling logic.
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
> > [2] https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
> > [3] https://github.com/apache/flink-kubernetes-operator
> >
> >
> > Bests,
> > Samrat
> >
>

Re: [DISCUSS] Extract core autoscaling algorithm as new SubModule in flink-kubernetes-operator

Posted by Gyula Fóra <gy...@gmail.com>.

Max recently moved the autoscaler logic in a separate submodule, did you
see that?

https://github.com/apache/flink-kubernetes-operator/commit/5bb8e9dc4dd29e10f3ba7c8ce7cefcdffbf92da4

Gyula

On Thu, Feb 16, 2023 at 10:27 AM Samrat Deb <de...@gmail.com> wrote:

> Hi ,
>
> *Context:*
> Auto Scaling was introduced in Flink as part of FLIP-271[1].
> It discusses one of the important aspects to provide a robust default
> scaling algorithm.
>       a. Ensure scaling yields effective usage of assigned task slots.
>       b. Ramp up in case of any backlog to ensure it gets processed in a
> timely manner
>       c. Minimize the number of scaling decisions to prevent costly rescale
> operation
> The flip intends to add an auto scaling framework based on 6 major metrics
> and contains different types of threshold to trigger the scaling.
>
> Thread[2] discusses a different problem: why autoscaler is part of the
> operator instead of jobmanager at runtime.
> The Community decided to keep the autoscaling logic in the
> flink-kubernetes-operator.
>
> *Proposal: *
> In this discussion, I want to put forward a thought of extracting out the
> auto scaling logic into a new submodule in flink-kubernetes-operator
> repository[3],
> which will be independent of any resource manager/Operator.
> Currently the Autoscaling algorithm is very tightly coupled with the
> kubernetes API.
> This makes the autoscaling core algorithm not so easily extensible for
> different available resource managers like YARN, Mesos etc.
> A Separate autoscaling module inside the flink kubernetes operator will
> help other resource managers to leverage the autoscaling logic.
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
> [2] https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
> [3] https://github.com/apache/flink-kubernetes-operator
>
>
> Bests,
> Samrat
>