You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by Zheng Yu Chen <ja...@gmail.com> on 2022/08/16 09:40:41 UTC

[Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

Hi community ~

I think this title should be quite interesting. The idea is to reduce the
workload of the JobManager and make the SessionCluster [2] more stable in
the process of running jobs. I designed a plan for splitting the JobManager
on FLIP-257 [1]:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process>

This proposal proposes a splitting scheme for the current process and a new
process implementation idea that is compatible with the original process
model: splitting the internal JobMaster component of the JobManager, and
controlling whether to enable this new process through a parameter In the
split scheme, when the user configures, the JobMaster will make it run as
an independent service, reducing the workload of the JobManager. By
implementing a new Dispatcher to communicate and interact with a single
split JobMaster or multiple JobMasters, to achieve job management

The main features that it provides is:

   - After the user submits the job, the JobMaster thread was split into
   other processes to run in the past. They no longer run in the JobManager,
   but in other processes.
   - Users can deploy multiple components mentioned above, which run
   multiple JobMaster threads, thereby reducing the workload of the JobManager

Some of the challenging use cases that these features solve are:

   - Compatible with the original job running mode (run JobMaster Thread on
   JobManager)
   - Implement a new Dispatcher that forwards client operations related to
   jobs


 I would love to hear and address your thoughts and feedback , and if
possible drive a FLIP-257 !


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process>

[2]
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#session-mode


-- 

Have a nice day ~

ConradJam

Re: [Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

Posted by Zheng Yu Chen <ja...@gmail.com>.
Thank you for your interest in this program. This proposal only takes
effect for Session Cluster Mode, and will not take effect for Application
Mode. I would like to answer some of your previous questions about patterns
and practical cases first, and then discuss some implementation details
later.
On a SessionCluster cluster in production, I run dozens or hundreds of
jobs, which are not short-running jobs, but long-running jobs. Why they
don't choose Applicantion Mode
   ○ This can save some JVM resources and reduce server costs
   ○ More adequate resource utilization
   ○ Starting Application Mode has a long resource application and waiting
(because SessionCluster has already applied for fixed TM and JM resources
at startup)
But we will also find the following potential problems
   ○ Poor isolation between JobMaster threads in JobManager: When there are
too many jobs, the JobManager is under great pressure. For example, when
running more than 20 jobs, because the JobMaster thread is inside the
JobManager process, it is equivalent to having more than 20 threads in one
maintenance The thread pools compete with each other for resources to
restore or compile JobGraph, etc.
   ○ JobManager's functional responsibilities are too large: With the
development of Flink, there will inevitably be more rich functions running
on JobMaster. As the title says, when in SessionCluster mode, should we
consider JobManager as a routing-like forwarding function , instead of a
component that integrates huge functions, proper splitting reduces the
workload of the JobManager
Based on the above situation, so there is the idea of the title. Can we
introduce the JobMasterContainer component to reduce the impact between
jobs, so that jobs can run more gracefully and for a long time on the
SessionCluster instead of just running rather small/short-lived jobs (e.g.
FlinkSQL queries) or when deploying some kind
of dev environment for testing out job implementations

What do you think ~


Chesnay Schepler <ch...@apache.org> 于2022年8月17日周三 22:31写道:

> To be honest I'm terrified at the idea of splitting the Dispatcher into
> several processes, even more so if this is supposed to be opt-in and
> specific to session mode.
> It would fragment the coordination layer even more than it already is,
> and make ops more complicated (yet another set of processes to monitor,
> configure etc.).
>
> I'm not convinced that this proposal really gets us a lot of benefits;
> and would rather propose that you split your single session cluster into
> multiple session clusters (with the scheduling component in front of it
> to distribute jobs) to even the load.
>
>  > The currently idling JobManagers could be utilized to take over some
> of the workload from the leader.
>
> This would also be the path I would go down if we'd try to tackle this.
>
> On 17/08/2022 16:22, Matthias Pohl wrote:
> > Hi Conrad,
> > thanks for reaching out to the community with your proposal. I looked
> > through FLIP-257 [1]. Your motivation sounds interesting. Can you
> > elaborate a bit more on the concrete use-cases you have in mind? How
> > do these match the user-cases of the two favored execution modes, i.e.
> > Flink's Session and Application mode?
> >
> > As mentioned in [2], Application Mode should be preferred for single
> > long-running jobs to isolate the resources of each of those jobs from
> > each other. In contrast, Session Mode is the natural choice when
> > running rather small/short-lived jobs (e.g. FlinkSQL queries) or when
> > deploying some kind of dev environment for testing out job
> > implementations. It feels like your use-case is somewhere in between a
> > bit? It would be interesting to get a better understanding of where
> > you're coming from. Maybe, you could provide some workload statistics?
> >
> > That considered, I guess it's a topic worth looking into. Here are a
> > few thoughts after looking into FLIP-257:
> > - As far as I can see, the BlobServer is used for sharing
> > configuration information (e.g. Classpath information) as part of the
> > ExecutionGraph instantiation [3]. The JobGraph is not persisted
> > through the BlobServer but rather stored in the JobGraphStore backed
> > by the HighAvailabilityServices implementation. The HA side is not
> > really covered in FLIP-257, yet.
> > - The approach of having the current Dispatcher living next to the new
> > JobMasterDispatcher (that encapsulates the logic around distributing
> > the workload onto multiple runners) leaves me with some doubt whether
> > there wouldn't be a better separation of concerns here. What about
> > leaving the Dispatcher as is but adding some abstraction between
> > JobManagerRunner/JobMaster and the Dispatcher that hides the logic
> > around whether these instances are "deployed" on the same machine or
> > somewhere else.
> > - About distributing JobManager workload: The JobManager already
> > utilizes leader election for faster recovery. Hence, one can set up
> > multiple JobManagers in idle mode which wait to gain leadership and
> > pick up the work (i.e. recovering the jobs) of the previously failed
> > JobManager leader. What about utilizing this setup: The currently
> > idling JobManagers could be utilized to take over some of the workload
> > from the leader. I haven't thought this through, yet. But I'm
> > wondering whether that would be a path we could go down. This would
> > enable us to still stick to the JobManager/TaskManager setup which
> > users are already used to rather than introducing another type of
> > cluster node.
> > - The JobManager initialization logic is kind of tricky to get your
> > head around. There is some overhead, I hope, we could clean up as part
> > of the efforts of removing the per-Job Mode from Flink [4]. It was
> > decided to deprecate per-Job Mode in Flink 1.15. But we have to stick
> > with it for some time (i.e. it's not going to be removed in 1.16)
> > since it's a quite basic feature users might rely on. This shouldn't
> > be a blocker. I just wanted to mention it to have it in the back of
> > our minds when looking into ways to come up with a solid proposal for
> > FLIP-257.
> > - My concern is that this FLIP might turn out to be larger than
> > expected and that it might be worth cutting it down into smaller
> > chunks with each being covered in a separate FLIP down the road if we
> > have some agreement and a clearer picture on how this should be tackled.
> >
> > I'm gonna add Chesnay and David to this discussion.
> >
> > Best,
> > Matthias
> >
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > [2]
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#deployment-modes
> > [3]
> >
> https://github.com/apache/flink/blob/9ed70a1e8b5d59abdf9d7673bc5b44d421140ef0/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/DefaultExecutionGraph.java#L333
> > [4] https://lists.apache.org/thread/b8g76cqgtr2c515rd1bs41vy285f317n
> >
> >
> > On Tue, Aug 16, 2022 at 11:43 AM Zheng Yu Chen <ja...@gmail.com>
> > wrote:
> >
> >     Hi community ~
> >
> >     I think this title should be quite interesting. The idea is to
> >     reduce the
> >     workload of the JobManager and make the SessionCluster [2] more
> >     stable in
> >     the process of running jobs. I designed a plan for splitting the
> >     JobManager
> >     on FLIP-257 [1]:
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> >     <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
> >
> >
> >     This proposal proposes a splitting scheme for the current process
> >     and a new
> >     process implementation idea that is compatible with the original
> >     process
> >     model: splitting the internal JobMaster component of the
> >     JobManager, and
> >     controlling whether to enable this new process through a parameter
> >     In the
> >     split scheme, when the user configures, the JobMaster will make it
> >     run as
> >     an independent service, reducing the workload of the JobManager. By
> >     implementing a new Dispatcher to communicate and interact with a
> >     single
> >     split JobMaster or multiple JobMasters, to achieve job management
> >
> >     The main features that it provides is:
> >
> >        - After the user submits the job, the JobMaster thread was
> >     split into
> >        other processes to run in the past. They no longer run in the
> >     JobManager,
> >        but in other processes.
> >        - Users can deploy multiple components mentioned above, which run
> >        multiple JobMaster threads, thereby reducing the workload of
> >     the JobManager
> >
> >     Some of the challenging use cases that these features solve are:
> >
> >        - Compatible with the original job running mode (run JobMaster
> >     Thread on
> >        JobManager)
> >        - Implement a new Dispatcher that forwards client operations
> >     related to
> >        jobs
> >
> >
> >      I would love to hear and address your thoughts and feedback , and if
> >     possible drive a FLIP-257 !
> >
> >
> >     [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> >     <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
> >
> >
> >     [2]
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#session-mode
> >
> >
> >     --
> >
> >     Have a nice day ~
> >
> >     ConradJam
> >
>


-- 
Best

ConradJam

Re: [Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

Posted by David Morávek <dm...@apache.org>.
Hi Zheng,

Thanks for the write-up! I tend to agree with Chesnay that this introduces
additional complexity to an already complex deployment model.

One of the main focuses in this area is to reduce feature sparsity and to
have fewer high-quality options. Example efforts are deprecation (and
eventual removal) of per-job mode, removal of Mesos RM, ...

Let's discuss your points:

> This can save some JVM resources and reduce server costs

if so, the saving would IMO be negligible; why?

- JobMaster is by far the most resource-intensive component inside the
JobManager
- The CPU / memory ratio of the underlying hypervisor remains the same (or
you'd have unused resources on the machine that you still need to pay for)
- The most overhead of the JobMaster comes from the JVM itself, not from RM
/ Dispatcher

> More adequate resource utilization

Can you elaborate? Is this about sharing TMs between multiple jobs (I'd
discourage that for long-running mission-critical workloads)?

> Starting Application Mode has a long resource application and waiting
(because SessionCluster has already applied for fixed TM and JM resources
at startup)

This means you have to overprovision your SessionCluster. This goes against
resource utilization efforts from the previous point (you're shaving off
little resources from JM, but have spare TMs instead, that are the order of
magnitude
 more resource intensive).

If you're able to start TMs upfront with the session cluster, you already
know you're going to need them. If this is a concern, you could as well
start the TMs that will eventually connect to your JM once it starts
(you've decided to submit your job) - there might be some enhancements to
ApplicationMode needed to make this robust, but efforts in this direction
are where the things should IMO be headed.

As for the resource utilization, the session cluster actually blocks you
from leveraging reactive scaling efforts and eventually auto-scaling,
because we'd need to enhance Flink surface area with multi-job scheduling
capabilities (queues, pre-emptions, priorities between jobs) - I don't
think we should ever go in that direction, that's outside Flink's scope.

> Poor isolation between JobMaster threads in JobManager: When there are
too many jobs, the JobManager is under great pressure.

The session mode is mainly designed for interactive workloads but agreed
that JM threads might interfere. Still, I fail to see this as a reason for
introducing additional complexity because this could be mitigated on the
user side (smarter job scheduling, multiple clusters, AM for streaming
jobs).

> there will inevitably be more rich functions running on JobMaster.

This is a separate discussion. So far we were mostly pushing against
running against any user code on JM (there are few exceptions already, but
any enhancement should be carefully considered)

> JobManager's functional responsibilities are too large

from the "architecture perspective", it's just a bundle of independent
components with clearly defined responsibilities, that makes their
coordination simpler and more resource efficient (networking, fewer JVMs -
each comes with a significant overhead)

--

So far I'm under impression that this actually introduces more issues than
it tries to solve.

Best,
D.


On Thu, Aug 18, 2022 at 12:10 PM Zheng Yu Chen <ja...@gmail.com> wrote:

> You're right, this does add to the complexity of their communication
> coordination
> I can understand what you mean is similar to ngnix, load balancing to
> different SessionClusters in the front, rather than one more component. In
> fact, I have tried this myself, and it seems to solve the problem of high
> load of cluster JM, but it cannot fundamentally solve the following
> problems
>
> Deploying components is complicated and requires one more ngnix and related
> configuration. You also need to make sure that your jobs are not assigned
> to a busy JobManager
> As well as my previous reply mentioned the problem, this is a trade-off
> solution (after all, you can choose Application Mode, so there will be no
> such problem), when we need to use SessionCluster for long-running jobs, we
> Can you think like this?
>
> what do you think ~
>
>
> Chesnay Schepler <ch...@apache.org> 于2022年8月17日周三 22:31写道:
>
> > To be honest I'm terrified at the idea of splitting the Dispatcher into
> > several processes, even more so if this is supposed to be opt-in and
> > specific to session mode.
> > It would fragment the coordination layer even more than it already is,
> > and make ops more complicated (yet another set of processes to monitor,
> > configure etc.).
> >
> > I'm not convinced that this proposal really gets us a lot of benefits;
> > and would rather propose that you split your single session cluster into
> > multiple session clusters (with the scheduling component in front of it
> > to distribute jobs) to even the load.
> >
> >  > The currently idling JobManagers could be utilized to take over some
> > of the workload from the leader.
> >
> > This would also be the path I would go down if we'd try to tackle this.
> >
> > On 17/08/2022 16:22, Matthias Pohl wrote:
> > > Hi Conrad,
> > > thanks for reaching out to the community with your proposal. I looked
> > > through FLIP-257 [1]. Your motivation sounds interesting. Can you
> > > elaborate a bit more on the concrete use-cases you have in mind? How
> > > do these match the user-cases of the two favored execution modes, i.e.
> > > Flink's Session and Application mode?
> > >
> > > As mentioned in [2], Application Mode should be preferred for single
> > > long-running jobs to isolate the resources of each of those jobs from
> > > each other. In contrast, Session Mode is the natural choice when
> > > running rather small/short-lived jobs (e.g. FlinkSQL queries) or when
> > > deploying some kind of dev environment for testing out job
> > > implementations. It feels like your use-case is somewhere in between a
> > > bit? It would be interesting to get a better understanding of where
> > > you're coming from. Maybe, you could provide some workload statistics?
> > >
> > > That considered, I guess it's a topic worth looking into. Here are a
> > > few thoughts after looking into FLIP-257:
> > > - As far as I can see, the BlobServer is used for sharing
> > > configuration information (e.g. Classpath information) as part of the
> > > ExecutionGraph instantiation [3]. The JobGraph is not persisted
> > > through the BlobServer but rather stored in the JobGraphStore backed
> > > by the HighAvailabilityServices implementation. The HA side is not
> > > really covered in FLIP-257, yet.
> > > - The approach of having the current Dispatcher living next to the new
> > > JobMasterDispatcher (that encapsulates the logic around distributing
> > > the workload onto multiple runners) leaves me with some doubt whether
> > > there wouldn't be a better separation of concerns here. What about
> > > leaving the Dispatcher as is but adding some abstraction between
> > > JobManagerRunner/JobMaster and the Dispatcher that hides the logic
> > > around whether these instances are "deployed" on the same machine or
> > > somewhere else.
> > > - About distributing JobManager workload: The JobManager already
> > > utilizes leader election for faster recovery. Hence, one can set up
> > > multiple JobManagers in idle mode which wait to gain leadership and
> > > pick up the work (i.e. recovering the jobs) of the previously failed
> > > JobManager leader. What about utilizing this setup: The currently
> > > idling JobManagers could be utilized to take over some of the workload
> > > from the leader. I haven't thought this through, yet. But I'm
> > > wondering whether that would be a path we could go down. This would
> > > enable us to still stick to the JobManager/TaskManager setup which
> > > users are already used to rather than introducing another type of
> > > cluster node.
> > > - The JobManager initialization logic is kind of tricky to get your
> > > head around. There is some overhead, I hope, we could clean up as part
> > > of the efforts of removing the per-Job Mode from Flink [4]. It was
> > > decided to deprecate per-Job Mode in Flink 1.15. But we have to stick
> > > with it for some time (i.e. it's not going to be removed in 1.16)
> > > since it's a quite basic feature users might rely on. This shouldn't
> > > be a blocker. I just wanted to mention it to have it in the back of
> > > our minds when looking into ways to come up with a solid proposal for
> > > FLIP-257.
> > > - My concern is that this FLIP might turn out to be larger than
> > > expected and that it might be worth cutting it down into smaller
> > > chunks with each being covered in a separate FLIP down the road if we
> > > have some agreement and a clearer picture on how this should be
> tackled.
> > >
> > > I'm gonna add Chesnay and David to this discussion.
> > >
> > > Best,
> > > Matthias
> > >
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > > [2]
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#deployment-modes
> > > [3]
> > >
> >
> https://github.com/apache/flink/blob/9ed70a1e8b5d59abdf9d7673bc5b44d421140ef0/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/DefaultExecutionGraph.java#L333
> > > [4] https://lists.apache.org/thread/b8g76cqgtr2c515rd1bs41vy285f317n
> > >
> > >
> > > On Tue, Aug 16, 2022 at 11:43 AM Zheng Yu Chen <ja...@gmail.com>
> > > wrote:
> > >
> > >     Hi community ~
> > >
> > >     I think this title should be quite interesting. The idea is to
> > >     reduce the
> > >     workload of the JobManager and make the SessionCluster [2] more
> > >     stable in
> > >     the process of running jobs. I designed a plan for splitting the
> > >     JobManager
> > >     on FLIP-257 [1]:
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > >     <
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
> > >
> > >
> > >     This proposal proposes a splitting scheme for the current process
> > >     and a new
> > >     process implementation idea that is compatible with the original
> > >     process
> > >     model: splitting the internal JobMaster component of the
> > >     JobManager, and
> > >     controlling whether to enable this new process through a parameter
> > >     In the
> > >     split scheme, when the user configures, the JobMaster will make it
> > >     run as
> > >     an independent service, reducing the workload of the JobManager. By
> > >     implementing a new Dispatcher to communicate and interact with a
> > >     single
> > >     split JobMaster or multiple JobMasters, to achieve job management
> > >
> > >     The main features that it provides is:
> > >
> > >        - After the user submits the job, the JobMaster thread was
> > >     split into
> > >        other processes to run in the past. They no longer run in the
> > >     JobManager,
> > >        but in other processes.
> > >        - Users can deploy multiple components mentioned above, which
> run
> > >        multiple JobMaster threads, thereby reducing the workload of
> > >     the JobManager
> > >
> > >     Some of the challenging use cases that these features solve are:
> > >
> > >        - Compatible with the original job running mode (run JobMaster
> > >     Thread on
> > >        JobManager)
> > >        - Implement a new Dispatcher that forwards client operations
> > >     related to
> > >        jobs
> > >
> > >
> > >      I would love to hear and address your thoughts and feedback , and
> if
> > >     possible drive a FLIP-257 !
> > >
> > >
> > >     [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > >     <
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
> > >
> > >
> > >     [2]
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#session-mode
> > >
> > >
> > >     --
> > >
> > >     Have a nice day ~
> > >
> > >     ConradJam
> > >
> >
>
>
> --
> Best
>
> ConradJam
>

Re: [Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

Posted by Zheng Yu Chen <ja...@gmail.com>.
You're right, this does add to the complexity of their communication
coordination
I can understand what you mean is similar to ngnix, load balancing to
different SessionClusters in the front, rather than one more component. In
fact, I have tried this myself, and it seems to solve the problem of high
load of cluster JM, but it cannot fundamentally solve the following problems

Deploying components is complicated and requires one more ngnix and related
configuration. You also need to make sure that your jobs are not assigned
to a busy JobManager
As well as my previous reply mentioned the problem, this is a trade-off
solution (after all, you can choose Application Mode, so there will be no
such problem), when we need to use SessionCluster for long-running jobs, we
Can you think like this?

what do you think ~


Chesnay Schepler <ch...@apache.org> 于2022年8月17日周三 22:31写道:

> To be honest I'm terrified at the idea of splitting the Dispatcher into
> several processes, even more so if this is supposed to be opt-in and
> specific to session mode.
> It would fragment the coordination layer even more than it already is,
> and make ops more complicated (yet another set of processes to monitor,
> configure etc.).
>
> I'm not convinced that this proposal really gets us a lot of benefits;
> and would rather propose that you split your single session cluster into
> multiple session clusters (with the scheduling component in front of it
> to distribute jobs) to even the load.
>
>  > The currently idling JobManagers could be utilized to take over some
> of the workload from the leader.
>
> This would also be the path I would go down if we'd try to tackle this.
>
> On 17/08/2022 16:22, Matthias Pohl wrote:
> > Hi Conrad,
> > thanks for reaching out to the community with your proposal. I looked
> > through FLIP-257 [1]. Your motivation sounds interesting. Can you
> > elaborate a bit more on the concrete use-cases you have in mind? How
> > do these match the user-cases of the two favored execution modes, i.e.
> > Flink's Session and Application mode?
> >
> > As mentioned in [2], Application Mode should be preferred for single
> > long-running jobs to isolate the resources of each of those jobs from
> > each other. In contrast, Session Mode is the natural choice when
> > running rather small/short-lived jobs (e.g. FlinkSQL queries) or when
> > deploying some kind of dev environment for testing out job
> > implementations. It feels like your use-case is somewhere in between a
> > bit? It would be interesting to get a better understanding of where
> > you're coming from. Maybe, you could provide some workload statistics?
> >
> > That considered, I guess it's a topic worth looking into. Here are a
> > few thoughts after looking into FLIP-257:
> > - As far as I can see, the BlobServer is used for sharing
> > configuration information (e.g. Classpath information) as part of the
> > ExecutionGraph instantiation [3]. The JobGraph is not persisted
> > through the BlobServer but rather stored in the JobGraphStore backed
> > by the HighAvailabilityServices implementation. The HA side is not
> > really covered in FLIP-257, yet.
> > - The approach of having the current Dispatcher living next to the new
> > JobMasterDispatcher (that encapsulates the logic around distributing
> > the workload onto multiple runners) leaves me with some doubt whether
> > there wouldn't be a better separation of concerns here. What about
> > leaving the Dispatcher as is but adding some abstraction between
> > JobManagerRunner/JobMaster and the Dispatcher that hides the logic
> > around whether these instances are "deployed" on the same machine or
> > somewhere else.
> > - About distributing JobManager workload: The JobManager already
> > utilizes leader election for faster recovery. Hence, one can set up
> > multiple JobManagers in idle mode which wait to gain leadership and
> > pick up the work (i.e. recovering the jobs) of the previously failed
> > JobManager leader. What about utilizing this setup: The currently
> > idling JobManagers could be utilized to take over some of the workload
> > from the leader. I haven't thought this through, yet. But I'm
> > wondering whether that would be a path we could go down. This would
> > enable us to still stick to the JobManager/TaskManager setup which
> > users are already used to rather than introducing another type of
> > cluster node.
> > - The JobManager initialization logic is kind of tricky to get your
> > head around. There is some overhead, I hope, we could clean up as part
> > of the efforts of removing the per-Job Mode from Flink [4]. It was
> > decided to deprecate per-Job Mode in Flink 1.15. But we have to stick
> > with it for some time (i.e. it's not going to be removed in 1.16)
> > since it's a quite basic feature users might rely on. This shouldn't
> > be a blocker. I just wanted to mention it to have it in the back of
> > our minds when looking into ways to come up with a solid proposal for
> > FLIP-257.
> > - My concern is that this FLIP might turn out to be larger than
> > expected and that it might be worth cutting it down into smaller
> > chunks with each being covered in a separate FLIP down the road if we
> > have some agreement and a clearer picture on how this should be tackled.
> >
> > I'm gonna add Chesnay and David to this discussion.
> >
> > Best,
> > Matthias
> >
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > [2]
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#deployment-modes
> > [3]
> >
> https://github.com/apache/flink/blob/9ed70a1e8b5d59abdf9d7673bc5b44d421140ef0/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/DefaultExecutionGraph.java#L333
> > [4] https://lists.apache.org/thread/b8g76cqgtr2c515rd1bs41vy285f317n
> >
> >
> > On Tue, Aug 16, 2022 at 11:43 AM Zheng Yu Chen <ja...@gmail.com>
> > wrote:
> >
> >     Hi community ~
> >
> >     I think this title should be quite interesting. The idea is to
> >     reduce the
> >     workload of the JobManager and make the SessionCluster [2] more
> >     stable in
> >     the process of running jobs. I designed a plan for splitting the
> >     JobManager
> >     on FLIP-257 [1]:
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> >     <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
> >
> >
> >     This proposal proposes a splitting scheme for the current process
> >     and a new
> >     process implementation idea that is compatible with the original
> >     process
> >     model: splitting the internal JobMaster component of the
> >     JobManager, and
> >     controlling whether to enable this new process through a parameter
> >     In the
> >     split scheme, when the user configures, the JobMaster will make it
> >     run as
> >     an independent service, reducing the workload of the JobManager. By
> >     implementing a new Dispatcher to communicate and interact with a
> >     single
> >     split JobMaster or multiple JobMasters, to achieve job management
> >
> >     The main features that it provides is:
> >
> >        - After the user submits the job, the JobMaster thread was
> >     split into
> >        other processes to run in the past. They no longer run in the
> >     JobManager,
> >        but in other processes.
> >        - Users can deploy multiple components mentioned above, which run
> >        multiple JobMaster threads, thereby reducing the workload of
> >     the JobManager
> >
> >     Some of the challenging use cases that these features solve are:
> >
> >        - Compatible with the original job running mode (run JobMaster
> >     Thread on
> >        JobManager)
> >        - Implement a new Dispatcher that forwards client operations
> >     related to
> >        jobs
> >
> >
> >      I would love to hear and address your thoughts and feedback , and if
> >     possible drive a FLIP-257 !
> >
> >
> >     [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> >     <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
> >
> >
> >     [2]
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#session-mode
> >
> >
> >     --
> >
> >     Have a nice day ~
> >
> >     ConradJam
> >
>


-- 
Best

ConradJam

Fwd: [Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

Posted by Shammon FY <zj...@gmail.com>.
Thanks @XintongSong and sorry for replying late. Also thanks Zhengyu for
bringing this discussion up.

We use flink session cluster to run olap queries in ByteDance, and have
supported several users in production. The maximum single cluster has 2000
cores, and we focus on the job scheduling performance, stability and
serviceability of the olap cluster.

I think @chesnay raised an interesting question: using multiple clusters
VS. expand a single cluster. Flink Session Cluster is a stateless compute
cluster, and we can deploy multiple clusters to solve performance and
stability problems.

In fact, we did deploy multiple clusters due to the lack of flink job
scheduling performance a few months ago. We have even deployed 5 session
clusters to support tens of QPS needs for the single user. But after we
finished optimizing Flink job scheduling performance, we immediately merged
these clusters into one. Even in the future, we hope to be able to test the
large cluster of 50000 cores and unify some users with low isolation
requirements to the large cluster.

We really don't like to deploy multiple clusters for each user. Besides the
resource and use cost mentioned by @jam.gzczy, there are some other
problems:
1. We should add monitors and alarms for each cluster
2. When a user has problems, it will increase the cost of problem tracing
in multiple clusters
3. Increase Cluster upgrade costs due to the number of clusters
4. Multiple clusters will reduce the user's confidence in system
performance, and the system should support horizontal expansion, it's very
important

The advantages of using a single cluster instead of multiple clusters to
provide services are obvious. But there's a single point problem of
JobManager in the cluster. How to solve the single point problem? When we
started to use Flink to provide olap services, it was on our list.
JobManager single point has the following problems in flink olap:
1. Recovery time of the cluster. When JobManager fails, the cluster can't
provide services until the JobManager recovers. We have added a hook in the
JobManager to improve the recovery time from stand-by one in about 400ms,
but recovery time may take tens of seconds in some corner cases, it's
terrible.
2. Workload and stability. There may be fullgc in JobManager or it is
processing a large amount of results for some olap queries, other queries
will be affected, resulting in increased latency or decreased throughput.

Therefore, in terms of olap scenarios, we want to split the JobManager, and
it's an important feature. However, we believe that it should be a careful
decision. We need to divide the functions of JobManager carefully,
including REST, Dispatcher, ResourceManager and etc. roles. It involves job
submission, job management, web ui and other core functions.

Personally, I have reservations about the design in
https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split.
It doesn't solve the single point problem of cluster for olap scenario, and
even worse, it increases the network interaction of olap query and the
complexity of the JobManager module.

Thanks

Xintong Song <to...@gmail.com> 于2022年8月28日周日 14:48写道:

> Sorry for joining the discussion late. And thanks Zhengyu for bringing
> this discussion up.
>
> I think this is an interesting topic. I actually had something similar in
> mind for a long time. I haven't carried it out for the same concerns as
> others already mentioned, that the benefit for this effort may not be that
> significant compared to the complexity it introduces. However, I think
> Matthias has come up with an inspiring idea, to simply offload some of the
> JobMasters to standby JobManager processes, which potentially reduces the
> complexity significantly and sounds promising to me.
>
> # My understanding on the session mode
>
> In the early days, a standalone session cluster was probably the most
> straightforward way to deploy Flink on a bunch of machines / VMs. Nowadays,
> as K8s / Yarn deployments become more and more convenient and application
> mode becomes available for the standalone deployment, why (or whether) do
> users still need the session mode?
>
> From my experiences, I see users still want to use session mode in
> production mainly for 2 reasons:
> - To reduce the job bootstrap time. Leveraging an existing session cluster
> would save the time for requesting resources from K8s / Yarn and launching
> / initializing the processes, which in many cases is the major time
> consumption in launching a job. This is valued typically for interactive
> workloads.
> - To improve resource efficiency. Session cluster allows multiple jobs to
> share the same JobManager & TaskManager processes. This reduces the
> framework resource overhead for small workloads.
>
> # Concrete use cases
>
> I've seen 2 use cases which may benefit from this proposal.
>
> - AFAIK, ByteDance builds a large scale OLAP query service with Flink.
> They have a large Flink session cluster that runs >100 sql queries per
> seconds, where the JobManager process becomes the single-pointed
> performance bottleneck. I asked the same question as Chesnay did, that
> whether this can be fulfilled by having multiple session clusters plus a
> load-balancing service. Seems that's exactly the approach they were using,
> but with many limitations. Cc-ed @Shammon, would you like to provide more
> details?
> - One of our users at Alibaba has hundreds of Flink jobs for event
> processing. The workload of each job is typically low (tens of records per
> hour), meanwhile there's a high requirement on the timeliness (records upon
> received must be processed immediately). Consequently, they have a large
> session cluster that runs hundreds of long-running jobs. The reason they
> don't like multiple session clusters is that, as business develops, they
> may frequently create new jobs and retire old ones. It is inconvenient to
> maintain / migrate the long-running jobs across multiple clusters.
>
> # My opinion on this proposal
>
> I do see a value behind this proposal. Admittedly, I haven't seen any use
> cases that absolutely cannot be solved by having multiple session clusters.
> But it would definitely reduce the complexity on the user side if Flink can
> support larger scale session clusters. Additionally, there's a chance that
> spreading the JobMasters across multiple JobManager processes may also help
> with the high availability, that only part of the jobs are affected upon
> JobManager failures.
>
> On the other hand, I share the concern that the current design complicates
> the coordination & deployment components way more than the benefit it
> brings.
>
> I would suggest looking a bit more along the direction of leveraging
> standby JobManager processes as Matthias pointed out, see if the benefit is
> worth the price.
>
> Best,
>
> Xintong
>
>
>
> On Fri, Aug 26, 2022 at 5:55 PM Zheng Yu Chen <ja...@gmail.com> wrote:
>
>> Hi Chesnay ,
>> I have also considered the method you mentioned. If we deploy some
>> load balancing or intelligent scheduling in front of multiple
>> SessionClusters, this may cause the following problems
>> ● Insufficient resource utilization. When we distribute these
>> resources on each cluster, the job cannot make full use of the overall
>> TM resources. Some clusters may have very high workload and some are
>> idle, resulting in wasted resources.
>> ● The user's usage cost increases, and the user introduces additional
>> components to adapt to the SessionCluster. The problem is caused by
>> the overload of the JobManager. If there is a solution on the Flink
>> side, it will be better.
>> Maybe there is a better way to deal with it, I am sorting it out, and
>> I will reply with new ideas in the emails later.
>>
>> Chesnay Schepler <ch...@apache.org> 于2022年8月17日周三 22:31写道:
>> >
>> > To be honest I'm terrified at the idea of splitting the Dispatcher into
>> > several processes, even more so if this is supposed to be opt-in and
>> > specific to session mode.
>> > It would fragment the coordination layer even more than it already is,
>> > and make ops more complicated (yet another set of processes to monitor,
>> > configure etc.).
>> >
>> > I'm not convinced that this proposal really gets us a lot of benefits;
>> > and would rather propose that you split your single session cluster into
>> > multiple session clusters (with the scheduling component in front of it
>> > to distribute jobs) to even the load.
>> >
>> >  > The currently idling JobManagers could be utilized to take over some
>> > of the workload from the leader.
>> >
>> > This would also be the path I would go down if we'd try to tackle this.
>> >
>> > On 17/08/2022 16:22, Matthias Pohl wrote:
>> > > Hi Conrad,
>> > > thanks for reaching out to the community with your proposal. I looked
>> > > through FLIP-257 [1]. Your motivation sounds interesting. Can you
>> > > elaborate a bit more on the concrete use-cases you have in mind? How
>> > > do these match the user-cases of the two favored execution modes, i.e.
>> > > Flink's Session and Application mode?
>> > >
>> > > As mentioned in [2], Application Mode should be preferred for single
>> > > long-running jobs to isolate the resources of each of those jobs from
>> > > each other. In contrast, Session Mode is the natural choice when
>> > > running rather small/short-lived jobs (e.g. FlinkSQL queries) or when
>> > > deploying some kind of dev environment for testing out job
>> > > implementations. It feels like your use-case is somewhere in between a
>> > > bit? It would be interesting to get a better understanding of where
>> > > you're coming from. Maybe, you could provide some workload statistics?
>> > >
>> > > That considered, I guess it's a topic worth looking into. Here are a
>> > > few thoughts after looking into FLIP-257:
>> > > - As far as I can see, the BlobServer is used for sharing
>> > > configuration information (e.g. Classpath information) as part of the
>> > > ExecutionGraph instantiation [3]. The JobGraph is not persisted
>> > > through the BlobServer but rather stored in the JobGraphStore backed
>> > > by the HighAvailabilityServices implementation. The HA side is not
>> > > really covered in FLIP-257, yet.
>> > > - The approach of having the current Dispatcher living next to the new
>> > > JobMasterDispatcher (that encapsulates the logic around distributing
>> > > the workload onto multiple runners) leaves me with some doubt whether
>> > > there wouldn't be a better separation of concerns here. What about
>> > > leaving the Dispatcher as is but adding some abstraction between
>> > > JobManagerRunner/JobMaster and the Dispatcher that hides the logic
>> > > around whether these instances are "deployed" on the same machine or
>> > > somewhere else.
>> > > - About distributing JobManager workload: The JobManager already
>> > > utilizes leader election for faster recovery. Hence, one can set up
>> > > multiple JobManagers in idle mode which wait to gain leadership and
>> > > pick up the work (i.e. recovering the jobs) of the previously failed
>> > > JobManager leader. What about utilizing this setup: The currently
>> > > idling JobManagers could be utilized to take over some of the workload
>> > > from the leader. I haven't thought this through, yet. But I'm
>> > > wondering whether that would be a path we could go down. This would
>> > > enable us to still stick to the JobManager/TaskManager setup which
>> > > users are already used to rather than introducing another type of
>> > > cluster node.
>> > > - The JobManager initialization logic is kind of tricky to get your
>> > > head around. There is some overhead, I hope, we could clean up as part
>> > > of the efforts of removing the per-Job Mode from Flink [4]. It was
>> > > decided to deprecate per-Job Mode in Flink 1.15. But we have to stick
>> > > with it for some time (i.e. it's not going to be removed in 1.16)
>> > > since it's a quite basic feature users might rely on. This shouldn't
>> > > be a blocker. I just wanted to mention it to have it in the back of
>> > > our minds when looking into ways to come up with a solid proposal for
>> > > FLIP-257.
>> > > - My concern is that this FLIP might turn out to be larger than
>> > > expected and that it might be worth cutting it down into smaller
>> > > chunks with each being covered in a separate FLIP down the road if we
>> > > have some agreement and a clearer picture on how this should be
>> tackled.
>> > >
>> > > I'm gonna add Chesnay and David to this discussion.
>> > >
>> > > Best,
>> > > Matthias
>> > >
>> > >
>> > > [1]
>> > >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
>> > > [2]
>> > >
>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#deployment-modes
>> > > [3]
>> > >
>> https://github.com/apache/flink/blob/9ed70a1e8b5d59abdf9d7673bc5b44d421140ef0/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/DefaultExecutionGraph.java#L333
>> > > [4] https://lists.apache.org/thread/b8g76cqgtr2c515rd1bs41vy285f317n
>> > >
>> > >
>> > > On Tue, Aug 16, 2022 at 11:43 AM Zheng Yu Chen <ja...@gmail.com>
>> > > wrote:
>> > >
>> > >     Hi community ~
>> > >
>> > >     I think this title should be quite interesting. The idea is to
>> > >     reduce the
>> > >     workload of the JobManager and make the SessionCluster [2] more
>> > >     stable in
>> > >     the process of running jobs. I designed a plan for splitting the
>> > >     JobManager
>> > >     on FLIP-257 [1]:
>> > >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
>> > >     <
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
>> >
>> > >
>> > >     This proposal proposes a splitting scheme for the current process
>> > >     and a new
>> > >     process implementation idea that is compatible with the original
>> > >     process
>> > >     model: splitting the internal JobMaster component of the
>> > >     JobManager, and
>> > >     controlling whether to enable this new process through a parameter
>> > >     In the
>> > >     split scheme, when the user configures, the JobMaster will make it
>> > >     run as
>> > >     an independent service, reducing the workload of the JobManager.
>> By
>> > >     implementing a new Dispatcher to communicate and interact with a
>> > >     single
>> > >     split JobMaster or multiple JobMasters, to achieve job management
>> > >
>> > >     The main features that it provides is:
>> > >
>> > >        - After the user submits the job, the JobMaster thread was
>> > >     split into
>> > >        other processes to run in the past. They no longer run in the
>> > >     JobManager,
>> > >        but in other processes.
>> > >        - Users can deploy multiple components mentioned above, which
>> run
>> > >        multiple JobMaster threads, thereby reducing the workload of
>> > >     the JobManager
>> > >
>> > >     Some of the challenging use cases that these features solve are:
>> > >
>> > >        - Compatible with the original job running mode (run JobMaster
>> > >     Thread on
>> > >        JobManager)
>> > >        - Implement a new Dispatcher that forwards client operations
>> > >     related to
>> > >        jobs
>> > >
>> > >
>> > >      I would love to hear and address your thoughts and feedback ,
>> and if
>> > >     possible drive a FLIP-257 !
>> > >
>> > >
>> > >     [1]
>> > >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
>> > >     <
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
>> >
>> > >
>> > >     [2]
>> > >
>> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#session-mode
>> > >
>> > >
>> > >     --
>> > >
>> > >     Have a nice day ~
>> > >
>> > >     ConradJam
>> > >
>>
>> --
>> Best
>>
>> ConradJam
>>
>

Re: [Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

Posted by Shammon FY <zj...@gmail.com>.
Thanks @XintongSong and sorry for replying late. Also thanks Zhengyu for
bringing this discussion up.

We use flink session cluster to run olap queries in ByteDance, and have
supported several users in production. The maximum single cluster has 2000
cores, and we focus on the job scheduling performance, stability and
serviceability of the olap cluster.

I think @chesnay raised an interesting question: using multiple clusters
VS. expand a single cluster. Flink Session Cluster is a stateless compute
cluster, and we can deploy multiple clusters to solve performance and
stability problems.

In fact, we did deploy multiple clusters due to the lack of flink job
scheduling performance a few months ago. We have even deployed 5 session
clusters to support tens of QPS needs for the single user. But after we
finished optimizing Flink job scheduling performance, we immediately merged
these clusters into one. Even in the future, we hope to be able to test the
large cluster of 50000 cores and unify some users with low isolation
requirements to the large cluster.

We really don't like to deploy multiple clusters for each user. Besides the
resource and use cost mentioned by @jam.gzczy, there are some other
problems:
1. We should add monitors and alarms for each cluster
2. When a user has problems, it will increase the cost of problem tracing
in multiple clusters
3. Increase Cluster upgrade costs due to the number of clusters
4. Multiple clusters will reduce the user's confidence in system
performance, and the system should support horizontal expansion, it's very
important

The advantages of using a single cluster instead of multiple clusters to
provide services are obvious. But there's a single point problem of
JobManager in the cluster. How to solve the single point problem? When we
started to use Flink to provide olap services, it was on our list.
JobManager single point has the following problems in flink olap:
1. Recovery time of the cluster. When JobManager fails, the cluster can't
provide services until the JobManager recovers. We have added a hook in the
JobManager to improve the recovery time from stand-by one in about 400ms,
but recovery time may take tens of seconds in some corner cases, it's
terrible.
2. Workload and stability. There may be fullgc in JobManager or it is
processing a large amount of results for some olap queries, other queries
will be affected, resulting in increased latency or decreased throughput.

Therefore, in terms of olap scenarios, we want to split the JobManager, and
it's an important feature. However, we believe that it should be a careful
decision. We need to divide the functions of JobManager carefully,
including REST, Dispatcher, ResourceManager and etc. roles. It involves job
submission, job management, web ui and other core functions.

Personally, I have reservations about the design in
https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split.
It doesn't solve the single point problem of cluster for olap scenario, and
even worse, it increases the network interaction of olap query and the
complexity of the JobManager module.

Thanks

Xintong Song <to...@gmail.com> 于2022年8月28日周日 14:48写道:

> Sorry for joining the discussion late. And thanks Zhengyu for bringing
> this discussion up.
>
> I think this is an interesting topic. I actually had something similar in
> mind for a long time. I haven't carried it out for the same concerns as
> others already mentioned, that the benefit for this effort may not be that
> significant compared to the complexity it introduces. However, I think
> Matthias has come up with an inspiring idea, to simply offload some of the
> JobMasters to standby JobManager processes, which potentially reduces the
> complexity significantly and sounds promising to me.
>
> # My understanding on the session mode
>
> In the early days, a standalone session cluster was probably the most
> straightforward way to deploy Flink on a bunch of machines / VMs. Nowadays,
> as K8s / Yarn deployments become more and more convenient and application
> mode becomes available for the standalone deployment, why (or whether) do
> users still need the session mode?
>
> From my experiences, I see users still want to use session mode in
> production mainly for 2 reasons:
> - To reduce the job bootstrap time. Leveraging an existing session cluster
> would save the time for requesting resources from K8s / Yarn and launching
> / initializing the processes, which in many cases is the major time
> consumption in launching a job. This is valued typically for interactive
> workloads.
> - To improve resource efficiency. Session cluster allows multiple jobs to
> share the same JobManager & TaskManager processes. This reduces the
> framework resource overhead for small workloads.
>
> # Concrete use cases
>
> I've seen 2 use cases which may benefit from this proposal.
>
> - AFAIK, ByteDance builds a large scale OLAP query service with Flink.
> They have a large Flink session cluster that runs >100 sql queries per
> seconds, where the JobManager process becomes the single-pointed
> performance bottleneck. I asked the same question as Chesnay did, that
> whether this can be fulfilled by having multiple session clusters plus a
> load-balancing service. Seems that's exactly the approach they were using,
> but with many limitations. Cc-ed @Shammon, would you like to provide more
> details?
> - One of our users at Alibaba has hundreds of Flink jobs for event
> processing. The workload of each job is typically low (tens of records per
> hour), meanwhile there's a high requirement on the timeliness (records upon
> received must be processed immediately). Consequently, they have a large
> session cluster that runs hundreds of long-running jobs. The reason they
> don't like multiple session clusters is that, as business develops, they
> may frequently create new jobs and retire old ones. It is inconvenient to
> maintain / migrate the long-running jobs across multiple clusters.
>
> # My opinion on this proposal
>
> I do see a value behind this proposal. Admittedly, I haven't seen any use
> cases that absolutely cannot be solved by having multiple session clusters.
> But it would definitely reduce the complexity on the user side if Flink can
> support larger scale session clusters. Additionally, there's a chance that
> spreading the JobMasters across multiple JobManager processes may also help
> with the high availability, that only part of the jobs are affected upon
> JobManager failures.
>
> On the other hand, I share the concern that the current design complicates
> the coordination & deployment components way more than the benefit it
> brings.
>
> I would suggest looking a bit more along the direction of leveraging
> standby JobManager processes as Matthias pointed out, see if the benefit is
> worth the price.
>
> Best,
>
> Xintong
>
>
>
> On Fri, Aug 26, 2022 at 5:55 PM Zheng Yu Chen <ja...@gmail.com> wrote:
>
>> Hi Chesnay ,
>> I have also considered the method you mentioned. If we deploy some
>> load balancing or intelligent scheduling in front of multiple
>> SessionClusters, this may cause the following problems
>> ● Insufficient resource utilization. When we distribute these
>> resources on each cluster, the job cannot make full use of the overall
>> TM resources. Some clusters may have very high workload and some are
>> idle, resulting in wasted resources.
>> ● The user's usage cost increases, and the user introduces additional
>> components to adapt to the SessionCluster. The problem is caused by
>> the overload of the JobManager. If there is a solution on the Flink
>> side, it will be better.
>> Maybe there is a better way to deal with it, I am sorting it out, and
>> I will reply with new ideas in the emails later.
>>
>> Chesnay Schepler <ch...@apache.org> 于2022年8月17日周三 22:31写道:
>> >
>> > To be honest I'm terrified at the idea of splitting the Dispatcher into
>> > several processes, even more so if this is supposed to be opt-in and
>> > specific to session mode.
>> > It would fragment the coordination layer even more than it already is,
>> > and make ops more complicated (yet another set of processes to monitor,
>> > configure etc.).
>> >
>> > I'm not convinced that this proposal really gets us a lot of benefits;
>> > and would rather propose that you split your single session cluster into
>> > multiple session clusters (with the scheduling component in front of it
>> > to distribute jobs) to even the load.
>> >
>> >  > The currently idling JobManagers could be utilized to take over some
>> > of the workload from the leader.
>> >
>> > This would also be the path I would go down if we'd try to tackle this.
>> >
>> > On 17/08/2022 16:22, Matthias Pohl wrote:
>> > > Hi Conrad,
>> > > thanks for reaching out to the community with your proposal. I looked
>> > > through FLIP-257 [1]. Your motivation sounds interesting. Can you
>> > > elaborate a bit more on the concrete use-cases you have in mind? How
>> > > do these match the user-cases of the two favored execution modes, i.e.
>> > > Flink's Session and Application mode?
>> > >
>> > > As mentioned in [2], Application Mode should be preferred for single
>> > > long-running jobs to isolate the resources of each of those jobs from
>> > > each other. In contrast, Session Mode is the natural choice when
>> > > running rather small/short-lived jobs (e.g. FlinkSQL queries) or when
>> > > deploying some kind of dev environment for testing out job
>> > > implementations. It feels like your use-case is somewhere in between a
>> > > bit? It would be interesting to get a better understanding of where
>> > > you're coming from. Maybe, you could provide some workload statistics?
>> > >
>> > > That considered, I guess it's a topic worth looking into. Here are a
>> > > few thoughts after looking into FLIP-257:
>> > > - As far as I can see, the BlobServer is used for sharing
>> > > configuration information (e.g. Classpath information) as part of the
>> > > ExecutionGraph instantiation [3]. The JobGraph is not persisted
>> > > through the BlobServer but rather stored in the JobGraphStore backed
>> > > by the HighAvailabilityServices implementation. The HA side is not
>> > > really covered in FLIP-257, yet.
>> > > - The approach of having the current Dispatcher living next to the new
>> > > JobMasterDispatcher (that encapsulates the logic around distributing
>> > > the workload onto multiple runners) leaves me with some doubt whether
>> > > there wouldn't be a better separation of concerns here. What about
>> > > leaving the Dispatcher as is but adding some abstraction between
>> > > JobManagerRunner/JobMaster and the Dispatcher that hides the logic
>> > > around whether these instances are "deployed" on the same machine or
>> > > somewhere else.
>> > > - About distributing JobManager workload: The JobManager already
>> > > utilizes leader election for faster recovery. Hence, one can set up
>> > > multiple JobManagers in idle mode which wait to gain leadership and
>> > > pick up the work (i.e. recovering the jobs) of the previously failed
>> > > JobManager leader. What about utilizing this setup: The currently
>> > > idling JobManagers could be utilized to take over some of the workload
>> > > from the leader. I haven't thought this through, yet. But I'm
>> > > wondering whether that would be a path we could go down. This would
>> > > enable us to still stick to the JobManager/TaskManager setup which
>> > > users are already used to rather than introducing another type of
>> > > cluster node.
>> > > - The JobManager initialization logic is kind of tricky to get your
>> > > head around. There is some overhead, I hope, we could clean up as part
>> > > of the efforts of removing the per-Job Mode from Flink [4]. It was
>> > > decided to deprecate per-Job Mode in Flink 1.15. But we have to stick
>> > > with it for some time (i.e. it's not going to be removed in 1.16)
>> > > since it's a quite basic feature users might rely on. This shouldn't
>> > > be a blocker. I just wanted to mention it to have it in the back of
>> > > our minds when looking into ways to come up with a solid proposal for
>> > > FLIP-257.
>> > > - My concern is that this FLIP might turn out to be larger than
>> > > expected and that it might be worth cutting it down into smaller
>> > > chunks with each being covered in a separate FLIP down the road if we
>> > > have some agreement and a clearer picture on how this should be
>> tackled.
>> > >
>> > > I'm gonna add Chesnay and David to this discussion.
>> > >
>> > > Best,
>> > > Matthias
>> > >
>> > >
>> > > [1]
>> > >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
>> > > [2]
>> > >
>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#deployment-modes
>> > > [3]
>> > >
>> https://github.com/apache/flink/blob/9ed70a1e8b5d59abdf9d7673bc5b44d421140ef0/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/DefaultExecutionGraph.java#L333
>> > > [4] https://lists.apache.org/thread/b8g76cqgtr2c515rd1bs41vy285f317n
>> > >
>> > >
>> > > On Tue, Aug 16, 2022 at 11:43 AM Zheng Yu Chen <ja...@gmail.com>
>> > > wrote:
>> > >
>> > >     Hi community ~
>> > >
>> > >     I think this title should be quite interesting. The idea is to
>> > >     reduce the
>> > >     workload of the JobManager and make the SessionCluster [2] more
>> > >     stable in
>> > >     the process of running jobs. I designed a plan for splitting the
>> > >     JobManager
>> > >     on FLIP-257 [1]:
>> > >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
>> > >     <
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
>> >
>> > >
>> > >     This proposal proposes a splitting scheme for the current process
>> > >     and a new
>> > >     process implementation idea that is compatible with the original
>> > >     process
>> > >     model: splitting the internal JobMaster component of the
>> > >     JobManager, and
>> > >     controlling whether to enable this new process through a parameter
>> > >     In the
>> > >     split scheme, when the user configures, the JobMaster will make it
>> > >     run as
>> > >     an independent service, reducing the workload of the JobManager.
>> By
>> > >     implementing a new Dispatcher to communicate and interact with a
>> > >     single
>> > >     split JobMaster or multiple JobMasters, to achieve job management
>> > >
>> > >     The main features that it provides is:
>> > >
>> > >        - After the user submits the job, the JobMaster thread was
>> > >     split into
>> > >        other processes to run in the past. They no longer run in the
>> > >     JobManager,
>> > >        but in other processes.
>> > >        - Users can deploy multiple components mentioned above, which
>> run
>> > >        multiple JobMaster threads, thereby reducing the workload of
>> > >     the JobManager
>> > >
>> > >     Some of the challenging use cases that these features solve are:
>> > >
>> > >        - Compatible with the original job running mode (run JobMaster
>> > >     Thread on
>> > >        JobManager)
>> > >        - Implement a new Dispatcher that forwards client operations
>> > >     related to
>> > >        jobs
>> > >
>> > >
>> > >      I would love to hear and address your thoughts and feedback ,
>> and if
>> > >     possible drive a FLIP-257 !
>> > >
>> > >
>> > >     [1]
>> > >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
>> > >     <
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
>> >
>> > >
>> > >     [2]
>> > >
>> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#session-mode
>> > >
>> > >
>> > >     --
>> > >
>> > >     Have a nice day ~
>> > >
>> > >     ConradJam
>> > >
>>
>> --
>> Best
>>
>> ConradJam
>>
>

Re: [Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

Posted by Xintong Song <to...@gmail.com>.
Sorry for joining the discussion late. And thanks Zhengyu for bringing this
discussion up.

I think this is an interesting topic. I actually had something similar in
mind for a long time. I haven't carried it out for the same concerns as
others already mentioned, that the benefit for this effort may not be that
significant compared to the complexity it introduces. However, I think
Matthias has come up with an inspiring idea, to simply offload some of the
JobMasters to standby JobManager processes, which potentially reduces the
complexity significantly and sounds promising to me.

# My understanding on the session mode

In the early days, a standalone session cluster was probably the most
straightforward way to deploy Flink on a bunch of machines / VMs. Nowadays,
as K8s / Yarn deployments become more and more convenient and application
mode becomes available for the standalone deployment, why (or whether) do
users still need the session mode?

From my experiences, I see users still want to use session mode in
production mainly for 2 reasons:
- To reduce the job bootstrap time. Leveraging an existing session cluster
would save the time for requesting resources from K8s / Yarn and launching
/ initializing the processes, which in many cases is the major time
consumption in launching a job. This is valued typically for interactive
workloads.
- To improve resource efficiency. Session cluster allows multiple jobs to
share the same JobManager & TaskManager processes. This reduces the
framework resource overhead for small workloads.

# Concrete use cases

I've seen 2 use cases which may benefit from this proposal.

- AFAIK, ByteDance builds a large scale OLAP query service with Flink. They
have a large Flink session cluster that runs >100 sql queries per seconds,
where the JobManager process becomes the single-pointed performance
bottleneck. I asked the same question as Chesnay did, that whether this can
be fulfilled by having multiple session clusters plus a load-balancing
service. Seems that's exactly the approach they were using, but with many
limitations. Cc-ed @Shammon, would you like to provide more details?
- One of our users at Alibaba has hundreds of Flink jobs for event
processing. The workload of each job is typically low (tens of records per
hour), meanwhile there's a high requirement on the timeliness (records upon
received must be processed immediately). Consequently, they have a large
session cluster that runs hundreds of long-running jobs. The reason they
don't like multiple session clusters is that, as business develops, they
may frequently create new jobs and retire old ones. It is inconvenient to
maintain / migrate the long-running jobs across multiple clusters.

# My opinion on this proposal

I do see a value behind this proposal. Admittedly, I haven't seen any use
cases that absolutely cannot be solved by having multiple session clusters.
But it would definitely reduce the complexity on the user side if Flink can
support larger scale session clusters. Additionally, there's a chance that
spreading the JobMasters across multiple JobManager processes may also help
with the high availability, that only part of the jobs are affected upon
JobManager failures.

On the other hand, I share the concern that the current design complicates
the coordination & deployment components way more than the benefit it
brings.

I would suggest looking a bit more along the direction of leveraging
standby JobManager processes as Matthias pointed out, see if the benefit is
worth the price.

Best,

Xintong



On Fri, Aug 26, 2022 at 5:55 PM Zheng Yu Chen <ja...@gmail.com> wrote:

> Hi Chesnay ,
> I have also considered the method you mentioned. If we deploy some
> load balancing or intelligent scheduling in front of multiple
> SessionClusters, this may cause the following problems
> ● Insufficient resource utilization. When we distribute these
> resources on each cluster, the job cannot make full use of the overall
> TM resources. Some clusters may have very high workload and some are
> idle, resulting in wasted resources.
> ● The user's usage cost increases, and the user introduces additional
> components to adapt to the SessionCluster. The problem is caused by
> the overload of the JobManager. If there is a solution on the Flink
> side, it will be better.
> Maybe there is a better way to deal with it, I am sorting it out, and
> I will reply with new ideas in the emails later.
>
> Chesnay Schepler <ch...@apache.org> 于2022年8月17日周三 22:31写道:
> >
> > To be honest I'm terrified at the idea of splitting the Dispatcher into
> > several processes, even more so if this is supposed to be opt-in and
> > specific to session mode.
> > It would fragment the coordination layer even more than it already is,
> > and make ops more complicated (yet another set of processes to monitor,
> > configure etc.).
> >
> > I'm not convinced that this proposal really gets us a lot of benefits;
> > and would rather propose that you split your single session cluster into
> > multiple session clusters (with the scheduling component in front of it
> > to distribute jobs) to even the load.
> >
> >  > The currently idling JobManagers could be utilized to take over some
> > of the workload from the leader.
> >
> > This would also be the path I would go down if we'd try to tackle this.
> >
> > On 17/08/2022 16:22, Matthias Pohl wrote:
> > > Hi Conrad,
> > > thanks for reaching out to the community with your proposal. I looked
> > > through FLIP-257 [1]. Your motivation sounds interesting. Can you
> > > elaborate a bit more on the concrete use-cases you have in mind? How
> > > do these match the user-cases of the two favored execution modes, i.e.
> > > Flink's Session and Application mode?
> > >
> > > As mentioned in [2], Application Mode should be preferred for single
> > > long-running jobs to isolate the resources of each of those jobs from
> > > each other. In contrast, Session Mode is the natural choice when
> > > running rather small/short-lived jobs (e.g. FlinkSQL queries) or when
> > > deploying some kind of dev environment for testing out job
> > > implementations. It feels like your use-case is somewhere in between a
> > > bit? It would be interesting to get a better understanding of where
> > > you're coming from. Maybe, you could provide some workload statistics?
> > >
> > > That considered, I guess it's a topic worth looking into. Here are a
> > > few thoughts after looking into FLIP-257:
> > > - As far as I can see, the BlobServer is used for sharing
> > > configuration information (e.g. Classpath information) as part of the
> > > ExecutionGraph instantiation [3]. The JobGraph is not persisted
> > > through the BlobServer but rather stored in the JobGraphStore backed
> > > by the HighAvailabilityServices implementation. The HA side is not
> > > really covered in FLIP-257, yet.
> > > - The approach of having the current Dispatcher living next to the new
> > > JobMasterDispatcher (that encapsulates the logic around distributing
> > > the workload onto multiple runners) leaves me with some doubt whether
> > > there wouldn't be a better separation of concerns here. What about
> > > leaving the Dispatcher as is but adding some abstraction between
> > > JobManagerRunner/JobMaster and the Dispatcher that hides the logic
> > > around whether these instances are "deployed" on the same machine or
> > > somewhere else.
> > > - About distributing JobManager workload: The JobManager already
> > > utilizes leader election for faster recovery. Hence, one can set up
> > > multiple JobManagers in idle mode which wait to gain leadership and
> > > pick up the work (i.e. recovering the jobs) of the previously failed
> > > JobManager leader. What about utilizing this setup: The currently
> > > idling JobManagers could be utilized to take over some of the workload
> > > from the leader. I haven't thought this through, yet. But I'm
> > > wondering whether that would be a path we could go down. This would
> > > enable us to still stick to the JobManager/TaskManager setup which
> > > users are already used to rather than introducing another type of
> > > cluster node.
> > > - The JobManager initialization logic is kind of tricky to get your
> > > head around. There is some overhead, I hope, we could clean up as part
> > > of the efforts of removing the per-Job Mode from Flink [4]. It was
> > > decided to deprecate per-Job Mode in Flink 1.15. But we have to stick
> > > with it for some time (i.e. it's not going to be removed in 1.16)
> > > since it's a quite basic feature users might rely on. This shouldn't
> > > be a blocker. I just wanted to mention it to have it in the back of
> > > our minds when looking into ways to come up with a solid proposal for
> > > FLIP-257.
> > > - My concern is that this FLIP might turn out to be larger than
> > > expected and that it might be worth cutting it down into smaller
> > > chunks with each being covered in a separate FLIP down the road if we
> > > have some agreement and a clearer picture on how this should be
> tackled.
> > >
> > > I'm gonna add Chesnay and David to this discussion.
> > >
> > > Best,
> > > Matthias
> > >
> > >
> > > [1]
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > > [2]
> > >
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#deployment-modes
> > > [3]
> > >
> https://github.com/apache/flink/blob/9ed70a1e8b5d59abdf9d7673bc5b44d421140ef0/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/DefaultExecutionGraph.java#L333
> > > [4] https://lists.apache.org/thread/b8g76cqgtr2c515rd1bs41vy285f317n
> > >
> > >
> > > On Tue, Aug 16, 2022 at 11:43 AM Zheng Yu Chen <ja...@gmail.com>
> > > wrote:
> > >
> > >     Hi community ~
> > >
> > >     I think this title should be quite interesting. The idea is to
> > >     reduce the
> > >     workload of the JobManager and make the SessionCluster [2] more
> > >     stable in
> > >     the process of running jobs. I designed a plan for splitting the
> > >     JobManager
> > >     on FLIP-257 [1]:
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > >     <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
> >
> > >
> > >     This proposal proposes a splitting scheme for the current process
> > >     and a new
> > >     process implementation idea that is compatible with the original
> > >     process
> > >     model: splitting the internal JobMaster component of the
> > >     JobManager, and
> > >     controlling whether to enable this new process through a parameter
> > >     In the
> > >     split scheme, when the user configures, the JobMaster will make it
> > >     run as
> > >     an independent service, reducing the workload of the JobManager. By
> > >     implementing a new Dispatcher to communicate and interact with a
> > >     single
> > >     split JobMaster or multiple JobMasters, to achieve job management
> > >
> > >     The main features that it provides is:
> > >
> > >        - After the user submits the job, the JobMaster thread was
> > >     split into
> > >        other processes to run in the past. They no longer run in the
> > >     JobManager,
> > >        but in other processes.
> > >        - Users can deploy multiple components mentioned above, which
> run
> > >        multiple JobMaster threads, thereby reducing the workload of
> > >     the JobManager
> > >
> > >     Some of the challenging use cases that these features solve are:
> > >
> > >        - Compatible with the original job running mode (run JobMaster
> > >     Thread on
> > >        JobManager)
> > >        - Implement a new Dispatcher that forwards client operations
> > >     related to
> > >        jobs
> > >
> > >
> > >      I would love to hear and address your thoughts and feedback , and
> if
> > >     possible drive a FLIP-257 !
> > >
> > >
> > >     [1]
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > >     <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
> >
> > >
> > >     [2]
> > >
> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#session-mode
> > >
> > >
> > >     --
> > >
> > >     Have a nice day ~
> > >
> > >     ConradJam
> > >
>
> --
> Best
>
> ConradJam
>

Re: [Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

Posted by Zheng Yu Chen <ja...@gmail.com>.
Hi Chesnay ,
I have also considered the method you mentioned. If we deploy some
load balancing or intelligent scheduling in front of multiple
SessionClusters, this may cause the following problems
● Insufficient resource utilization. When we distribute these
resources on each cluster, the job cannot make full use of the overall
TM resources. Some clusters may have very high workload and some are
idle, resulting in wasted resources.
● The user's usage cost increases, and the user introduces additional
components to adapt to the SessionCluster. The problem is caused by
the overload of the JobManager. If there is a solution on the Flink
side, it will be better.
Maybe there is a better way to deal with it, I am sorting it out, and
I will reply with new ideas in the emails later.

Chesnay Schepler <ch...@apache.org> 于2022年8月17日周三 22:31写道:
>
> To be honest I'm terrified at the idea of splitting the Dispatcher into
> several processes, even more so if this is supposed to be opt-in and
> specific to session mode.
> It would fragment the coordination layer even more than it already is,
> and make ops more complicated (yet another set of processes to monitor,
> configure etc.).
>
> I'm not convinced that this proposal really gets us a lot of benefits;
> and would rather propose that you split your single session cluster into
> multiple session clusters (with the scheduling component in front of it
> to distribute jobs) to even the load.
>
>  > The currently idling JobManagers could be utilized to take over some
> of the workload from the leader.
>
> This would also be the path I would go down if we'd try to tackle this.
>
> On 17/08/2022 16:22, Matthias Pohl wrote:
> > Hi Conrad,
> > thanks for reaching out to the community with your proposal. I looked
> > through FLIP-257 [1]. Your motivation sounds interesting. Can you
> > elaborate a bit more on the concrete use-cases you have in mind? How
> > do these match the user-cases of the two favored execution modes, i.e.
> > Flink's Session and Application mode?
> >
> > As mentioned in [2], Application Mode should be preferred for single
> > long-running jobs to isolate the resources of each of those jobs from
> > each other. In contrast, Session Mode is the natural choice when
> > running rather small/short-lived jobs (e.g. FlinkSQL queries) or when
> > deploying some kind of dev environment for testing out job
> > implementations. It feels like your use-case is somewhere in between a
> > bit? It would be interesting to get a better understanding of where
> > you're coming from. Maybe, you could provide some workload statistics?
> >
> > That considered, I guess it's a topic worth looking into. Here are a
> > few thoughts after looking into FLIP-257:
> > - As far as I can see, the BlobServer is used for sharing
> > configuration information (e.g. Classpath information) as part of the
> > ExecutionGraph instantiation [3]. The JobGraph is not persisted
> > through the BlobServer but rather stored in the JobGraphStore backed
> > by the HighAvailabilityServices implementation. The HA side is not
> > really covered in FLIP-257, yet.
> > - The approach of having the current Dispatcher living next to the new
> > JobMasterDispatcher (that encapsulates the logic around distributing
> > the workload onto multiple runners) leaves me with some doubt whether
> > there wouldn't be a better separation of concerns here. What about
> > leaving the Dispatcher as is but adding some abstraction between
> > JobManagerRunner/JobMaster and the Dispatcher that hides the logic
> > around whether these instances are "deployed" on the same machine or
> > somewhere else.
> > - About distributing JobManager workload: The JobManager already
> > utilizes leader election for faster recovery. Hence, one can set up
> > multiple JobManagers in idle mode which wait to gain leadership and
> > pick up the work (i.e. recovering the jobs) of the previously failed
> > JobManager leader. What about utilizing this setup: The currently
> > idling JobManagers could be utilized to take over some of the workload
> > from the leader. I haven't thought this through, yet. But I'm
> > wondering whether that would be a path we could go down. This would
> > enable us to still stick to the JobManager/TaskManager setup which
> > users are already used to rather than introducing another type of
> > cluster node.
> > - The JobManager initialization logic is kind of tricky to get your
> > head around. There is some overhead, I hope, we could clean up as part
> > of the efforts of removing the per-Job Mode from Flink [4]. It was
> > decided to deprecate per-Job Mode in Flink 1.15. But we have to stick
> > with it for some time (i.e. it's not going to be removed in 1.16)
> > since it's a quite basic feature users might rely on. This shouldn't
> > be a blocker. I just wanted to mention it to have it in the back of
> > our minds when looking into ways to come up with a solid proposal for
> > FLIP-257.
> > - My concern is that this FLIP might turn out to be larger than
> > expected and that it might be worth cutting it down into smaller
> > chunks with each being covered in a separate FLIP down the road if we
> > have some agreement and a clearer picture on how this should be tackled.
> >
> > I'm gonna add Chesnay and David to this discussion.
> >
> > Best,
> > Matthias
> >
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > [2]
> > https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#deployment-modes
> > [3]
> > https://github.com/apache/flink/blob/9ed70a1e8b5d59abdf9d7673bc5b44d421140ef0/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/DefaultExecutionGraph.java#L333
> > [4] https://lists.apache.org/thread/b8g76cqgtr2c515rd1bs41vy285f317n
> >
> >
> > On Tue, Aug 16, 2022 at 11:43 AM Zheng Yu Chen <ja...@gmail.com>
> > wrote:
> >
> >     Hi community ~
> >
> >     I think this title should be quite interesting. The idea is to
> >     reduce the
> >     workload of the JobManager and make the SessionCluster [2] more
> >     stable in
> >     the process of running jobs. I designed a plan for splitting the
> >     JobManager
> >     on FLIP-257 [1]:
> >     https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> >     <https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process>
> >
> >     This proposal proposes a splitting scheme for the current process
> >     and a new
> >     process implementation idea that is compatible with the original
> >     process
> >     model: splitting the internal JobMaster component of the
> >     JobManager, and
> >     controlling whether to enable this new process through a parameter
> >     In the
> >     split scheme, when the user configures, the JobMaster will make it
> >     run as
> >     an independent service, reducing the workload of the JobManager. By
> >     implementing a new Dispatcher to communicate and interact with a
> >     single
> >     split JobMaster or multiple JobMasters, to achieve job management
> >
> >     The main features that it provides is:
> >
> >        - After the user submits the job, the JobMaster thread was
> >     split into
> >        other processes to run in the past. They no longer run in the
> >     JobManager,
> >        but in other processes.
> >        - Users can deploy multiple components mentioned above, which run
> >        multiple JobMaster threads, thereby reducing the workload of
> >     the JobManager
> >
> >     Some of the challenging use cases that these features solve are:
> >
> >        - Compatible with the original job running mode (run JobMaster
> >     Thread on
> >        JobManager)
> >        - Implement a new Dispatcher that forwards client operations
> >     related to
> >        jobs
> >
> >
> >      I would love to hear and address your thoughts and feedback , and if
> >     possible drive a FLIP-257 !
> >
> >
> >     [1]
> >     https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> >     <https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process>
> >
> >     [2]
> >     https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#session-mode
> >
> >
> >     --
> >
> >     Have a nice day ~
> >
> >     ConradJam
> >

-- 
Best

ConradJam

Re: [Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

Posted by Chesnay Schepler <ch...@apache.org>.
To be honest I'm terrified at the idea of splitting the Dispatcher into 
several processes, even more so if this is supposed to be opt-in and 
specific to session mode.
It would fragment the coordination layer even more than it already is, 
and make ops more complicated (yet another set of processes to monitor, 
configure etc.).

I'm not convinced that this proposal really gets us a lot of benefits; 
and would rather propose that you split your single session cluster into 
multiple session clusters (with the scheduling component in front of it 
to distribute jobs) to even the load.

 > The currently idling JobManagers could be utilized to take over some 
of the workload from the leader.

This would also be the path I would go down if we'd try to tackle this.

On 17/08/2022 16:22, Matthias Pohl wrote:
> Hi Conrad,
> thanks for reaching out to the community with your proposal. I looked 
> through FLIP-257 [1]. Your motivation sounds interesting. Can you 
> elaborate a bit more on the concrete use-cases you have in mind? How 
> do these match the user-cases of the two favored execution modes, i.e. 
> Flink's Session and Application mode?
>
> As mentioned in [2], Application Mode should be preferred for single 
> long-running jobs to isolate the resources of each of those jobs from 
> each other. In contrast, Session Mode is the natural choice when 
> running rather small/short-lived jobs (e.g. FlinkSQL queries) or when 
> deploying some kind of dev environment for testing out job 
> implementations. It feels like your use-case is somewhere in between a 
> bit? It would be interesting to get a better understanding of where 
> you're coming from. Maybe, you could provide some workload statistics?
>
> That considered, I guess it's a topic worth looking into. Here are a 
> few thoughts after looking into FLIP-257:
> - As far as I can see, the BlobServer is used for sharing 
> configuration information (e.g. Classpath information) as part of the 
> ExecutionGraph instantiation [3]. The JobGraph is not persisted 
> through the BlobServer but rather stored in the JobGraphStore backed 
> by the HighAvailabilityServices implementation. The HA side is not 
> really covered in FLIP-257, yet.
> - The approach of having the current Dispatcher living next to the new 
> JobMasterDispatcher (that encapsulates the logic around distributing 
> the workload onto multiple runners) leaves me with some doubt whether 
> there wouldn't be a better separation of concerns here. What about 
> leaving the Dispatcher as is but adding some abstraction between 
> JobManagerRunner/JobMaster and the Dispatcher that hides the logic 
> around whether these instances are "deployed" on the same machine or 
> somewhere else.
> - About distributing JobManager workload: The JobManager already 
> utilizes leader election for faster recovery. Hence, one can set up 
> multiple JobManagers in idle mode which wait to gain leadership and 
> pick up the work (i.e. recovering the jobs) of the previously failed 
> JobManager leader. What about utilizing this setup: The currently 
> idling JobManagers could be utilized to take over some of the workload 
> from the leader. I haven't thought this through, yet. But I'm 
> wondering whether that would be a path we could go down. This would 
> enable us to still stick to the JobManager/TaskManager setup which 
> users are already used to rather than introducing another type of 
> cluster node.
> - The JobManager initialization logic is kind of tricky to get your 
> head around. There is some overhead, I hope, we could clean up as part 
> of the efforts of removing the per-Job Mode from Flink [4]. It was 
> decided to deprecate per-Job Mode in Flink 1.15. But we have to stick 
> with it for some time (i.e. it's not going to be removed in 1.16) 
> since it's a quite basic feature users might rely on. This shouldn't 
> be a blocker. I just wanted to mention it to have it in the back of 
> our minds when looking into ways to come up with a solid proposal for 
> FLIP-257.
> - My concern is that this FLIP might turn out to be larger than 
> expected and that it might be worth cutting it down into smaller 
> chunks with each being covered in a separate FLIP down the road if we 
> have some agreement and a clearer picture on how this should be tackled.
>
> I'm gonna add Chesnay and David to this discussion.
>
> Best,
> Matthias
>
>
> [1] 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> [2] 
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#deployment-modes
> [3] 
> https://github.com/apache/flink/blob/9ed70a1e8b5d59abdf9d7673bc5b44d421140ef0/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/DefaultExecutionGraph.java#L333
> [4] https://lists.apache.org/thread/b8g76cqgtr2c515rd1bs41vy285f317n
>
>
> On Tue, Aug 16, 2022 at 11:43 AM Zheng Yu Chen <ja...@gmail.com> 
> wrote:
>
>     Hi community ~
>
>     I think this title should be quite interesting. The idea is to
>     reduce the
>     workload of the JobManager and make the SessionCluster [2] more
>     stable in
>     the process of running jobs. I designed a plan for splitting the
>     JobManager
>     on FLIP-257 [1]:
>     https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
>     <https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process>
>
>     This proposal proposes a splitting scheme for the current process
>     and a new
>     process implementation idea that is compatible with the original
>     process
>     model: splitting the internal JobMaster component of the
>     JobManager, and
>     controlling whether to enable this new process through a parameter
>     In the
>     split scheme, when the user configures, the JobMaster will make it
>     run as
>     an independent service, reducing the workload of the JobManager. By
>     implementing a new Dispatcher to communicate and interact with a
>     single
>     split JobMaster or multiple JobMasters, to achieve job management
>
>     The main features that it provides is:
>
>        - After the user submits the job, the JobMaster thread was
>     split into
>        other processes to run in the past. They no longer run in the
>     JobManager,
>        but in other processes.
>        - Users can deploy multiple components mentioned above, which run
>        multiple JobMaster threads, thereby reducing the workload of
>     the JobManager
>
>     Some of the challenging use cases that these features solve are:
>
>        - Compatible with the original job running mode (run JobMaster
>     Thread on
>        JobManager)
>        - Implement a new Dispatcher that forwards client operations
>     related to
>        jobs
>
>
>      I would love to hear and address your thoughts and feedback , and if
>     possible drive a FLIP-257 !
>
>
>     [1]
>     https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
>     <https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process>
>
>     [2]
>     https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#session-mode
>
>
>     -- 
>
>     Have a nice day ~
>
>     ConradJam
>

Re: [Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

Posted by Zheng Yu Chen <ja...@gmail.com>.
Zheng Yu Chen
18:00 (2分钟前)
发送至 dev
Thank you for your interest in this program. This proposal only takes
effect for Session Cluster Mode, and will not take effect for Application
Mode. I would like to answer some of your previous questions about patterns
and practical cases first, and then discuss some implementation details
later.
On a SessionCluster cluster in production, I run dozens or hundreds of
jobs, which are not short-running jobs, but long-running jobs. Why they
don't choose Applicantion Mode
   ○ This can save some JVM resources and reduce server costs
   ○ More adequate resource utilization
   ○ Starting Application Mode has a long resource application and waiting
(because SessionCluster has already applied for fixed TM and JM resources
at startup)
But we will also find the following potential problems
   ○ Poor isolation between JobMaster threads in JobManager: When there are
too many jobs, the JobManager is under great pressure. For example, when
running more than 20 jobs, because the JobMaster thread is inside the
JobManager process, it is equivalent to having more than 20 threads in one
maintenance The thread pools compete with each other for resources to
restore or compile JobGraph, etc.
   ○ JobManager's functional responsibilities are too large: With the
development of Flink, there will inevitably be more rich functions running
on JobMaster. As the title says, when in SessionCluster mode, should we
consider JobManager as a routing-like forwarding function , instead of a
component that integrates huge functions, proper splitting reduces the
workload of the JobManager
Based on the above situation, so there is the idea of the title. Can we
introduce the JobMasterContainer component to reduce the impact between
jobs, so that jobs can run more gracefully and for a long time on the
SessionCluster instead of just running rather small/short-lived jobs (e.g.
FlinkSQL queries) or when deploying some kind
of dev environment for testing out job implementations

What do you think ~

Matthias Pohl <ma...@aiven.io.invalid> 于2022年8月17日周三 22:22写道:

> Hi Conrad,
> thanks for reaching out to the community with your proposal. I looked
> through FLIP-257 [1]. Your motivation sounds interesting. Can you elaborate
> a bit more on the concrete use-cases you have in mind? How do these match
> the user-cases of the two favored execution modes, i.e. Flink's Session and
> Application mode?
>
> As mentioned in [2], Application Mode should be preferred for single
> long-running jobs to isolate the resources of each of those jobs from each
> other. In contrast, Session Mode is the natural choice when running rather
> small/short-lived jobs (e.g. FlinkSQL queries) or when deploying some kind
> of dev environment for testing out job implementations. It feels like your
> use-case is somewhere in between a bit? It would be interesting to get a
> better understanding of where you're coming from. Maybe, you could provide
> some workload statistics?
>
> That considered, I guess it's a topic worth looking into. Here are a few
> thoughts after looking into FLIP-257:
> - As far as I can see, the BlobServer is used for sharing configuration
> information (e.g. Classpath information) as part of the ExecutionGraph
> instantiation [3]. The JobGraph is not persisted through the BlobServer but
> rather stored in the JobGraphStore backed by the HighAvailabilityServices
> implementation. The HA side is not really covered in FLIP-257, yet.
> - The approach of having the current Dispatcher living next to the new
> JobMasterDispatcher (that encapsulates the logic around distributing the
> workload onto multiple runners) leaves me with some doubt whether there
> wouldn't be a better separation of concerns here. What about leaving the
> Dispatcher as is but adding some abstraction between
> JobManagerRunner/JobMaster and the Dispatcher that hides the logic around
> whether these instances are "deployed" on the same machine or somewhere
> else.
> - About distributing JobManager workload: The JobManager already utilizes
> leader election for faster recovery. Hence, one can set up multiple
> JobManagers in idle mode which wait to gain leadership and pick up the work
> (i.e. recovering the jobs) of the previously failed JobManager leader. What
> about utilizing this setup: The currently idling JobManagers could be
> utilized to take over some of the workload from the leader. I haven't
> thought this through, yet. But I'm wondering whether that would be a path
> we could go down. This would enable us to still stick to the
> JobManager/TaskManager setup which users are already used to rather than
> introducing another type of cluster node.
> - The JobManager initialization logic is kind of tricky to get your head
> around. There is some overhead, I hope, we could clean up as part of the
> efforts of removing the per-Job Mode from Flink [4]. It was decided to
> deprecate per-Job Mode in Flink 1.15. But we have to stick with it for some
> time (i.e. it's not going to be removed in 1.16) since it's a quite basic
> feature users might rely on. This shouldn't be a blocker. I just wanted to
> mention it to have it in the back of our minds when looking into ways to
> come up with a solid proposal for FLIP-257.
> - My concern is that this FLIP might turn out to be larger than expected
> and that it might be worth cutting it down into smaller chunks with each
> being covered in a separate FLIP down the road if we have some agreement
> and a clearer picture on how this should be tackled.
>
> I'm gonna add Chesnay and David to this discussion.
>
> Best,
> Matthias
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> [2]
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#deployment-modes
> [3]
>
> https://github.com/apache/flink/blob/9ed70a1e8b5d59abdf9d7673bc5b44d421140ef0/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/DefaultExecutionGraph.java#L333
> [4] https://lists.apache.org/thread/b8g76cqgtr2c515rd1bs41vy285f317n
>
>
> On Tue, Aug 16, 2022 at 11:43 AM Zheng Yu Chen <ja...@gmail.com>
> wrote:
>
> > Hi community ~
> >
> > I think this title should be quite interesting. The idea is to reduce the
> > workload of the JobManager and make the SessionCluster [2] more stable in
> > the process of running jobs. I designed a plan for splitting the
> JobManager
> > on FLIP-257 [1]:
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > <
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
> > >
> >
> > This proposal proposes a splitting scheme for the current process and a
> new
> > process implementation idea that is compatible with the original process
> > model: splitting the internal JobMaster component of the JobManager, and
> > controlling whether to enable this new process through a parameter In the
> > split scheme, when the user configures, the JobMaster will make it run as
> > an independent service, reducing the workload of the JobManager. By
> > implementing a new Dispatcher to communicate and interact with a single
> > split JobMaster or multiple JobMasters, to achieve job management
> >
> > The main features that it provides is:
> >
> >    - After the user submits the job, the JobMaster thread was split into
> >    other processes to run in the past. They no longer run in the
> > JobManager,
> >    but in other processes.
> >    - Users can deploy multiple components mentioned above, which run
> >    multiple JobMaster threads, thereby reducing the workload of the
> > JobManager
> >
> > Some of the challenging use cases that these features solve are:
> >
> >    - Compatible with the original job running mode (run JobMaster Thread
> on
> >    JobManager)
> >    - Implement a new Dispatcher that forwards client operations related
> to
> >    jobs
> >
> >
> >  I would love to hear and address your thoughts and feedback , and if
> > possible drive a FLIP-257 !
> >
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> > <
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
> > >
> >
> > [2]
> >
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#session-mode
> >
> >
> > --
> >
> > Have a nice day ~
> >
> > ConradJam
> >
>

Re: [Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

Posted by Matthias Pohl <ma...@aiven.io.INVALID>.
Hi Conrad,
thanks for reaching out to the community with your proposal. I looked
through FLIP-257 [1]. Your motivation sounds interesting. Can you elaborate
a bit more on the concrete use-cases you have in mind? How do these match
the user-cases of the two favored execution modes, i.e. Flink's Session and
Application mode?

As mentioned in [2], Application Mode should be preferred for single
long-running jobs to isolate the resources of each of those jobs from each
other. In contrast, Session Mode is the natural choice when running rather
small/short-lived jobs (e.g. FlinkSQL queries) or when deploying some kind
of dev environment for testing out job implementations. It feels like your
use-case is somewhere in between a bit? It would be interesting to get a
better understanding of where you're coming from. Maybe, you could provide
some workload statistics?

That considered, I guess it's a topic worth looking into. Here are a few
thoughts after looking into FLIP-257:
- As far as I can see, the BlobServer is used for sharing configuration
information (e.g. Classpath information) as part of the ExecutionGraph
instantiation [3]. The JobGraph is not persisted through the BlobServer but
rather stored in the JobGraphStore backed by the HighAvailabilityServices
implementation. The HA side is not really covered in FLIP-257, yet.
- The approach of having the current Dispatcher living next to the new
JobMasterDispatcher (that encapsulates the logic around distributing the
workload onto multiple runners) leaves me with some doubt whether there
wouldn't be a better separation of concerns here. What about leaving the
Dispatcher as is but adding some abstraction between
JobManagerRunner/JobMaster and the Dispatcher that hides the logic around
whether these instances are "deployed" on the same machine or somewhere
else.
- About distributing JobManager workload: The JobManager already utilizes
leader election for faster recovery. Hence, one can set up multiple
JobManagers in idle mode which wait to gain leadership and pick up the work
(i.e. recovering the jobs) of the previously failed JobManager leader. What
about utilizing this setup: The currently idling JobManagers could be
utilized to take over some of the workload from the leader. I haven't
thought this through, yet. But I'm wondering whether that would be a path
we could go down. This would enable us to still stick to the
JobManager/TaskManager setup which users are already used to rather than
introducing another type of cluster node.
- The JobManager initialization logic is kind of tricky to get your head
around. There is some overhead, I hope, we could clean up as part of the
efforts of removing the per-Job Mode from Flink [4]. It was decided to
deprecate per-Job Mode in Flink 1.15. But we have to stick with it for some
time (i.e. it's not going to be removed in 1.16) since it's a quite basic
feature users might rely on. This shouldn't be a blocker. I just wanted to
mention it to have it in the back of our minds when looking into ways to
come up with a solid proposal for FLIP-257.
- My concern is that this FLIP might turn out to be larger than expected
and that it might be worth cutting it down into smaller chunks with each
being covered in a separate FLIP down the road if we have some agreement
and a clearer picture on how this should be tackled.

I'm gonna add Chesnay and David to this discussion.

Best,
Matthias


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/overview/#deployment-modes
[3]
https://github.com/apache/flink/blob/9ed70a1e8b5d59abdf9d7673bc5b44d421140ef0/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/DefaultExecutionGraph.java#L333
[4] https://lists.apache.org/thread/b8g76cqgtr2c515rd1bs41vy285f317n


On Tue, Aug 16, 2022 at 11:43 AM Zheng Yu Chen <ja...@gmail.com> wrote:

> Hi community ~
>
> I think this title should be quite interesting. The idea is to reduce the
> workload of the JobManager and make the SessionCluster [2] more stable in
> the process of running jobs. I designed a plan for splitting the JobManager
> on FLIP-257 [1]:
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
> >
>
> This proposal proposes a splitting scheme for the current process and a new
> process implementation idea that is compatible with the original process
> model: splitting the internal JobMaster component of the JobManager, and
> controlling whether to enable this new process through a parameter In the
> split scheme, when the user configures, the JobMaster will make it run as
> an independent service, reducing the workload of the JobManager. By
> implementing a new Dispatcher to communicate and interact with a single
> split JobMaster or multiple JobMasters, to achieve job management
>
> The main features that it provides is:
>
>    - After the user submits the job, the JobMaster thread was split into
>    other processes to run in the past. They no longer run in the
> JobManager,
>    but in other processes.
>    - Users can deploy multiple components mentioned above, which run
>    multiple JobMaster threads, thereby reducing the workload of the
> JobManager
>
> Some of the challenging use cases that these features solve are:
>
>    - Compatible with the original job running mode (run JobMaster Thread on
>    JobManager)
>    - Implement a new Dispatcher that forwards client operations related to
>    jobs
>
>
>  I would love to hear and address your thoughts and feedback , and if
> possible drive a FLIP-257 !
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process
> >
>
> [2]
>
> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#session-mode
>
>
> --
>
> Have a nice day ~
>
> ConradJam
>

Re: [Discuss] Let's Session Cluster JobManager take a breather (FLIP-257: Flink JobManager Process Split)

Posted by Zheng Yu Chen <ja...@gmail.com>.
Thanks, for the community fallback suggestions. In fact, the problem I want
to solve is to reduce the current workload of the JobManager (as the title
says, more focus is on how to reduce the workload of the JobManager). First
my idea, I thought of reducing the resource overhead of the JobManager in
FLIP. The largest JobMaster migrates to a new component and hopes to share
this part with other work components to reduce the resource occupancy rate
of the JobManager. But later I thought about it and found that it can be
achieved by horizontally expanding JobManager, rather than adding a new
component to increase the overall coordination layer complexity. Maybe this
idea has a simpler implementation (mentioned later)

Here I would like to share the use of Session Cluster and whats problem in
prod env :

    * JobManager OOM: sometimes the operation faces sudden traffic peaks,
JM cannot perform some temporary horizontal expansion, resulting in
excessive pressure and OOM
    * job recovery time is too long: JobManager restart time is too long
for job redeploy after oom

Although the community advised deployed jobs with Application Mode, some
small jobs (such as short-lived batch jobs, simple stream jobs with only
1-3 slots used for a long time, FlinkSQL debugging jobs, etc.) I still
prefer to use Session Cluster because of this As *@David **Morávek*
said, our resources do not need to be initialized

After a few days of thinking, I think a reply to the question mentioned
earlier by *@David Morávek @Matthias Pohl @Chesnay Schepler @Xintong Song*

* The whole coordination layer brings a certain complexity
* Can solve the problem of high JobManager load, but not the best solution

Based on the above situation: using the current FLIP solution may not be
the optimal solution

After reading your comments carefully, I now have a new idea to share it💡

As *@Matthias Poh*l *@Xintong Song* said, we can consider transferring part
of the JobMaster workload to other Standby JobManagers.The benefits of
doing this are as follows:

* Similar to TaskManager for horizontal expansion, when we have a large
cluster of jobs, it can effectively slow down JVM FGC
* job recovery faster now: we move some JobMaster jobs to another candidate
JobManagers. When recovering jobs, we no longer need one JobManager to
recover all jobs
* Compared with the previous scheme, the complexity is reduced, and most of
the current code can be reused instead of breaking or adding a new
coordination layer

For user's use, they only need to configure the switch of this feature and
the number of JobManagers to enjoy the horizontal expansion of JobManager

If the community thinks this solution is feasible, I will rewrite my FLIP
and organize some of my specific ideas

Looking forward to your suggestions~

Zheng Yu Chen <ja...@gmail.com> 于2022年8月16日周二 17:40写道:

> Hi community ~
>
> I think this title should be quite interesting. The idea is to reduce the
> workload of the JobManager and make the SessionCluster [2] more stable in
> the process of running jobs. I designed a plan for splitting the JobManager
> on FLIP-257 [1]:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process>
>
> This proposal proposes a splitting scheme for the current process and a
> new process implementation idea that is compatible with the original
> process model: splitting the internal JobMaster component of the
> JobManager, and controlling whether to enable this new process through a
> parameter In the split scheme, when the user configures, the JobMaster will
> make it run as an independent service, reducing the workload of the
> JobManager. By implementing a new Dispatcher to communicate and interact
> with a single split JobMaster or multiple JobMasters, to achieve job
> management
>
> The main features that it provides is:
>
>    - After the user submits the job, the JobMaster thread was split into
>    other processes to run in the past. They no longer run in the JobManager,
>    but in other processes.
>    - Users can deploy multiple components mentioned above, which run
>    multiple JobMaster threads, thereby reducing the workload of the JobManager
>
> Some of the challenging use cases that these features solve are:
>
>    - Compatible with the original job running mode (run JobMaster Thread
>    on JobManager)
>    - Implement a new Dispatcher that forwards client operations related
>    to jobs
>
>
>  I would love to hear and address your thoughts and feedback , and if
> possible drive a FLIP-257 !
>
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+Process+Split
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-257+Flink+JobManager+JobMaster+Thread+Split+to+Process>
>
> [2]
> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#session-mode
>
>
> --
>
> Have a nice day ~
>
> ConradJam
>


-- 
Best

ConradJam