Posted to dev@flink.apache.org by "huntercc (Jira)" <ji...@apache.org> on 2021/08/21 01:30:00 UTC

[jira] [Created] (FLINK-23905) Reduce the load on JobManager when submitting large-scale job with a big user jar

huntercc created FLINK-23905:
--------------------------------

             Summary: Reduce the load on JobManager when submitting large-scale job with a big user jar
                 Key: FLINK-23905
                 URL: https://issues.apache.org/jira/browse/FLINK-23905
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Task
            Reporter: huntercc


As described in FLINK-20612 and FLINK-21731, several steps in the job startup phase are time-consuming. Recently, we found that when submitting a large-scale job with a large user jar, the time spent moving tasks from deploying to running accounts for a high proportion of the total startup time.

In the task initialization stage, the user jar needs to be pulled from the JobManager through the BlobService. The JobManager has to devote a lot of computing power to distributing the files, which puts it under heavy load during the start-up stage. Moreover, under this load the JobManager may fail to respond in time to RPC requests sent by the TaskManagers, causing timeout exceptions such as Akka timeouts, which lead to job restarts and further prolong the start-up time of the job.
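
To give a rough feel for the scale of the problem, here is a back-of-the-envelope sketch. It assumes that every TaskManager pulls the user jar exactly once and directly from the JobManager, with no other distribution path. The class name, method names and all numbers (1000 TaskManagers, a 500 MiB jar, a 10 Gbit/s link) are hypothetical and are not measurements from this issue.

```java
public class BlobLoadEstimate {

    /** Total bytes the JobManager serves if every TaskManager pulls the jar once. */
    static long totalBytesServed(int taskManagers, long jarSizeBytes) {
        // multiplyExact throws instead of silently overflowing on huge inputs
        return Math.multiplyExact((long) taskManagers, jarSizeBytes);
    }

    /** Lower bound on the time needed to push that volume through a single link. */
    static double secondsToServe(long totalBytes, long linkBytesPerSecond) {
        return (double) totalBytes / linkBytesPerSecond;
    }

    public static void main(String[] args) {
        int taskManagers = 1000;                   // hypothetical cluster size
        long jarSizeBytes = 500L * 1024 * 1024;    // hypothetical 500 MiB user jar
        long linkBytesPerSecond = 1_250_000_000L;  // hypothetical 10 Gbit/s link

        long total = totalBytesServed(taskManagers, jarSizeBytes);
        System.out.printf("bytes served: %d, at least %.1f s on the wire%n",
                total, secondsToServe(total, linkBytesPerSecond));
    }
}
```

The estimate covers network transfer only and ignores CPU cost, disk reads and concurrent RPC handling, so the real pressure on the JobManager is likely higher than this lower bound suggests.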



--
This message was sent by Atlassian Jira
(v8.3.4#803005)