Posted to dev@flink.apache.org by "Junfan Zhang (Jira)" <ji...@apache.org> on 2021/03/13 12:38:00 UTC

[jira] [Created] (FLINK-21768) Optimize system.exit() logic of CliFrontend

Junfan Zhang created FLINK-21768:
------------------------------------

             Summary: Optimize system.exit() logic of CliFrontend
                 Key: FLINK-21768
                 URL: https://issues.apache.org/jira/browse/FLINK-21768
             Project: Flink
          Issue Type: Improvement
          Components: Command Line Client
            Reporter: Junfan Zhang


h2. Why 
We ran into a problem when integrating Oozie with a Flink batch action. 
Oozie uses a launcher job to start the Flink client, which submits the Flink job to Hadoop YARN. 
When the Flink client finishes, Oozie reads its exit code to determine the job submission status and then performs follow-up actions.

So how does Oozie catch {{System.exit()}}? It installs its own implementation of the JDK SecurityManager ([Oozie related code link|https://github.com/apache/oozie/blob/f1e01a9e155692aa5632f4573ab1b3ebeab7ef45/sharelib/oozie/src/main/java/org/apache/oozie/action/hadoop/security/LauncherSecurityManager.java#L24]). 

Currently, when the Flink client finishes successfully, it calls {{System.exit(0)}} ([Flink related code link|https://github.com/apache/flink/blob/195298aea327b3f98d9852121f0f146368696300/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java#L1133]). 
The JVM then consults the LauncherSecurityManager (implemented by Oozie) to handle the {{System.exit(0)}} call, which triggers {{LauncherSecurityManager.checkExit()}}, and that method throws an exception. 
Finally, the Flink client catches this exception as a {{Throwable}} and calls {{System.exit(31)}} ([related code link|https://github.com/apache/flink/blob/195298aea327b3f98d9852121f0f146368696300/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java#L1139]). As a result, Oozie misjudges the status of the Flink job.
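
The sequence above can be reproduced in a small, self-contained sketch. It is not the actual Flink or Oozie code: {{InterceptingExit}} is a hypothetical stand-in for {{System.exit()}} running under an intercepting security manager (it records the status and throws), and {{exitInsideTry}} / {{exitOutsideTry}} are illustrative names. The sketch also assumes the launcher reports the last status it recorded. The second method shows one possible restructuring, where the exit code is computed inside the {{try}} block and the exit call is made exactly once outside it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class ExitCodeSketch {

    /** Hypothetical stand-in for System.exit() under an intercepting security manager. */
    static class InterceptingExit {
        final List<Integer> codes = new ArrayList<>();

        void exit(int status) {
            codes.add(status); // the launcher remembers the requested status
            throw new SecurityException("Intercepted exit(" + status + ")");
        }

        int lastCode() {
            return codes.get(codes.size() - 1);
        }
    }

    /** Pattern described above: the exit call sits inside the try block. */
    static void exitInsideTry(InterceptingExit system) {
        try {
            int retCode = 0; // assume the job submission succeeded
            system.exit(retCode); // throws, so control jumps to the catch block
        } catch (Throwable t) {
            system.exit(31); // overwrites the successful status with 31
        }
    }

    /** One possible restructuring: decide the code first, exit once afterwards. */
    static void exitOutsideTry(InterceptingExit system) {
        int retCode = 31;
        try {
            retCode = 0; // assume the job submission succeeded
        } catch (Throwable t) {
            retCode = 31; // only genuine client failures end up here
        }
        system.exit(retCode);
    }

    /** Runs a client entry point the way a launcher might and returns the last recorded status. */
    static int observedCode(Consumer<InterceptingExit> clientMain) {
        InterceptingExit system = new InterceptingExit();
        try {
            clientMain.accept(system);
        } catch (SecurityException expected) {
            // the launcher swallows the interception exception
        }
        return system.lastCode();
    }

    public static void main(String[] args) {
        System.out.println("exit inside try:  " + observedCode(ExitCodeSketch::exitInsideTry));
        System.out.println("exit outside try: " + observedCode(ExitCodeSketch::exitOutsideTry));
    }
}
```

Under these assumptions, the first variant makes the launcher see 31 even though the submission succeeded, while the second leaves the original status 0 in place.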

Admittedly, this is a corner case; in most scenarios the situation described above will not occur. It is still worth optimizing the client's exit logic. 

Besides, I think the same problem may also exist in other frameworks that use the Flink client to submit batch jobs, such as linkedin/azkaban and apache/airflow.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)