Posted to issues@flink.apache.org by "Ufuk Celebi (Jira)" <ji...@apache.org> on 2020/08/28 07:23:00 UTC

[jira] [Commented] (FLINK-18828) Terminate jobmanager process with zero exit code to avoid unexpected restarting by K8s

    [ https://issues.apache.org/jira/browse/FLINK-18828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186327#comment-17186327 ] 

Ufuk Celebi commented on FLINK-18828:
-------------------------------------

[~fly_in_gis] I think it makes sense to keep a non-zero exit code for failed jobs. How would users figure out whether the job has succeeded or not if we change the exit code?

Regarding the unexpected restarts: What about updating the {{restartPolicy}} to {{Never}} in the spec of the Kubernetes Job (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy)? That way, we would still have the information from the exit code and we wouldn't see any restarts by default. Users would also have the flexibility to change the behaviour depending on their use case by setting {{restartPolicy: OnFailure}} again.
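
A minimal sketch of what such a Job spec could look like is below. It is an illustration under stated assumptions, not the manifest shipped with Flink: the Job name, image tag, container args and job class name are placeholders. Note that {{restartPolicy: Never}} only stops the kubelet from restarting the container in place; the Job controller can still create replacement Pods until {{backoffLimit}} (default 6) is reached, so the sketch also sets {{backoffLimit: 0}}.

```yaml
# Sketch only: names, image and args are hypothetical placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: flink-jobmanager-job        # hypothetical name
spec:
  backoffLimit: 0                   # do not create replacement Pods after a failure
  template:
    spec:
      restartPolicy: Never          # keep the non-zero exit code, no in-place restart
      containers:
        - name: jobmanager
          image: flink:1.11.1       # placeholder image tag
          # Assumed entrypoint arguments for a standalone job cluster
          args: ["standalone-job", "--job-classname", "com.example.MyJob"]
```

With a spec along these lines, a failed job would leave the Pod in the {{Failed}} phase with its exit code intact, and users who want retries could switch back to {{restartPolicy: OnFailure}} or raise {{backoffLimit}}.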

 

> Terminate jobmanager process with zero exit code to avoid unexpected restarting by K8s
> --------------------------------------------------------------------------------------
>
>                 Key: FLINK-18828
>                 URL: https://issues.apache.org/jira/browse/FLINK-18828
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.1, 1.12.0, 1.11.1
>            Reporter: Yang Wang
>            Priority: Major
>             Fix For: 1.12.0, 1.11.2, 1.10.3
>
>
> Currently, the Flink jobmanager process terminates with a non-zero exit code if the job reaches {{ApplicationStatus.FAILED}}. This is not ideal in a K8s deployment, since a non-zero exit code will cause unexpected restarts. Also, from a framework's perspective, a FAILED job does not mean that Flink itself has failed and, hence, the return code could still be 0.
> Note:
> This is a special case for standalone K8s deployment. For standalone/Yarn/Mesos/native K8s, terminating with a non-zero exit code is harmless, and a non-zero exit code can help to check the job result quickly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)