Posted to issues@flink.apache.org by "Weihua Hu (Jira)" <ji...@apache.org> on 2023/09/18 11:28:00 UTC

[jira] [Commented] (FLINK-33096) Flink on k8s: if one TaskManager pod crashes, the whole Flink job fails

    [ https://issues.apache.org/jira/browse/FLINK-33096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766335#comment-17766335 ] 

Weihua Hu commented on FLINK-33096:
-----------------------------------

[~wawa] I think this is because the job failed after some TM pods crashed. You need to pick a restart strategy for your job. Please refer to [https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/state/task_failure_recovery/#restart-strategies]
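
As a minimal sketch (the class name RestartStrategyExample and the attempt/delay values are illustrative assumptions, not recommendations), a fixed-delay restart strategy can be set programmatically as shown below; the same effect can be achieved in flink-conf.yaml via restart-strategy: fixed-delay, restart-strategy.fixed-delay.attempts and restart-strategy.fixed-delay.delay. Note that if checkpointing is not enabled, the default strategy is "none", so a single TaskManager failure fails the whole job.

    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RestartStrategyExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Illustrative values: restart the job up to 3 times, waiting 10 seconds
            // between attempts, instead of failing permanently on the first TM loss.
            env.setRestartStrategy(
                    RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

            // ... build and execute the job as usual ...
            // env.execute("my-job");
        }
    }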

> Flink on k8s: if one TaskManager pod crashes, the whole Flink job fails
> -----------------------------------------------------------------------
>
>                 Key: FLINK-33096
>                 URL: https://issues.apache.org/jira/browse/FLINK-33096
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.14.3
>            Reporter: wawa
>            Priority: Major
>
> The Flink version is 1.14.3, and the job is submitted to Kubernetes in Native Kubernetes application mode. When a TaskManager pod crashes due to an exception, Kubernetes attempts to start a new TaskManager pod, but the scheduling process is halted immediately and the entire Flink job is terminated. By contrast, if the JobManager pod crashes, Kubernetes successfully schedules a new JobManager pod. This behavior was observed during application usage. Can you please help analyze the underlying issue?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)