Posted to issues@flink.apache.org by "Neelishaa Srivastava (Jira)" <ji...@apache.org> on 2022/07/04 09:31:00 UTC

[jira] [Comment Edited] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

    [ https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562099#comment-17562099 ] 

Neelishaa Srivastava edited comment on FLINK-25098 at 7/4/22 9:30 AM:
----------------------------------------------------------------------

Hi Till Rohrmann,

Could you please guide us on a similar issue? Our jobmanager pod is stuck in the CrashLoopBackOff state. The logs were already posted by MAU CHEE YEN in the comment above.


was (Author: JIRAUSER292214):
Hi Till Rohrmann ,

can you please guide us to the similar issue jobmanager pod is stuck in CrashLoopBackOff state seen .The logs are already mentioned by MAU CHEE YEN on the above comment .

> Jobmanager CrashLoopBackOff in HA configuration
> -----------------------------------------------
>
>                 Key: FLINK-25098
>                 URL: https://issues.apache.org/jira/browse/FLINK-25098
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.13.2, 1.13.3
>         Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>            Reporter: Adrian Vasiliu
>            Priority: Critical
>         Attachments: JM-FlinkException-checkpointHA.txt, flink_checkpoint_issue.txt, iaf-insights-engine--7fc4-eve-29ee-ep-jobmanager-1-jobmanager.log, jm-flink-ha-jobmanager-log.txt, jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), enabling Flink HA with 3 replicas of the jobmanager leads to CrashLoopBackOff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of the jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
>  * Persistent jobs storage provided by the {{rocks-cephfs}} storage class (shared by all replicas, ReadWriteMany), with the mount path set via {{high-availability.storageDir: file///<dir>}}.
>  * OpenShift 4.9.5 and also 4.8.x. Reproduced in several clusters, so this is not a one-off problem.
> Remarks:
>  * This is a follow-up of https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524. 
>  * Picked Critical severity as HA is critical for our product.
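
The description above spells the storage directory value as file///<dir>. Whether that is a transcription slip in the report or the value actually deployed is not stated, and nothing in this thread confirms it as the cause of the crash loop. As a minimal sketch for ruling it out, the following Python helper (a hypothetical function, not part of Flink) checks that such a value parses as a URI with a scheme. The paths in the usage lines are made-up examples.

```python
from typing import List
from urllib.parse import urlparse


def check_storage_dir(value: str) -> List[str]:
    """Return a list of problems found in a storageDir-style URI string.

    Hypothetical sanity check for illustration only; it does not
    reproduce Flink's own validation of high-availability.storageDir.
    """
    problems = []
    parsed = urlparse(value)
    if not parsed.scheme:
        # "file///dir" lands here: without the colon there is no scheme.
        problems.append("no URI scheme (expected e.g. file:///<dir>)")
    elif parsed.scheme == "file":
        if parsed.netloc:
            # "file://mnt/ha" treats "mnt" as a host name.
            problems.append("host component present; a slash may be missing")
        if not parsed.path.startswith("/"):
            problems.append("path is not absolute")
    return problems


if __name__ == "__main__":
    # Example values are assumptions, not taken from the report.
    for candidate in ("file///mnt/flink-ha", "file:///mnt/flink-ha"):
        print(candidate, "->", check_storage_dir(candidate) or "looks well-formed")
```

A value that passes this check can still be wrong for other reasons (permissions, mount not shared across replicas), so it only narrows down one possible source of error.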



--
This message was sent by Atlassian Jira
(v8.20.10#820010)