Posted to issues@flink.apache.org by "Enrique Lacal (Jira)" <ji...@apache.org> on 2022/01/21 09:56:00 UTC

[jira] [Commented] (FLINK-25098) Jobmanager CrashLoopBackOff in HA configuration

    [ https://issues.apache.org/jira/browse/FLINK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479958#comment-17479958 ] 

Enrique Lacal commented on FLINK-25098:
---------------------------------------

Hi [~trohrmann] , 

I didn't manage to replicate the above error. I even tried with S3 back in December, left it running for a long period of time, and it worked.

Yes, we use StatefulSets for deploying the Flink JMs.

We have found a similar issue to the one above that reproduces consistently; the logs are attached: [^JM-FlinkException-checkpointHA.txt]. Our manual workaround is to delete the affected HA ConfigMap, which points to this checkpoint, but that is not feasible in a production environment. I would really appreciate any thoughts on this and on what sort of solution we could come to. Let me know if you need any more information; I'm trying to get the logs from before this occurred.
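For anyone hitting the same symptom, the manual workaround can be sketched roughly as below. The namespace and cluster-id are placeholders, not taken from our deployment; Flink's Kubernetes HA services label their ConfigMaps with {{type=flink-native-kubernetes}}, and the exact ConfigMap names depend on the configured cluster-id.

```shell
# Illustrative sketch of the manual workaround - namespace and cluster-id
# below are assumptions, substitute your own.

# List the HA ConfigMaps created by Flink's Kubernetes HA services:
kubectl -n flink get configmaps \
  -l app=my-flink-cluster,type=flink-native-kubernetes

# Inspect the ConfigMap that references the broken checkpoint, then
# delete it so the JobManager can recover (name is illustrative):
kubectl -n flink delete configmap my-flink-cluster-dispatcher-leader
```

Deleting the ConfigMap loses the leader/checkpoint pointers it held, which is exactly why this is not acceptable as a production procedure.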

Thanks,
Enrique

> Jobmanager CrashLoopBackOff in HA configuration
> -----------------------------------------------
>
>                 Key: FLINK-25098
>                 URL: https://issues.apache.org/jira/browse/FLINK-25098
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.13.2, 1.13.3
>         Environment: Reproduced with:
> * Persistent jobs storage provided by the rocks-cephfs storage class.
> * OpenShift 4.9.5.
>            Reporter: Adrian Vasiliu
>            Priority: Critical
>         Attachments: JM-FlinkException-checkpointHA.txt, iaf-insights-engine--7fc4-eve-29ee-ep-jobmanager-1-jobmanager.log, jm-flink-ha-jobmanager-log.txt, jm-flink-ha-tls-proxy-log.txt
>
>
> In a Kubernetes deployment of Flink 1.13.2 (also reproduced with Flink 1.13.3), turning to Flink HA by using 3 replicas of the jobmanager leads to CrashLoopBackoff for all replicas.
> Attaching the full logs of the {{jobmanager}} and {{tls-proxy}} containers of jobmanager pod:
> [^jm-flink-ha-jobmanager-log.txt]
> [^jm-flink-ha-tls-proxy-log.txt]
> Reproduced with:
>  * Persistent jobs storage provided by the {{rocks-cephfs}} storage class (shared by all replicas - ReadWriteMany) and mount path set via {{high-availability.storageDir: file:///<dir>}}.
>  * OpenShift 4.9.5 and also 4.8.x - reproduced in several clusters; it is not a one-off occurrence.
> Remarks:
>  * This is a follow-up of https://issues.apache.org/jira/browse/FLINK-22014?focusedCommentId=17450524&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17450524. 
>  * Picked Critical severity as HA is critical for our product.
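For reference, the HA setup described in the report corresponds roughly to the following flink-conf.yaml fragment; the cluster-id and storage path are illustrative placeholders, not values from the affected clusters.

```yaml
# Minimal Kubernetes HA sketch for Flink 1.13 - values are illustrative.
kubernetes.cluster-id: my-flink-cluster
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
# Shared ReadWriteMany volume mounted at the same path on every JobManager replica:
high-availability.storageDir: file:///flink-ha/recovery
```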



--
This message was sent by Atlassian Jira
(v8.20.1#820001)