Posted to issues@flink.apache.org by "Matthias Pohl (Jira)" <ji...@apache.org> on 2024/01/17 10:50:00 UTC

[jira] [Comment Edited] (FLINK-33998) Flink Job Manager restarted after kube-apiserver connection intermittent

    [ https://issues.apache.org/jira/browse/FLINK-33998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807687#comment-17807687 ] 

Matthias Pohl edited comment on FLINK-33998 at 1/17/24 10:49 AM:
-----------------------------------------------------------------

Unfortunately, I don't have the capacity to help you with the investigation. I'm gonna close the issue as "Not a problem" because it seems to be fixed in later versions. Thanks for sharing it anyway.


was (Author: mapohl):
Unfortunately, I don't have the capacity to help you with the investigation. I'm gonna close the issue as "Not a problem" because it seems to be fixed in later versions.

> Flink Job Manager restarted after kube-apiserver connection intermittent
> ------------------------------------------------------------------------
>
>                 Key: FLINK-33998
>                 URL: https://issues.apache.org/jira/browse/FLINK-33998
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.13.6
>         Environment: Kubernetes 1.24
> Flink Operator 1.4
> Flink 1.13.6
>            Reporter: Xiangyan
>            Priority: Major
>         Attachments: audit-log-no-restart.txt, audit-log-restart.txt, connection timeout.png, jm-no-restart4.log, jm-restart4.log
>
>
> We are running Flink on AWS EKS and experienced a Job Manager restart issue when the EKS control plane scaled up or in.
> I can reproduce this issue in my local environment too.
> Since I have no control over the EKS kube-apiserver, I built my own Kubernetes cluster with the setup below:
>  * Two kube-apiserver, only one is running at a time;
>  * Deploy multiple Flink clusters (with Flink Operator 1.4 and Flink 1.13);
>  * Enable Flink Job Manager HA;
>  * Configure Job Manager leader election timeout;
> {code:java}
> high-availability.kubernetes.leader-election.lease-duration: "60s"
> high-availability.kubernetes.leader-election.renew-deadline: "60s"{code}
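> For reference, a fuller Kubernetes HA configuration along these lines might look as follows (a minimal sketch; the cluster-id and storage path are illustrative placeholders, not values from this report):
> {code:java}
> kubernetes.cluster-id: my-flink-cluster
> high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> high-availability.storageDir: s3://my-bucket/flink/ha
> high-availability.kubernetes.leader-election.lease-duration: "60s"
> high-availability.kubernetes.leader-election.renew-deadline: "60s"
> high-availability.kubernetes.leader-election.retry-period: "5s"{code}
> With these settings, a JM that cannot renew its lease within the renew-deadline gives up leadership, which matches the restart behavior described below.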
> For each test, I switch the running kube-apiserver from one instance to the other. During the switchover, I can see that some Job Managers restart while others keep running normally.
> Here is an example. When the kube-apiserver switched over at 05:{color:#ff0000}{{*53*}}{color}:08, both JMs lost their connection to the kube-apiserver. But no further connection errors appeared after a few seconds, so I guess the connections recovered through retries.
> However, one of the JMs (the 2nd one in the attached screenshot) reported a "DefaultDispatcherRunner was revoked the leadership" error once the leader election timeout expired (at 05:{color:#ff0000}{{*54*}}{color}:08) and then restarted itself, while the other JM kept running normally.
> From the kube-apiserver audit logs, the normal JM was able to renew its leader lease after the interruption, but there were no lease renew requests from the failed JM until it restarted.
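> To cross-check the renew activity from the cluster side, the leader-election ConfigMaps that Flink 1.13's Kubernetes HA uses can be inspected directly (a sketch; the ConfigMap name is derived from the configured cluster-id and is a placeholder here):
> {code:java}
> # Flink 1.13 records leader information in a ConfigMap annotation;
> # its renewTime field shows when the lease was last renewed.
> kubectl get configmap my-flink-cluster-dispatcher-leader -o yaml \
>   | grep control-plane.alpha.kubernetes.io/leader{code}
> A healthy JM should keep advancing renewTime; a JM whose renewTime stops advancing until its restart would match what the audit logs show here.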
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)