Posted to user@flink.apache.org by Vijay Jammi <vj...@gmail.com> on 2023/01/05 20:23:54 UTC

Flink Job Manager Recovery from EKS Node Terminations

Hi,

I have a question about Job Manager HA for Flink 1.15.

We currently run a standalone Flink cluster with a single JobManager and
multiple TaskManagers, deployed on top of a Kubernetes (EKS) cluster in
application mode (reactive mode).

The Task Managers are deployed as a ReplicaSet, and the single Job Manager
is configured to be highly available using the Kubernetes HA services, with
recovery data being written to S3:
      high-availability.storageDir: s3://<bucket-name>/flink/<app-name>/recovery

We have also configured our cluster to use the RocksDB state backend, with
checkpoints being written to S3:
      state.backend: rocksdb
      state.checkpoints.dir: s3://<bucket-name>/flink/<app-name>/checkpoints
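
Putting the above together, the relevant part of our flink-conf.yaml looks
roughly like the sketch below. Only the storageDir and checkpoint entries are
quoted from our actual configuration; the remaining keys (and the cluster-id
value) are included purely as an illustration of a typical standalone
Kubernetes HA + reactive setup.

      # Kubernetes HA services for the standalone deployment
      high-availability: kubernetes
      kubernetes.cluster-id: <app-name>
      high-availability.storageDir: s3://<bucket-name>/flink/<app-name>/recovery
      # RocksDB state backend with checkpoints in S3
      state.backend: rocksdb
      state.checkpoints.dir: s3://<bucket-name>/flink/<app-name>/checkpoints
      # Reactive mode (application mode with the adaptive scheduler)
      scheduler-mode: reactive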

Now, to test the Job Manager HA, we delete the Job Manager deployment (to
simulate a Job Manager crash). In this case Kubernetes (EKS) detects the
failure, launches a new Job Manager pod, and the application cluster is
recovered from the last successful checkpoint (Restoring job
000....0000 from Checkpoint 5 @ 167...3692 for 000....0000 located at
s3://.../checkpoints/00000...0000/chk-5).

However, if we terminate the underlying node (EC2 instance) on which the
Job Manager pod is scheduled, the cluster is unable to recover. Kubernetes
repeatedly tries to launch a new Job Manager pod, but this time the Job
Manager is unable to find a checkpoint to recover from (No checkpoint found
during restore) and eventually ends up in a CrashLoopBackOff status after
the maximum number of restart attempts.
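
For reference, this is roughly how we trigger the two scenarios (the
deployment name, namespace, and instance id below are placeholders, not our
actual resource names):

      # Scenario 1: simulate a Job Manager crash by deleting its deployment
      kubectl delete deployment <jobmanager-deployment> -n <namespace>

      # Scenario 2: terminate the EC2 node hosting the Job Manager pod
      kubectl get pod <jobmanager-pod> -n <namespace> -o wide   # shows the node it runs on
      aws ec2 terminate-instances --instance-ids <instance-id-of-that-node>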

Now the question is: does the Job Manager need to be configured to store its
state in a local working directory backed by persistent volumes? Any pointers
on how we can recover the cluster from such node failures or terminations?

Vijay Jammi

Re: Flink Job Manager Recovery from EKS Node Terminations

Posted by Yang Wang <da...@gmail.com>.
First, the JobManager does not store any persistent data locally when
Kubernetes HA + S3 is used. This means that you do not need to mount a PV for
the JobManager deployment.

Secondly, node failures or terminations should not cause
the CrashLoopBackOff status. One possible cause I could imagine is the bug
FLINK-28265 [1], which is fixed in 1.15.3.

BTW, it would be great if you could share the logs of the initial JobManager
pod and the crashed JobManager pod.
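
Something along these lines should be enough to capture them (the pod name
and namespace are placeholders):

      kubectl get pods -n <namespace>                         # find the JobManager pod name
      kubectl logs <jobmanager-pod> -n <namespace> > jm-current.log
      # if the container restarted in place, also grab the previous instance's logs
      kubectl logs <jobmanager-pod> -n <namespace> --previous > jm-previous.log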

[1]. https://issues.apache.org/jira/browse/FLINK-28265


Best,
Yang

