Posted to user@flink.apache.org by Averell <lv...@gmail.com> on 2020/06/10 11:02:17 UTC

Automatically resuming failed jobs in K8s

Hi,
I'm running some jobs using native Kubernetes. Sometimes, due to some unrelated
issue with our K8s cluster (e.g. a K8s node crashed), my Flink pods are gone.
The JM pod, as it is deployed using a Deployment, is re-created
automatically. However, all of my jobs are lost.
What I have to do now is (roughly sketched as commands below):
1. Re-upload the jars
2. Find the path to the last checkpoint of each job
3. Resubmit the job
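
Concretely, the manual recovery looks roughly like this today (just a
sketch; hosts, paths and the checkpoint number are placeholders):

# 1. Re-upload the jar via the JM's REST API
curl -X POST -H "Expect:" -F "jarfile=@/path/to/my-job.jar" \
    http://<jm-host>:8081/jars/upload
# 2. Find the latest checkpoint of the previous job on S3
aws s3 ls s3://<base_path>/<prev_job_id>/ | grep chk- | sort | tail -n 1
# 3. Resubmit the job, restoring from that checkpoint
./bin/flink run -s s3://<base_path>/<prev_job_id>/chk-2345/ /path/to/my-job.jar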

Is there any existing option to automate those steps? E.g.
1. Can I use a jar file stored in the JM's file system or on S3 instead of
uploading the jar file via REST interface?
2. When restoring the job, I need to provide the full path of the last
checkpoint (s3://<base_path>/<prev_job_id>/chk-2345/). Is there any option
to just provide the base_path?
3. Store the info to restore the jobs in the K8s deployment config

Thanks a lot.

Regards,
Averell




Re: Automatically resuming failed jobs in K8s

Posted by Averell <lv...@gmail.com>.
Thank you very much, Yang.




Re: Automatically resuming failed jobs in K8s

Posted by Yang Wang <da...@gmail.com>.
Hi Averell,

Thanks for trying the native K8s integration. All your issues are because
high availability is not configured. If you start an HA Flink cluster, like
the following, then when a JobManager/TaskManager terminates unexpectedly,
all the jobs can recover and restore from the latest checkpoint.
Even if you delete the Flink cluster, when you start a new one with the same
cluster-id, the jobs can also be recovered. Note that this only applies to
jobs that have not failed or been canceled.

Please remember that you need to put the S3 filesystem jar into the plugins
directory of the image manually[1].
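
For example, a minimal Dockerfile for that could look like the following
(just a sketch, assuming the official flink:1.10.1 image and the presto
S3 filesystem; adjust the version to yours):

FROM flink:1.10.1-scala_2.11
# the optional filesystem jars ship under /opt/flink/opt in the official image
RUN mkdir -p /opt/flink/plugins/s3-fs-presto && \
    cp /opt/flink/opt/flink-s3-fs-presto-1.10.1.jar /opt/flink/plugins/s3-fs-presto/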

./bin/kubernetes-session.sh \
  -Dkubernetes.cluster-id=k8s-ha-session-1 \
  -Dkubernetes.container.image=<IMAGE> \
  -Djobmanager.heap.size=4096m \
  -Dtaskmanager.memory.process.size=4096m \
  -Dtaskmanager.numberOfTaskSlots=4 \
  -Dkubernetes.jobmanager.cpu=1 \
  -Dkubernetes.taskmanager.cpu=2 \
  -Dhigh-availability=zookeeper \
  -Dhigh-availability.zookeeper.quorum=<ZK_QUORUM>:2181 \
  -Dhigh-availability.storageDir=s3://your-s3/flink-ha-k8s \
  -Drestart-strategy=fixed-delay \
  -Drestart-strategy.fixed-delay.attempts=10
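
After the session cluster is up, you could then submit your job with the
Flink CLI like below (the example jar is just a placeholder, use your own
job jar):

./bin/flink run -d -e kubernetes-session \
  -Dkubernetes.cluster-id=k8s-ha-session-1 \
  examples/streaming/WindowJoin.jar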



Moreover, we do not yet have a native K8s HA, so we still need to use the
ZooKeeper HA. But it is already planned[2] and I hope it can be done soon.
Then enabling HA for the K8s native integration will be more convenient.

[1].
https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/native_kubernetes.html#using-plugins
[2]. https://issues.apache.org/jira/browse/FLINK-12884

Best,
Yang



Averell <lv...@gmail.com> wrote on Wed, Jun 10, 2020 at 19:02:

> Hi,
> I'm running some jobs using native Kubernetes. Sometimes, for some
> unrelated
> issue with our K8s cluster (e.g: K8s node crashed), my Flink pods are gone.
> The JM pod, as it is deployed using a deployment, will be re-created
> automatically. However, all of my jobs are lost.
> What I have to do now are:
> 1. Re-upload the jars
> 2. Find the path to the last checkpoint of each job
> 3. Resubmit the job
>
> Is there any existing option to automate those steps? E.g.
> 1. Can I use a jar file stored in the JM's file system or on S3 instead of
> uploading the jar file via REST interface?
> 2. When restoring the job, I need to provide the full path of the last
> checkpoint (/s3://<base_path>/<prev_job_id>/chk-2345//). Is there any
> option
> to just provide the base_path?
> 3. Store the info to restore the jobs in the K8s deployment config
>
> Thanks a lot.
>
> Regards,
> Averell
>
>
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>