Posted to user@flink.apache.org by Kevin Lam <ke...@shopify.com> on 2021/08/19 14:06:12 UTC

Job manager sometimes doesn't restore job from checkpoint post TaskManager failure

Hi all,

I've noticed that sometimes when task managers go down, the job does not
appear to be restored from a checkpoint but is instead restarted from fresh
state (when I go to the job's checkpoint tab in the UI, I don't see a
restore, and the numbers in the job overview all get reset). Under what
circumstances does this happen?

I've been trying to debug this, and for our use case we really need the job
to restore from a checkpoint at all times.

We're running Apache Flink 1.13 on Kubernetes in a high availability
set-up.
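
For reference, the set-up follows the documented Kubernetes HA services
configuration; flink-conf.yaml contains roughly the following (the cluster
id and storage path below are placeholders, not our actual values):

    kubernetes.cluster-id: my-flink-cluster
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: s3://my-bucket/flink/ha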

Thanks in advance!

Re: Job manager sometimes doesn't restore job from checkpoint post TaskManager failure

Posted by Kevin Lam <ke...@shopify.com>.
Hi,

I was able to understand what was causing this. We were using the
`fixed-delay` restart strategy with the maximum number of restart attempts
set to 10. Switching to `exponential-delay` resolved the issue of the job
restarting from fresh state.
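
For anyone who runs into the same behaviour, the change was roughly the
following in flink-conf.yaml (the delay values are illustrative, not our
exact settings). My understanding is that fixed-delay gives up for good once
its attempt budget is exhausted, while exponential-delay keeps retrying with
a growing back-off, so the job doesn't end up being resubmitted from fresh
state:

    # before: the job fails terminally after 10 failed restart attempts
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 10
    restart-strategy.fixed-delay.delay: 10 s

    # after: restarts indefinitely with exponentially growing back-off
    restart-strategy: exponential-delay
    restart-strategy.exponential-delay.initial-backoff: 1 s
    restart-strategy.exponential-delay.max-backoff: 5 min
    restart-strategy.exponential-delay.backoff-multiplier: 2.0
    restart-strategy.exponential-delay.reset-backoff-threshold: 10 min
    restart-strategy.exponential-delay.jitter-factor: 0.1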

On Thu, Aug 19, 2021 at 2:04 PM Chesnay Schepler <ch...@apache.org> wrote:

> How do you deploy Flink on Kubernetes? Do you use the standalone
> <https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/>
> or native
> <https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/native_kubernetes/>
> mode?
>
> Is it really just task managers going down? It seems unlikely that the
> loss of a TM could have such an effect.
>
> Can you provide us with the JobManager logs at the time the TM crash
> occurred? They should contain some hints as to how Flink handled the TM
> failure.
>

Re: Job manager sometimes doesn't restore job from checkpoint post TaskManager failure

Posted by Chesnay Schepler <ch...@apache.org>.
How do you deploy Flink on Kubernetes? Do you use the standalone 
<https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/> 
or native 
<https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/native_kubernetes/> 
mode?

Is it really just task managers going down? It seems unlikely that the 
loss of a TM could have such an effect.

Can you provide us with the JobManager logs at the time the TM crash 
occurred? They should contain some hints as to how Flink handled the TM 
failure.

