Posted to user@flink.apache.org by do...@dbruhn.de on 2017/11/22 09:41:23 UTC

Tooling for resuming from checkpoints

Hey,
we are running Flink 1.3.2 with streaming jobs, and we run into issues 
whenever we restart a complete job (which can happen for various 
reasons: upgrading the job, restarting the cluster, failures). The 
problem is that there is no automated way to find out which checkpoint 
metadata file (i.e. externalized checkpoint) we should resume from. We 
can always end up with several of those files, and we then want to use 
the most recent one that was written successfully.

Is there any tooling available already that picks the latest good 
checkpoint? Or at least a command-line tool we can use to verify that a 
checkpoint is valid, so we can pick the latest one ourselves?

How are others handling this? Manually?

Would be happy to get some input there,
Dominik

Re: Tooling for resuming from checkpoints

Posted by Timo Walther <tw...@apache.org>.

Hi Dominik,

The web UI shows you the status of each checkpoint [0], so it might be 
possible to retrieve that information via REST calls. For planned 
restarts, you should usually take a savepoint. If a savepoint completes 
successfully, you can be sure that you can restart from it.
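
As a rough illustration of the REST idea, here is a minimal Python sketch 
that picks the newest completed checkpoint from the checkpoint statistics 
of a running job. It is not an official tool. The endpoint path 
(/jobs/<jobid>/checkpoints) and the JSON field names ("history", 
"latest", "status", "external_path", "id") are assumptions based on the 
monitoring REST API and should be checked against the Flink version in 
use. The function names are made up for this example. It also only helps 
while the JobManager is still reachable.

```python
import json
from urllib.request import urlopen


def fetch_checkpoint_stats(base_url, job_id):
    """Fetch checkpoint statistics for one job (endpoint path is an assumption)."""
    with urlopen(f"{base_url}/jobs/{job_id}/checkpoints") as resp:
        return json.load(resp)


def latest_completed_checkpoint(stats):
    """Return (id, external_path) of the newest completed checkpoint, or None.

    `stats` is the parsed JSON document. Field names are assumed, see above.
    """
    candidates = list(stats.get("history") or [])

    # Also consider the "latest.completed" entry, if the response has one.
    latest = (stats.get("latest") or {}).get("completed")
    if latest:
        candidates.append(latest)

    # Keep only checkpoints that completed and expose an external path.
    usable = [
        c for c in candidates
        if c.get("status", "COMPLETED") == "COMPLETED" and c.get("external_path")
    ]
    if not usable:
        return None

    # Checkpoint IDs increase over time, so the highest ID is the newest.
    newest = max(usable, key=lambda c: c["id"])
    return newest["id"], newest["external_path"]
```

Calling latest_completed_checkpoint(fetch_checkpoint_stats(url, job_id)) 
with the JobManager's web address and a job ID would then yield a path 
that can be passed to "flink run -s <path>" when resubmitting the job.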

Otherwise, the platform from data Artisans [1] might be interesting for 
you. It aims to improve deployment across the lifecycle of streaming 
applications (disclaimer: I work for them).

Regards,
Timo


[0] https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/checkpoint_monitoring.html
[1] https://data-artisans.com/da-platform-2

