Posted to user@flink.apache.org by Sameer Wadkar <sa...@axiomine.com> on 2018/10/17 23:19:33 UTC

State Recovery when job fails and auto-recovers

Hi,

We have a job that uses ValueState. We have turned off checkpoints. The state is backed by RocksDB, which in turn is backed by S3.

If the job fails on any exception (e.g. partitions not available or an occasional S3 404 error) and auto-recovers, is the entire state lost, or does the job continue from the last saved state? We see that the job has the same identifier. We don’t mind losing data during the small interval when the job is recovering. But because we are using ValueState as a custom global window to accumulate state for a key over a 3-hour window, we don’t want to lose all of it.

Checkpointing is not an option because each checkpoint takes a long time and the state is huge.

Thanks,
Sameer

Re: State Recovery when job fails and auto-recovers

Posted by Hequn Cheng <ch...@gmail.com>.

Hi Sameer,

In case of a failure, the job restarts the operators and resets them
to the latest successful checkpoint. So if you turn off checkpoints, all
state will be lost on recovery.
Generally speaking, snapshots are very lightweight and can be drawn
frequently without much impact on performance. If they do affect the
performance of your job and you don't want to lose all of your state, you
can try increasing the checkpoint interval.

> // start a checkpoint every 600000 ms (10min)
> env.enableCheckpointing(600000);

Best, Hequn
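
The recovery behaviour described in this reply can be illustrated with a small
toy model. This is only a sketch and does not use the Flink API: the class
CheckpointRecoverySketch and its methods are hypothetical names, and a plain
map stands in for keyed ValueState. It shows that a restart resets state to the
latest successful checkpoint, and to empty state when no checkpoint was taken.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; class and method names are hypothetical, not Flink API.
public class CheckpointRecoverySketch {
    // Live per-key state, standing in for ValueState held by an operator.
    private Map<String, Long> liveState = new HashMap<>();
    // Copy taken at the last successful checkpoint; null if none was taken.
    private Map<String, Long> lastCheckpoint = null;

    // Add a value to the running total for a key.
    public void accumulate(String key, long value) {
        liveState.merge(key, value, Long::sum);
    }

    // Record a snapshot of the current state.
    public void checkpoint() {
        lastCheckpoint = new HashMap<>(liveState);
    }

    // Simulate a failure followed by automatic recovery: state is reset to
    // the last snapshot, or to empty if checkpointing was never performed.
    public void failAndRestart() {
        liveState = (lastCheckpoint == null)
                ? new HashMap<>()
                : new HashMap<>(lastCheckpoint);
    }

    // Current total for a key, 0 if the key has no state.
    public long valueOf(String key) {
        return liveState.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        // Checkpoints turned off: everything accumulated so far is lost.
        CheckpointRecoverySketch noCheckpoints = new CheckpointRecoverySketch();
        noCheckpoints.accumulate("k", 5);
        noCheckpoints.accumulate("k", 7);
        noCheckpoints.failAndRestart();
        System.out.println("without checkpoints: " + noCheckpoints.valueOf("k"));

        // One checkpoint taken: only updates after it are lost.
        CheckpointRecoverySketch withCheckpoint = new CheckpointRecoverySketch();
        withCheckpoint.accumulate("k", 5);
        withCheckpoint.checkpoint();
        withCheckpoint.accumulate("k", 7);
        withCheckpoint.failAndRestart();
        System.out.println("with one checkpoint: " + withCheckpoint.valueOf("k"));
    }
}
```

In this model, a longer checkpoint interval means more updates can be lost
between the last snapshot and the failure, but never the whole 3-hour
accumulation, which is the trade-off suggested above.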
