Posted to user@flink.apache.org by Chen-Che Huang <ac...@gmail.com> on 2022/01/27 03:09:56 UTC

Questions about checkpoint retention

Hi all,


To minimize recovery time after a failure, we use incremental, retained
checkpoints with `state.checkpoints.num-retained` set to 10 in our Flink
apps. With this setting, Flink automatically creates new checkpoints
regularly and keeps only the latest 10. In addition, for app upgrades and
better reliability, we have a cron job that creates savepoints at regular
intervals.



We have two questions about checkpoint retention.

   1. When our cron job creates a savepoint called SP, it seems that the
   checkpoints created earlier than SP still cannot be deleted. We thought
   new checkpoints would be generated based on SP, so the old checkpoints
   before SP would become useless. However, the checkpoint mechanism doesn't
   seem to work that way. Is our understanding correct?
   2. To save storage cost, we’d like to know which checkpoints can be
   deleted. Currently, each version of our app has 10 checkpoints. Can we
   delete the checkpoints generated for previous versions of our app?


Any comments are appreciated!


Best wishes,

Chen-Che


An example is below (a checkpoint is created every 30 minutes, while a
savepoint is created every 2 hours):

1:00 Flink creates checkpoint
1:30 Flink creates checkpoint
2:00 Flink creates checkpoint
2:30 Flink creates checkpoint
3:00 Cron job creates savepoint (SP)
3:30 Flink creates checkpoint
4:00 Flink creates checkpoint
...
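
The timeline above can be replayed with a small toy model. This is only a
sketch, not Flink's implementation: it assumes that retention is a simple
"keep the latest N checkpoints" window and that savepoints do not count
toward that window (whether that holds may depend on the Flink version).
The function name `simulate` and the small `num_retained=3` are chosen
purely for illustration.

```python
from collections import deque


def simulate(events, num_retained):
    """Replay (time, kind) events and return (retained checkpoints, savepoints)."""
    # A bounded deque drops the oldest checkpoint once the window is full.
    retained = deque(maxlen=num_retained)
    savepoints = []
    for time, kind in events:
        if kind == "checkpoint":
            retained.append(time)
        elif kind == "savepoint":
            # Assumption: a savepoint is stored separately and does not
            # evict or replace any retained checkpoint.
            savepoints.append(time)
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return list(retained), savepoints


events = [
    ("1:00", "checkpoint"),
    ("1:30", "checkpoint"),
    ("2:00", "checkpoint"),
    ("2:30", "checkpoint"),
    ("3:00", "savepoint"),
    ("3:30", "checkpoint"),
    ("4:00", "checkpoint"),
]

retained, savepoints = simulate(events, num_retained=3)
print(retained)    # checkpoints still kept after 4:00
print(savepoints)  # savepoints taken so far
```

In this model, a checkpoint taken before SP leaves the window only when
enough newer checkpoints have arrived, not because SP was created.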

Re: Questions about checkpoint retention

Posted by "ChangZhuo Chen (陳昌倬)" <cz...@czchen.org>.
On Fri, Jan 28, 2022 at 02:43:11PM +0800, Caizhi Weng wrote:
> Chen-Che Huang <ac...@gmail.com> wrote on Thu, Jan 27, 2022 at 11:10:
> > We have two questions for checkpoint retention.
> >
> >    1. When our cron job creates a savepoint called SP, it seems those
> >    checkpoints created earlier SP still cannot be deleted. We thought the new
> >    checkpoints are generated based on SP and thus old checkpoints before SP
> >    will be useless. However, it seems the checkpoint mechanism doesn't work as
> >    we thought. Is what we thought correct?
> >    2. To save storage cost, we’d like to know what checkpoints can be
> >    deleted. Currently, each version of our app has 10 checkpoints. We wonder
> >    whether we can delete checkpoints generated for previous versions of our
> >    apps?

Some details about our setup:

* We have two GCS buckets to store checkpoints and savepoints:

  * gs://flink-checkpoints has no retention configuration.
  * gs://flink-savepoints has a retention of 5 days.

* The checkpoint configuration is:
  * state.backend.incremental: true
  * RETAIN_ON_CANCELLATION
* We create a savepoint every 4 hours for recovery.
* The business requires up to 180 days of historical data.

The questions are:

* We want to set retention on gs://flink-checkpoints to reduce storage
  cost. However, Flink sometimes cannot restore from a checkpoint due to
  missing data when retention is configured on gs://flink-checkpoints.
  Is there any way to configure retention safely for Flink?

* We don't use DELETE_ON_CANCELLATION, to avoid deleting state data
  accidentally.

-- 
ChangZhuo Chen (陳昌倬) czchen@{czchen,debian}.org
http://czchen.info/
Key fingerprint = BA04 346D C2E1 FE63 C790  8793 CC65 B0CD EC27 5D5B
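
One plausible explanation for the missing data, offered as an assumption
rather than a diagnosis confirmed in this thread: with incremental
checkpoints, a recent checkpoint can still reference files that were
uploaded by much older checkpoints, so an age-based bucket rule may delete
files that are still in use. The toy model below sketches that effect. The
file names, days, and helper functions are all hypothetical, and it does
not call any Flink or GCS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    name: str
    files: frozenset  # names of the shared files this checkpoint references


def surviving_files(created_day, today, max_age_days):
    """Files that an age-based retention rule would keep on `today`."""
    return {f for f, day in created_day.items() if today - day <= max_age_days}


def restorable(checkpoint, existing_files):
    """A checkpoint can be restored only if every referenced file still exists."""
    return checkpoint.files <= existing_files


# Hypothetical data: file name -> day on which it was uploaded.
created_day = {"sst-1": 0, "sst-2": 3, "sst-3": 9}

# The latest checkpoint (day 9) still references sst-1, because that part
# of the state has not changed since day 0.
latest = Checkpoint("chk-9", frozenset({"sst-1", "sst-3"}))

kept = surviving_files(created_day, today=10, max_age_days=5)
print(sorted(kept))
print(restorable(latest, kept))
```

In this model the age of a file says nothing about whether a recent
checkpoint still needs it, which is why a purely time-based rule on the
checkpoint bucket can be unsafe.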

Re: Questions about checkpoint retention

Posted by Caizhi Weng <ts...@gmail.com>.
Hi!

So you'd like to remove all checkpoints after a savepoint is completed?
Could you elaborate on why you'd like to retain 10 checkpoints? In most
cases, retaining one checkpoint is enough.

You also mentioned that you're keeping 10 checkpoints for each version of
your app. Do you have a corresponding job running for each version? If so,
those checkpoints are needed. If you're only running the latest version of
the job and will never recover from an older version, then the checkpoints
for older versions can be discarded. In short, it all depends on your needs.
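
The rule of thumb above can be sketched as a small selection step. This is
only an illustration: the version labels and paths are hypothetical, and it
assumes each version's checkpoints live in their own directory and that no
newer job still references files under an older version's directory (for
example, because each upgrade was started from a savepoint).

```python
def deletable_checkpoint_dirs(checkpoint_dirs, versions_to_keep):
    """Return directories of versions that will never be recovered from.

    checkpoint_dirs maps an app version label to its checkpoint directory.
    """
    return sorted(
        path
        for version, path in checkpoint_dirs.items()
        if version not in versions_to_keep
    )


# Hypothetical layout: one checkpoint directory per app version.
checkpoint_dirs = {
    "v1": "checkpoints/v1",
    "v2": "checkpoints/v2",
    "v3": "checkpoints/v3",
}

# Only the latest version is running, and older ones will not be restored.
candidates = deletable_checkpoint_dirs(checkpoint_dirs, versions_to_keep={"v3"})
print(candidates)
```

The function only lists candidates; it deliberately deletes nothing, so the
result can be reviewed before any cleanup.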
