Posted to user@flink.apache.org by Robin Cassan <ro...@contentsquare.com> on 2021/09/01 09:38:15 UTC

Re: Cleaning old incremental checkpoint files

Thanks Robert for your answer; this seems to be what we observed when we
tried to delete the first time: Flink complained about missing files.
I'm wondering, then, how people are cleaning their storage for incremental
checkpoints. Is there any guarantee, when using TTLs, that after the TTL has
expired no files older than the TTL will be needed in the shared folder?
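
(For context on the "shared" folder: with RocksDB incremental checkpoints, a
retained checkpoint on S3 typically has a layout roughly like the sketch
below; the bucket name and job-id placeholders are illustrative. The new job
writes under its own job-id directory but may keep referencing SST files
under the old job's shared/ directory.)

s3://my-bucket/checkpoints/<old-job-id>/
    shared/                  <- RocksDB SST files, shared across checkpoints
    taskowned/
    chk-41/_metadata         <- retained checkpoint the new job restored from
s3://my-bucket/checkpoints/<new-job-id>/
    shared/
    chk-1/_metadata          <- may still reference files under the old job's shared/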

On Tue, Aug 3, 2021 at 13:29, Robert Metzger <rm...@apache.org> wrote:

> Hi Robin,
>
> Let's say you have two checkpoints #1 and #2, where #1 has been created by
> an old version of your job, and #2 has been created by the new version.
> When can you delete #1?
> In #1, there's a directory "/shared" that contains data that is also used
> by #2, because of the incremental nature of the checkpoints.
>
> You can not delete the data in the /shared directory, as this data is
> potentially still in use.
>
> I know this is only a partial answer to your question. I'll try to find
> out more details and extend my answer later.
>
>
> On Thu, Jul 29, 2021 at 2:31 PM Robin Cassan <
> robin.cassan@contentsquare.com> wrote:
>
>> Hi all!
>>
>> We've happily been running a Flink job in production for a year now, with
>> the RocksDB state backend and incremental retained checkpointing on S3. We
>> often release new versions of our jobs, which means we cancel the running
>> one and submit another while restoring the previous jobId's last retained
>> checkpoint.
>>
>> This works fine, but we also need to clean old files from S3 which are
>> starting to pile up. We are wondering two things:
>> - once the newer job has restored the older job's checkpoint, is it safe
>> to delete it? Or will the newer job's checkpoints reference files from the
>> older job, in which case deleting the old checkpoints might cause errors
>> during the next restore?
>> - also, since all our state has a 7-day TTL, is it safe to set a 7- or
>> 8-day retention policy on S3 which would automatically clean old files, or
>> might we still need to retain files older than 7 days even with the TTL?
>>
>> Don't hesitate to ask me if anything is not clear enough!
>>
>> Thanks,
>> Robin
>>
>

Re: Cleaning old incremental checkpoint files

Posted by Yun Tang <my...@live.com>.
Hi Robin,

You could use Checkpoints#loadCheckpointMetadata [1] to analyze the checkpoint metadata.

For the problem of making checkpoints self-contained, you might be interested in ticket [2].


[1] https://github.com/apache/flink/blob/8debdd06be0e917610c50a77893f7ade45cee98f/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/Checkpoints.java#L99
[2] https://issues.apache.org/jira/browse/FLINK-24149
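
For illustration, a rough, untested sketch of such a metadata-reading program is below. It relies on Flink's internal classes from the 1.13/1.14 era (Checkpoints, CheckpointMetadata, IncrementalRemoteKeyedStateHandle), so class names and signatures may differ in other versions; the class name and the metadata path are made up for the example.

import java.io.DataInputStream;
import java.io.FileInputStream;

import org.apache.flink.runtime.checkpoint.Checkpoints;
import org.apache.flink.runtime.checkpoint.OperatorState;
import org.apache.flink.runtime.checkpoint.OperatorSubtaskState;
import org.apache.flink.runtime.checkpoint.metadata.CheckpointMetadata;
import org.apache.flink.runtime.state.IncrementalRemoteKeyedStateHandle;
import org.apache.flink.runtime.state.KeyedStateHandle;
import org.apache.flink.runtime.state.StreamStateHandle;
import org.apache.flink.runtime.state.filesystem.FileStateHandle;

public class ListReferencedCheckpointFiles {

    public static void main(String[] args) throws Exception {
        // Path to a (downloaded) _metadata file of a retained checkpoint,
        // e.g. .../chk-1234/_metadata (illustrative).
        String metadataPath = args[0];

        CheckpointMetadata metadata;
        try (DataInputStream in = new DataInputStream(new FileInputStream(metadataPath))) {
            metadata = Checkpoints.loadCheckpointMetadata(
                    in, ListReferencedCheckpointFiles.class.getClassLoader(), metadataPath);
        }

        // Walk all operator/subtask states and print the remote files that the
        // incremental (RocksDB) keyed state still references. Anything printed
        // here is still needed, even if it lives under an older job's directory.
        for (OperatorState operatorState : metadata.getOperatorStates()) {
            for (OperatorSubtaskState subtaskState : operatorState.getStates()) {
                for (KeyedStateHandle handle : subtaskState.getManagedKeyedState()) {
                    if (handle instanceof IncrementalRemoteKeyedStateHandle) {
                        IncrementalRemoteKeyedStateHandle incremental =
                                (IncrementalRemoteKeyedStateHandle) handle;
                        for (StreamStateHandle shared : incremental.getSharedState().values()) {
                            if (shared instanceof FileStateHandle) {
                                System.out.println(((FileStateHandle) shared).getFilePath());
                            }
                        }
                    }
                }
            }
        }
    }
}

Pointing something like this at the newest retained checkpoints of the new job would show which files under the old job's shared/ directory are still referenced; in principle, only files that no retained checkpoint references would be safe to delete.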

Best
Yun Tang

Re: Cleaning old incremental checkpoint files

Posted by Robin Cassan <ro...@contentsquare.com>.
Hey Yun, thanks for the answer!

How would you analyze the checkpoint metadata? Would you build a program
with the State Processor API library, or is there a better way to do it?
I believe the option you mention would indeed facilitate cleaning. It would
still be manual (because we can't set up a periodic deletion), but at least
we could safely remove old folders with this option.

Thanks,
Robin

Re: Cleaning old incremental checkpoint files

Posted by Yun Tang <my...@live.com>.
Hi Robin,

It's not easy to clean incremental checkpoints, as different job instances have different checkpoint sub-directories (due to different job ids). You could analyze your checkpoint metadata to see which files are still in use in the older checkpoint directory.

BTW, I am also thinking of a possible solution: provide the ability to re-upload all files under a specific configuration option, so that the new job gets decoupled from the older checkpoints. Do you think that could resolve your case?

Best
Yun Tang