Posted to issues@flink.apache.org by "Konstantin Knauf (Jira)" <ji...@apache.org> on 2022/03/02 12:46:00 UTC
[jira] [Comment Edited] (FLINK-26273) Test checkpoints restore modes & formats
[ https://issues.apache.org/jira/browse/FLINK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497438#comment-17497438 ]
Konstantin Knauf edited comment on FLINK-26273 at 3/2/22, 12:45 PM:
--------------------------------------------------------------------
[~dwysakowicz] Looks very good. Here's what I did. Did I miss anything from your perspective? I opened two tickets (linked to this ticket) with CLI-related issues. I also opened a hotfix PR with some documentation improvements: [https://github.com/apache/flink/pull/18909]
*Commit:* 0ff6f2cc78f
1. *Taking Canonical/Native Savepoints/Retained Checkpoint*
Ran TopSpeedWindowing three times in Standalone Application Mode with RocksDB (incremental) and checkpoints retained on cancellation, ending each run differently:
* stopped with native savepoint (Savepoint ID: savepoint-b6dea9-ec57dcec988e)
* stopped with canonical savepoint (Savepoint ID: savepoint-0d0bb8-3cfceefe4dec)
* cancelled (JobID: c40b0839cfa6a454919597819e8e84f6)
Checkpoint Directory
{noformat}
/tmp/flink-checkpoints
├── 0d0bb8faccf2eb8124d086a5355428a8
│   ├── shared
│   └── taskowned
├── b6dea9642f5159f83c32eca3fc40082a
│   ├── shared
│   └── taskowned
└── c40b0839cfa6a454919597819e8e84f6
    ├── chk-13
    │   └── _metadata
    ├── shared
    │   └── 1d438c44-c7a6-49c0-8053-1e5689a6df5c
    └── taskowned
{noformat}
Savepoint Directory
{noformat}
/tmp/flink-savepoints
├── savepoint-0d0bb8-3cfceefe4dec
│   └── _metadata
└── savepoint-b6dea9-ec57dcec988e
    ├── dd200786-54e3-4af3-a6f4-2943ff73bc14
    └── _metadata
{noformat}
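In the two listings above, each savepoint directory name appears to begin with the first six characters of the ID of the job that produced it (for example, savepoint-b6dea9-... and b6dea9642f...). The following is a minimal Python sketch that pairs savepoints with job checkpoint directories on that basis. The helper name match_savepoints_to_jobs is hypothetical, and the naming pattern is assumed from these listings only.

```python
from typing import Dict, List, Optional


def match_savepoints_to_jobs(
    savepoints: List[str], job_ids: List[str]
) -> Dict[str, Optional[str]]:
    """Map each savepoint directory name to the job ID that shares its prefix.

    Assumption: names look like savepoint-<job id prefix>-<suffix>,
    as seen in the listings above. Unmatched savepoints map to None.
    """
    matches: Dict[str, Optional[str]] = {}
    for name in savepoints:
        parts = name.split("-")
        prefix = parts[1] if len(parts) >= 3 and parts[0] == "savepoint" else None
        matches[name] = next(
            (job for job in job_ids if prefix and job.startswith(prefix)), None
        )
    return matches


if __name__ == "__main__":
    # Directory names copied from the listings above.
    job_ids = [
        "0d0bb8faccf2eb8124d086a5355428a8",
        "b6dea9642f5159f83c32eca3fc40082a",
        "c40b0839cfa6a454919597819e8e84f6",
    ]
    savepoints = ["savepoint-0d0bb8-3cfceefe4dec", "savepoint-b6dea9-ec57dcec988e"]
    for savepoint, job in match_savepoints_to_jobs(savepoints, job_ids).items():
        print(savepoint, "->", job)
```

The cancelled job (c40b08...) has no matching savepoint, which is consistent with it leaving only a retained checkpoint (chk-13) behind.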
2. *Two Jobs can be Started from Native Savepoint without Claiming and take a full checkpoint*
Started two TopSpeedWindowing jobs (aca6b1fc37c489d608b8ab9d562cd569 & 634d99afcf280d7e6eefd7d9f2b0ec37) from the native savepoint without claiming it and confirmed that both took a full snapshot. (I took the fact that "Checkpointed Data Size" equals "Full Checkpoint Data Size" for the first checkpoint only as a sign that this is the case.) Cancelled both jobs.
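The size comparison used above as the indicator for a full snapshot can be written down as a small check. This is only a sketch of that heuristic in Python: the dictionary keys checkpointed_data_size and full_checkpoint_data_size are hypothetical names for the two values read from the checkpoint statistics, and the byte counts in the example are invented for illustration.

```python
from typing import Mapping, Sequence


def is_full_checkpoint(stats: Mapping[str, int]) -> bool:
    """Heuristic from the comment: a checkpoint is treated as full when the
    checkpointed data size equals the full checkpoint data size."""
    return stats["checkpointed_data_size"] == stats["full_checkpoint_data_size"]


def only_first_checkpoint_is_full(history: Sequence[Mapping[str, int]]) -> bool:
    """True if the first checkpoint looks full and all later ones do not."""
    if not history:
        return False
    first, rest = history[0], history[1:]
    return is_full_checkpoint(first) and not any(is_full_checkpoint(c) for c in rest)


if __name__ == "__main__":
    # Invented numbers, ordered oldest first.
    history = [
        {"checkpointed_data_size": 52_000, "full_checkpoint_data_size": 52_000},
        {"checkpointed_data_size": 4_100, "full_checkpoint_data_size": 53_500},
        {"checkpointed_data_size": 3_900, "full_checkpoint_data_size": 54_200},
    ]
    print(only_first_checkpoint_is_full(history))
```

Equal sizes are a sign rather than proof of a full snapshot, so this check is best read as a quick sanity test on the values shown in the UI.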
3. *Two Jobs can be Started from Retained Checkpoint without Claiming and take a full checkpoint*
Like Step 2, but using the retained checkpoint from Step 1 instead of the native savepoint.
4. *Job can claim retained checkpoint and continues to checkpoint incrementally, retained checkpoint is cleaned up*
Started TopSpeedWindowing with Claiming from the Retained Checkpoint of Step 1. Confirmed that the first Checkpoint is incremental and confirmed that the original checkpoint directory is empty after a few checkpoints.
{code:bash}
/tmp/flink-checkpoints/c40b0839cfa6a454919597819e8e84f6
├── shared
└── taskowned
{code}
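The "directory is empty" check from this step can be scripted. The sketch below, in Python with a hypothetical helper name is_cleaned_up, treats a directory as cleaned up when no regular files and no chk-* subdirectories remain under it. Empty shared and taskowned directories, as in the listing above, still count as clean. A directory that no longer exists also counts as clean.

```python
import os


def is_cleaned_up(checkpoint_dir: str) -> bool:
    """Return True if no files and no chk-* directories remain under the path.

    os.walk yields nothing for a missing directory, so a path that has been
    deleted entirely is reported as cleaned up as well.
    """
    for _root, dirs, files in os.walk(checkpoint_dir):
        if files or any(d.startswith("chk-") for d in dirs):
            return False
    return True


if __name__ == "__main__":
    # Path taken from the listing above; adjust to the job under test.
    path = "/tmp/flink-checkpoints/c40b0839cfa6a454919597819e8e84f6"
    print(path, "cleaned up:", is_cleaned_up(path))
```

Since the cleanup was only observed after a few checkpoints, a check like this is best run once the new job has completed several of them.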
5. *Job can claim moved, native savepoint and continues to checkpoint incrementally, moved savepoint is cleaned up*
Copied the native savepoint from Step 1 to a different directory. Everything else as in the previous step. The directory of the moved savepoint no longer exists after a few checkpoints, and the first checkpoint is incremental.
6. *Native Savepoint can be removed after first successful checkpoint and recovery still works*
Started TopSpeedWindowing from the native savepoint. After one checkpoint, removed the savepoint, killed the TaskManager, and restarted it; the job recovered and continued checkpointing.
> Test checkpoints restore modes & formats
> ----------------------------------------
>
> Key: FLINK-26273
> URL: https://issues.apache.org/jira/browse/FLINK-26273
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: Dawid Wysakowicz
> Assignee: Konstantin Knauf
> Priority: Blocker
> Labels: release-testing
> Fix For: 1.15.0
>
>
> We should manually test the changes introduced in [FLINK-25276] & [FLINK-25154].
> Proposal:
> Take a canonical savepoint/native savepoint/externalised checkpoint (with RocksDB), perform claim (1)/no-claim (2) recoveries, and verify that:
> # after a couple of checkpoints, the claimed files have been cleaned up
> # after a single successful checkpoint, you can remove the start-up files and fail over the job without any errors
> # you can take a native, incremental RocksDB savepoint, move it to a different directory, and restore from it
> documentation:
> # https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/savepoints/#restore-mode
> # https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/savepoints/#savepoint-format