Posted to issues@flink.apache.org by "Konstantin Knauf (Jira)" <ji...@apache.org> on 2022/03/02 12:46:00 UTC

[jira] [Comment Edited] (FLINK-26273) Test checkpoints restore modes & formats

    [ https://issues.apache.org/jira/browse/FLINK-26273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497438#comment-17497438 ] 

Konstantin Knauf edited comment on FLINK-26273 at 3/2/22, 12:45 PM:
--------------------------------------------------------------------

[~dwysakowicz] Looks very good. Here's what I did. Did I miss anything from your perspective? I opened two tickets (linked to this ticket) with CLI-related issues. I also opened a hotfix PR with some documentation improvements: [https://github.com/apache/flink/pull/18909]

*Commit:* 0ff6f2cc78f

1. *Taking Canonical/Native Savepoints/Retained Checkpoint*

Ran TopSpeedWindowing in Standalone Application Mode with RocksDB (incremental) and checkpoints retained on cancellation three times:
 * stopped with native savepoint (Savepoint ID: savepoint-b6dea9-ec57dcec988e)
 * stopped with canonical savepoint (Savepoint ID: savepoint-0d0bb8-3cfceefe4dec)
 * cancelled (JobID: c40b0839cfa6a454919597819e8e84f6)

Checkpoint Directory
{noformat}
/tmp/flink-checkpoints
├── 0d0bb8faccf2eb8124d086a5355428a8
│   ├── shared
│   └── taskowned
├── b6dea9642f5159f83c32eca3fc40082a
│   ├── shared
│   └── taskowned
└── c40b0839cfa6a454919597819e8e84f6
    ├── chk-13
    │   └── _metadata
    ├── shared
    │   └── 1d438c44-c7a6-49c0-8053-1e5689a6df5c
    └── taskowned
{noformat}
Savepoint Directory
{noformat}
/tmp/flink-savepoints
├── savepoint-0d0bb8-3cfceefe4dec
│   └── _metadata
└── savepoint-b6dea9-ec57dcec988e
    ├── dd200786-54e3-4af3-a6f4-2943ff73bc14
    └── _metadata
{noformat}
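The listings above suggest that each savepoint directory name starts with the first six characters of the job ID that produced it. A minimal Python sketch of that matching, assuming this prefix convention holds (it is inferred from the directory names in this comment; the helper name is hypothetical):

```python
def match_savepoints_to_jobs(savepoint_dirs, job_ids):
    """Map each savepoint directory name to the job ID sharing its 6-character prefix."""
    matches = {}
    for name in savepoint_dirs:
        # Assumed layout: "savepoint-<first 6 hex chars of job ID>-<random suffix>".
        prefix = name.split("-")[1]
        matches[name] = next((j for j in job_ids if j.startswith(prefix)), None)
    return matches


# Names taken from the checkpoint and savepoint directories listed above.
savepoints = ["savepoint-0d0bb8-3cfceefe4dec", "savepoint-b6dea9-ec57dcec988e"]
jobs = [
    "0d0bb8faccf2eb8124d086a5355428a8",
    "b6dea9642f5159f83c32eca3fc40082a",
    "c40b0839cfa6a454919597819e8e84f6",
]
for savepoint, job in match_savepoints_to_jobs(savepoints, jobs).items():
    print(savepoint, "->", job)
```

The cancelled job (c40b08...) has no matching savepoint, which fits the run above: it only left a retained checkpoint behind.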
2. *Two Jobs can be Started from Native Savepoint without Claiming and take a full checkpoint*

Started 2 TopSpeedWindowing jobs (aca6b1fc37c489d608b8ab9d562cd569 & 634d99afcf280d7e6eefd7d9f2b0ec37) from the native savepoint without claiming it and confirmed that both took a full snapshot. (I took the fact that "Checkpointed Data Size" = "Full Checkpoint Data Size" for the first checkpoint only as a sign that this is the case.) Cancelled both jobs.
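
The size comparison used as a heuristic here can be written down as a small check. This is only a sketch: the dictionary keys are hypothetical stand-ins for the two values shown in the web UI, and the byte counts are made up for illustration, not measured in this test.

```python
def looks_like_full_snapshot(checkpoint):
    """Heuristic: the checkpointed size equals the full checkpoint size."""
    return checkpoint["checkpointed_size"] == checkpoint["full_size"]


# Illustrative values only (bytes); not taken from the actual test run.
checkpoints = [
    {"id": 1, "checkpointed_size": 52_000, "full_size": 52_000},
    {"id": 2, "checkpointed_size": 4_100, "full_size": 53_500},
    {"id": 3, "checkpointed_size": 3_900, "full_size": 54_200},
]

full_ids = [c["id"] for c in checkpoints if looks_like_full_snapshot(c)]
print(full_ids)
```

With these sample values only the first checkpoint passes the check, which is the pattern described above for a no-claim restore.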

3. *Two Jobs can be Started from Retained Checkpoint without Claiming and take a full checkpoint*

Like Step 2, but using the retained checkpoint from Step 1 instead of the native savepoint.

4. *Job can claim retained checkpoint and continues to checkpoint incrementally, retained checkpoint is cleaned up*

Started TopSpeedWindowing with claiming from the retained checkpoint of Step 1. Confirmed that the first checkpoint is incremental and that the original checkpoint directory contains no state files after a few checkpoints (only the empty shared and taskowned directories remain).
{noformat}
/tmp/flink-checkpoints/c40b0839cfa6a454919597819e8e84f6
├── shared
└── taskowned
{noformat}
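A check like the one above can be scripted instead of inspected by eye. The following Python sketch lists any files still present under a claimed checkpoint directory; the helper name is hypothetical, and treating "no remaining files" as "cleaned up" is an assumption of this sketch, not a rule stated by Flink.

```python
from pathlib import Path


def leftover_state_files(checkpoint_dir):
    """Return the relative paths of all files still present under checkpoint_dir."""
    root = Path(checkpoint_dir)
    if not root.exists():
        # A directory that was removed entirely has nothing left over.
        return []
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    # Path of the retained checkpoint from this comment; adjust for other setups.
    print(leftover_state_files("/tmp/flink-checkpoints/c40b0839cfa6a454919597819e8e84f6"))
```

An empty list means only empty directories (such as shared and taskowned) remain, or that the directory no longer exists at all.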
5. *Job can claim moved, native savepoint and continues to checkpoint incrementally, savepoint is cleaned up*

Copied the native savepoint from Step 1 to a different directory. Everything else as in Step 4. The directory of the moved savepoint no longer exists after a few checkpoints, and the first checkpoint is incremental.

6. *Native Savepoint can be removed after first successful checkpoint and recovery still works*

Started TopSpeedWindowing from the native savepoint. After one checkpoint, removed the savepoint, killed the TaskManager, and restarted it; the job recovered and continued checkpointing.


> Test checkpoints restore modes & formats
> ----------------------------------------
>
>                 Key: FLINK-26273
>                 URL: https://issues.apache.org/jira/browse/FLINK-26273
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Dawid Wysakowicz
>            Assignee: Konstantin Knauf
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.15.0
>
>
> We should manually test the changes introduced in [FLINK-25276] & [FLINK-25154].
> Proposal:
> Take a canonical savepoint, a native savepoint, and an externalised checkpoint (with RocksDB), perform claim (1) and no-claim (2) recoveries, and verify that:
> # in claim mode, the claimed files have been cleaned up after a couple of checkpoints
> # in no-claim mode, after a single successful checkpoint, you can remove the start-up files and fail over the job without any errors
> # you can take a native, incremental RocksDB savepoint, move it to a different directory, and restore from it
> documentation:
> # https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/savepoints/#restore-mode
> # https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/savepoints/#savepoint-format
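
The proposal in the issue description amounts to a small test matrix of snapshot types and restore modes. A minimal Python sketch that enumerates it follows; the labels are copied from the issue text and are not Flink configuration values, and the function name is hypothetical.

```python
from itertools import product

# Labels taken from the issue description, not from Flink's configuration.
SNAPSHOT_TYPES = ["canonical savepoint", "native savepoint", "externalised checkpoint"]
RESTORE_MODES = ["claim", "no claim"]


def restore_combinations():
    """Return every (snapshot type, restore mode) pair named in the proposal."""
    return list(product(SNAPSHOT_TYPES, RESTORE_MODES))


for snapshot, mode in restore_combinations():
    print(f"{snapshot}: restore with {mode}")
```

This gives six combinations in total. The comment above reports on some of them (for example, claiming the retained checkpoint and the moved native savepoint) and does not necessarily cover all six.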


