Posted to issues@flink.apache.org by "Yanfei Lei (Jira)" <ji...@apache.org> on 2023/04/19 04:24:00 UTC
[jira] [Commented] (FLINK-31766) Restoring from a retained checkpoint that was generated with changelog backend enabled might fail due to missing files
[ https://issues.apache.org/jira/browse/FLINK-31766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713888#comment-17713888 ]
Yanfei Lei commented on FLINK-31766:
------------------------------------
After reproducing FLINK-31593 locally, I think the root cause is that `StatefulJobSavepointMigrationITCase` partially moves the snapshot files to a new [directory|https://github.com/apache/flink/blob/master/flink-tests/src/test/java/org/apache/flink/test/checkpointing/utils/SnapshotMigrationTestBase.java#L337-L342].
All state backends in `StatefulJobSavepointMigrationITCase` are non-incremental, so all of their files are placed in the chk-x folder. The files of the changelog state backend, however, are not all placed under chk-x; some of them are placed under the taskowned folder, for example:
{code:java}
├── chk-2
│ ├── 5487d0fd-a361-4085-8ee0-7364ffd4511a
│ ├── _metadata
│ └── d3596cf7-3c6e-4081-b37b-f5a3e1a40086
├── shared
└── taskowned
    ├── 01aefc31-8ee1-41a8-9cd3-a94ccf85052f
    ├── 02bf09d3-73db-4c45-b6a1-15987659e3e6
    ├── 0c456b9b-9f90-4696-a2be-16e5938358ae {code}
This also explains why this issue didn't show up earlier:
1. For versions <= 1.15, the changelog state backend is disabled.
2. For versions >= 1.16, the changelog state backend is randomly turned on; when it is turned off, this issue is not triggered.
So I have two questions:
# Should the incremental RocksDB state backend be tested here?
# Do we need to change the move function to support testing the changelog state backend?
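To make the second question concrete, here is a minimal sketch of what relocating the sibling folders along with chk-x could look like. It uses only `java.nio.file` and is not the actual code in `SnapshotMigrationTestBase`; the class name `CheckpointMoveSketch`, the helper names and the file name `changelog-file` are hypothetical. It copies instead of moving to stay non-destructive, and it does not address whether `_metadata` would resolve the relocated paths on restore.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CheckpointMoveSketch {

    /** Copies every file and directory under source to the same relative path under target. */
    static void copyTree(Path source, Path target) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(source)) {
            paths = walk.collect(Collectors.toList());
        }
        for (Path p : paths) {
            Path dest = target.resolve(source.relativize(p).toString());
            if (Files.isDirectory(p)) {
                Files.createDirectories(dest);
            } else {
                Files.createDirectories(dest.getParent());
                Files.copy(p, dest, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    /**
     * Hypothetical helper: copies chk-x and, if present, the sibling shared/ and
     * taskowned/ folders, so files written outside chk-x are not left behind.
     */
    static void copyCheckpointWithSiblings(Path checkpointDir, Path targetRoot)
            throws IOException {
        Path root = checkpointDir.getParent();
        copyTree(checkpointDir, targetRoot.resolve(checkpointDir.getFileName().toString()));
        for (String sibling : new String[] {"shared", "taskowned"}) {
            Path dir = root.resolve(sibling);
            if (Files.isDirectory(dir)) {
                copyTree(dir, targetRoot.resolve(sibling));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Rebuild a layout like the one listed above (file names are made up).
        Path src = Files.createTempDirectory("ckp-src");
        Files.createDirectories(src.resolve("chk-2"));
        Files.createFile(src.resolve("chk-2").resolve("_metadata"));
        Files.createDirectories(src.resolve("shared"));
        Files.createDirectories(src.resolve("taskowned"));
        Files.createFile(src.resolve("taskowned").resolve("changelog-file"));

        // Copying only chk-2 leaves the taskowned file behind.
        Path partial = Files.createTempDirectory("ckp-partial");
        copyTree(src.resolve("chk-2"), partial.resolve("chk-2"));
        System.out.println("partial copy has taskowned file: "
                + Files.exists(partial.resolve("taskowned").resolve("changelog-file")));

        // Copying chk-2 together with its siblings keeps it.
        Path full = Files.createTempDirectory("ckp-full");
        copyCheckpointWithSiblings(src.resolve("chk-2"), full);
        System.out.println("full copy has taskowned file: "
                + Files.exists(full.resolve("taskowned").resolve("changelog-file")));
    }
}
```

The point of the sketch is only that the unit to relocate would be the checkpoint root (chk-x plus shared and taskowned), not chk-x alone.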
[~roman] could you please help take a look?
> Restoring from a retained checkpoint that was generated with changelog backend enabled might fail due to missing files
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-31766
> URL: https://issues.apache.org/jira/browse/FLINK-31766
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.17.0, 1.16.1, 1.18.0
> Reporter: Matthias Pohl
> Priority: Major
> Attachments: FLINK-31593.StatefulJobSavepointMigrationITCase.create_snapshot.log, FLINK-31593.StatefulJobSavepointMigrationITCase.verify_snapshot.log
>
>
> In FLINK-31593 we discovered an instability when generating the test data for {{StatefulJobSavepointMigrationITCase}} and {{StatefulJobWBroadcastStateMigrationITCase}}. It appears that files are deleted that shouldn't be deleted (see [~Yanfei Lei]'s [comment in FLINK-31593|https://issues.apache.org/jira/browse/FLINK-31593?focusedCommentId=17706679&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17706679]).
> It's quite reproducible when generating the 1.17 test data for {{StatefulJobWBroadcastStateMigrationITCase}} and doing a test run to verify it.
> I'm attaching the debug logs of two such runs, which I generated for FLINK-31593, to this issue as well.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)