You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Matthias Pohl (Jira)" <ji...@apache.org> on 2023/03/24 18:26:00 UTC

[jira] [Comment Edited] (FLINK-28440) EventTimeWindowCheckpointingITCase failed with restore

    [ https://issues.apache.org/jira/browse/FLINK-28440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704750#comment-17704750 ] 

Matthias Pohl edited comment on FLINK-28440 at 3/24/23 6:25 PM:
----------------------------------------------------------------

I try to finalize FLINK-31593 but ran into an issue with the {{StatefulJobSavepointMigrationITCase}} for Flink version 1.17 with RocksDB state backend and {{SnapshotType.CHECKPOINT}}. The following exception is exposed through the logs:
{code}
34632 [Flat Map -> Sink: Unnamed (2/4)#30] WARN  org.apache.flink.runtime.taskmanager.Task [] - Flat Map -> Sink: Unnamed (2/4)#30 (98755c0995b5d43eae13ca81da224424_d1392353922252257afa15351a98bae9_1_30) switched from INITIALIZING to FAILED with failure cause:
java.lang.Exception: Exception while creating StreamOperatorStateContext.
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:258) ~[classes/:?]
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:256) ~[classes/:?]
	at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106) ~[classes/:?]
	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:747) ~[classes/:?]
	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55) ~[classes/:?]
	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:722) ~[classes/:?]
	at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:688) ~[classes/:?]
	at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952) ~[classes/:?]
	at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:921) [classes/:?]
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745) [classes/:?]
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) [classes/:?]
	at java.lang.Thread.run(Thread.java:750) [?:1.8.0_345]
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for StreamFlatMap_d1392353922252257afa15351a98bae9_(2/4) from any of the 1 provided restore options.
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160) ~[classes/:?]
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:355) ~[classes/:?]
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:166) ~[classes/:?]
	... 11 more
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:405) ~[classes/:?]
	at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:510) ~[classes/:?]
	at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:99) ~[classes/:?]
	at org.apache.flink.state.changelog.AbstractChangelogStateBackend.lambda$createKeyedStateBackend$1(AbstractChangelogStateBackend.java:145) ~[classes/:?]
	at org.apache.flink.state.changelog.restore.ChangelogBackendRestoreOperation.restore(ChangelogBackendRestoreOperation.java:72) ~[classes/:?]
	at org.apache.flink.state.changelog.ChangelogStateBackend.restore(ChangelogStateBackend.java:94) ~[classes/:?]
	at org.apache.flink.state.changelog.AbstractChangelogStateBackend.createKeyedStateBackend(AbstractChangelogStateBackend.java:136) ~[classes/:?]
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:338) ~[classes/:?]
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168) ~[classes/:?]
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135) ~[classes/:?]
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:355) ~[classes/:?]
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:166) ~[classes/:?]
	... 11 more
Caused by: java.io.FileNotFoundException: /tmp/junit5080387696602759609/checkpoints_9efe7b0c-8da8-4bef-9020-57abf2e29ac9/5fe4979761499f945215ac42815e65d4/taskowned/61058282-6146-48c0-9ece-b5d6bdd44234 (No such file or directory)
	at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_345]
	at java.io.FileInputStream.open(FileInputStream.java:195) ~[?:1.8.0_345]
	at java.io.FileInputStream.<init>(FileInputStream.java:138) ~[?:1.8.0_345]
	at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50) ~[classes/:?]
	at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134) ~[classes/:?]
	at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:69) ~[classes/:?]
	at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForStateHandle(RocksDBStateDownloader.java:127) ~[classes/:?]
	at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.lambda$createDownloadRunnables$0(RocksDBStateDownloader.java:110) ~[classes/:?]
	at org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49) ~[classes/:?]
	at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640) ~[?:1.8.0_345]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_345]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_345]
	... 1 more
{code}

which looks like the same issue we're having here in FLINK-28440. [~Yanfei Lei] can you give some guidance here? I created the reference test data for 1.17 based on {{release-1.17.0}} and ran the test based on {{release-1.17}}.


was (Author: mapohl):
I try to finalize FLINK-31593 but ran into an issue with the {{StatefulJobSavepointMigrationITCase}} for Flink version 1.17 with RocksDB state backend and {{SnapshotType.CHECKPOINT}}. The following exception is exposed through the logs:
{code}
34632 [Flat Map -> Sink: Unnamed (2/4)#30] WARN  org.apache.flink.runtime.taskmanager.Task [] - Flat Map -> Sink: Unnamed (2/4)#30 (98755c0995b5d43eae13ca81da224424_d1392353922252257afa15351a98bae9_1_30) switched from INITIALIZING to FAILED with failure cause:
java.lang.Exception: Exception while creating StreamOperatorStateContext.
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:258) ~[classes/:?]
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:256) ~[classes/:?]
	at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106) ~[classes/:?]
	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:747) ~[classes/:?]
	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55) ~[classes/:?]
	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:722) ~[classes/:?]
	at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:688) ~[classes/:?]
	at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952) ~[classes/:?]
	at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:921) [classes/:?]
	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745) [classes/:?]
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) [classes/:?]
	at java.lang.Thread.run(Thread.java:750) [?:1.8.0_345]
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for StreamFlatMap_d1392353922252257afa15351a98bae9_(2/4) from any of the 1 provided restore options.
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160) ~[classes/:?]
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:355) ~[classes/:?]
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:166) ~[classes/:?]
	... 11 more
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:405) ~[classes/:?]
	at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:510) ~[classes/:?]
	at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:99) ~[classes/:?]
	at org.apache.flink.state.changelog.AbstractChangelogStateBackend.lambda$createKeyedStateBackend$1(AbstractChangelogStateBackend.java:145) ~[classes/:?]
	at org.apache.flink.state.changelog.restore.ChangelogBackendRestoreOperation.restore(ChangelogBackendRestoreOperation.java:72) ~[classes/:?]
	at org.apache.flink.state.changelog.ChangelogStateBackend.restore(ChangelogStateBackend.java:94) ~[classes/:?]
	at org.apache.flink.state.changelog.AbstractChangelogStateBackend.createKeyedStateBackend(AbstractChangelogStateBackend.java:136) ~[classes/:?]
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:338) ~[classes/:?]
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168) ~[classes/:?]
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135) ~[classes/:?]
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:355) ~[classes/:?]
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:166) ~[classes/:?]
	... 11 more
Caused by: java.io.FileNotFoundException: /tmp/junit5080387696602759609/checkpoints_9efe7b0c-8da8-4bef-9020-57abf2e29ac9/5fe4979761499f945215ac42815e65d4/taskowned/61058282-6146-48c0-9ece-b5d6bdd44234 (No such file or directory)
	at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_345]
	at java.io.FileInputStream.open(FileInputStream.java:195) ~[?:1.8.0_345]
	at java.io.FileInputStream.<init>(FileInputStream.java:138) ~[?:1.8.0_345]
	at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50) ~[classes/:?]
	at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134) ~[classes/:?]
	at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:69) ~[classes/:?]
	at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.downloadDataForStateHandle(RocksDBStateDownloader.java:127) ~[classes/:?]
	at org.apache.flink.contrib.streaming.state.RocksDBStateDownloader.lambda$createDownloadRunnables$0(RocksDBStateDownloader.java:110) ~[classes/:?]
	at org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49) ~[classes/:?]
	at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640) ~[?:1.8.0_345]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_345]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_345]
	... 1 more
{code}

which looks like the same issue we're having here in FLINK-28440. [~Yanfei Lei] can you give some guidance here?

> EventTimeWindowCheckpointingITCase failed with restore
> ------------------------------------------------------
>
>                 Key: FLINK-28440
>                 URL: https://issues.apache.org/jira/browse/FLINK-28440
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / State Backends
>    Affects Versions: 1.16.0, 1.17.0
>            Reporter: Huang Xingbo
>            Assignee: Yanfei Lei
>            Priority: Critical
>              Labels: auto-deprioritized-critical, pull-request-available, test-stability
>             Fix For: 1.16.2, 1.18.0
>
>         Attachments: image-2023-02-01-00-51-54-506.png, image-2023-02-01-01-10-01-521.png, image-2023-02-01-01-19-12-182.png, image-2023-02-01-16-47-23-756.png, image-2023-02-01-16-57-43-889.png, image-2023-02-02-10-52-56-599.png, image-2023-02-03-10-09-07-586.png, image-2023-02-03-12-03-16-155.png, image-2023-02-03-12-03-56-614.png
>
>
> {code:java}
> Caused by: java.lang.Exception: Exception while creating StreamOperatorStateContext.
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:256)
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:268)
> 	at org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.initializeStateAndOpenOperators(RegularOperatorChain.java:106)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:722)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreInternal(StreamTask.java:698)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:665)
> 	at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:935)
> 	at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:904)
> 	at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:728)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for WindowOperator_0a448493b4782967b150582570326227_(2/4) from any of the 1 provided restore options.
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160)
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:353)
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:165)
> 	... 11 more
> Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: /tmp/junit1835099326935900400/junit1113650082510421526/52ee65b7-033f-4429-8ddd-adbe85e27ced (No such file or directory)
> 	at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:321)
> 	at org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.advance(StateChangelogHandleStreamHandleReader.java:87)
> 	at org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.hasNext(StateChangelogHandleStreamHandleReader.java:69)
> 	at org.apache.flink.state.changelog.restore.ChangelogBackendRestoreOperation.readBackendHandle(ChangelogBackendRestoreOperation.java:96)
> 	at org.apache.flink.state.changelog.restore.ChangelogBackendRestoreOperation.restore(ChangelogBackendRestoreOperation.java:75)
> 	at org.apache.flink.state.changelog.ChangelogStateBackend.restore(ChangelogStateBackend.java:92)
> 	at org.apache.flink.state.changelog.AbstractChangelogStateBackend.createKeyedStateBackend(AbstractChangelogStateBackend.java:136)
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:336)
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
> 	... 13 more
> Caused by: java.io.FileNotFoundException: /tmp/junit1835099326935900400/junit1113650082510421526/52ee65b7-033f-4429-8ddd-adbe85e27ced (No such file or directory)
> 	at java.io.FileInputStream.open0(Native Method)
> 	at java.io.FileInputStream.open(FileInputStream.java:195)
> 	at java.io.FileInputStream.<init>(FileInputStream.java:138)
> 	at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
> 	at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134)
> 	at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:87)
> 	at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:72)
> 	at org.apache.flink.changelog.fs.StateChangeFormat.read(StateChangeFormat.java:92)
> 	at org.apache.flink.runtime.state.changelog.StateChangelogHandleStreamHandleReader$1.advance(StateChangelogHandleStreamHandleReader.java:85)
> 	... 21 more
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=37772&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=8916
> Other tests where this stacktrace was observed in test failures is {{ChangelogRecoveryITCase}} (FLINK-30107) and {{ChangelogRecoverySwitchStateBackendITCase}} (FLINK-28898).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)