Posted to issues@flink.apache.org by "Niklas Semmler (Jira)" <ji...@apache.org> on 2022/01/27 14:17:00 UTC

[jira] [Comment Edited] (FLINK-25564) TaskManagerProcessFailureStreamingRecoveryITCase>AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure fails on AZP

    [ https://issues.apache.org/jira/browse/FLINK-25564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483165#comment-17483165 ] 

Niklas Semmler edited comment on FLINK-25564 at 1/27/22, 2:16 PM:
------------------------------------------------------------------

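The error is caused by an exception thrown while cleaning up the [checkpoint directory|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-tests/src/test/java/org/apache/flink/test/recovery/TaskManagerProcessFailureStreamingRecoveryITCase.java#L105]. The "_metadata" file is removed during the cleanup process (to be exact, while the process is trying to ensure that the file is not read-only).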

I am not sure why the file disappears. My guess is that the Flink cluster performs a separate cleanup, possibly [here|https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCompletedCheckpointStorageLocation.java#L70], and deletes the "_metadata" file at the same time. The likelihood of the two cleanups overlapping is small, which could explain why this failure does not come up often.

Two options to solve this would be to:
 # Wait, to give the checkpoint cleanup initiated by the Flink cluster time to finish.
 # Ignore the exception thrown by the cleanup initiated by the test.
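
To make the two options concrete, here is a minimal sketch using only java.nio.file. It is not the code the test or Flink actually uses: the class name TolerantCleanup and both method names are hypothetical, and it assumes the failure really is a file vanishing between directory listing and deletion. deleteRecursively illustrates the second option (treat an already-deleted entry as success), and deleteWithRetry illustrates the first (wait and try again).

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper, for illustration only; not part of Flink or the test.
final class TolerantCleanup {

    private TolerantCleanup() {}

    /** Option 2: delete a directory tree, ignoring entries removed concurrently. */
    static void deleteRecursively(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                // deleteIfExists does not fail if another cleanup got there first.
                Files.deleteIfExists(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc)
                    throws IOException {
                // The entry vanished before its attributes could be read.
                if (exc instanceof NoSuchFileException) {
                    return FileVisitResult.CONTINUE;
                }
                throw exc;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc)
                    throws IOException {
                if (exc != null && !(exc instanceof NoSuchFileException)) {
                    throw exc;
                }
                Files.deleteIfExists(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    /** Option 1: retry the cleanup after a pause, so a concurrent cleanup can finish. */
    static void deleteWithRetry(Path root, int maxAttempts, long waitMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                deleteRecursively(root);
                return;
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(waitMillis);
                }
            }
        }
        throw last;
    }
}
```

Only NoSuchFileException is swallowed here, so unrelated I/O errors would still fail the test; whether that is the right boundary depends on what the cleanup in the test actually throws.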


was (Author: JIRAUSER281719):
The error is caused by an exception while cleaning up the [checkpoint directory|[https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-tests/src/test/java/org/apache/flink/test/recovery/TaskManagerProcessFailureStreamingRecoveryITCase.java#L105].] The "_metadata" file is removed during the clean up process. (To be exact while the process is trying to ensure that the file is not read-only).

I am not sure, why the file disappears. I'd guess that the Flink Cluster does a separate clean-up. Possibly [here]([https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-runtime/src/main/java/org/apache/flink/runtime/state/filesystem/FsCompletedCheckpointStorageLocation.java#L70)] and deletes the "_metadata" file at the same time. The likelihood of the two clean-ups to interlock is small, so this could be a reason why this doesn't come up often.

Two options to solve this would be to:
 * Wait to give the Checkpoint cleanup initiated by the Flink cluster more time.
 * Ignore the exception thrown by the cleanup initiated by the test

> TaskManagerProcessFailureStreamingRecoveryITCase>AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure fails on AZP
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-25564
>                 URL: https://issues.apache.org/jira/browse/FLINK-25564
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Till Rohrmann
>            Assignee: Niklas Semmler
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.15.0
>
>
> The test {{TaskManagerProcessFailureStreamingRecoveryITCase>AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure}} fails on AZP with
> {code}
> Jan 07 05:07:22 [ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 31.057 s <<< FAILURE! - in org.apache.flink.test.recovery.TaskManagerProcessFailureStreamingRecoveryITCase
> Jan 07 05:07:22 [ERROR] org.apache.flink.test.recovery.TaskManagerProcessFailureStreamingRecoveryITCase.testTaskManagerProcessFailure  Time elapsed: 31.012 s  <<< FAILURE!
> Jan 07 05:07:22 java.lang.AssertionError: The program encountered a IOExceptionList : /tmp/junit2133275241637829858/junit7793757951823298127
> Jan 07 05:07:22 	at org.junit.Assert.fail(Assert.java:89)
> Jan 07 05:07:22 	at org.apache.flink.test.recovery.AbstractTaskManagerProcessFailureRecoveryTest.testTaskManagerProcessFailure(AbstractTaskManagerProcessFailureRecoveryTest.java:205)
> Jan 07 05:07:22 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> Jan 07 05:07:22 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Jan 07 05:07:22 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Jan 07 05:07:22 	at java.lang.reflect.Method.invoke(Method.java:498)
> Jan 07 05:07:22 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> Jan 07 05:07:22 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> Jan 07 05:07:22 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> Jan 07 05:07:22 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> Jan 07 05:07:22 	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> Jan 07 05:07:22 	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> Jan 07 05:07:22 	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> Jan 07 05:07:22 	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> Jan 07 05:07:22 	at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> Jan 07 05:07:22 	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
> Jan 07 05:07:22 	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> Jan 07 05:07:22 	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> Jan 07 05:07:22 	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> Jan 07 05:07:22 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> Jan 07 05:07:22 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> Jan 07 05:07:22 	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> Jan 07 05:07:22 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> Jan 07 05:07:22 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> Jan 07 05:07:22 	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> Jan 07 05:07:22 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> Jan 07 05:07:22 	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> Jan 07 05:07:22 	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> Jan 07 05:07:22 	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
> Jan 07 05:07:22 	at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
> Jan 07 05:07:22 	at org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:42)
> Jan 07 05:07:22 	at org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:80)
> Jan 07 05:07:22 	at org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:72)
> Jan 07 05:07:22 	at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:107)
> Jan 07 05:07:22 	at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:88)
> Jan 07 05:07:22 	at org.junit.platform.launcher.core.EngineExecutionOrchestrator.lambda$execute$0(EngineExecutionOrchestrator.java:54)
> Jan 07 05:07:22 	at org.junit.platform.launcher.core.EngineExecutionOrchestrator.withInterceptedStreams(EngineExecutionOrchestrator.java:67)
> Jan 07 05:07:22 	at org.junit.platform.launcher.core.EngineExecutionOrchestrator.execute(EngineExecutionOrchestrator.java:52)
> Jan 07 05:07:22 	at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:114)
> Jan 07 05:07:22 	at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:86)
> Jan 07 05:07:22 	at org.junit.platform.launcher.core.DefaultLauncherSession$DelegatingLauncher.execute(DefaultLauncherSession.java:86)
> Jan 07 05:07:22 	at org.junit.platform.launcher.core.SessionPerRequestLauncher.execute(SessionPerRequestLauncher.java:53)
> Jan 07 05:07:22 	at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.execute(JUnitPlatformProvider.java:188)
> Jan 07 05:07:22 	at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:154)
> Jan 07 05:07:22 	at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:124)
> Jan 07 05:07:22 	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:428)
> Jan 07 05:07:22 	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162)
> Jan 07 05:07:22 	at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:562)
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=29069&view=logs&j=4d4a0d10-fca2-5507-8eed-c07f0bdf4887&t=7b25afdf-cc6c-566f-5459-359dc2585798&l=16224



--
This message was sent by Atlassian Jira
(v8.20.1#820001)