Posted to issues@flink.apache.org by "Arvid Heise (Jira)" <ji...@apache.org> on 2020/10/12 19:15:00 UTC

[jira] [Commented] (FLINK-19585) UnalignedCheckpointCompatibilityITCase.test:97->runAndTakeSavepoint: "Not all required tasks are currently running."

    [ https://issues.apache.org/jira/browse/FLINK-19585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212608#comment-17212608 ] 

Arvid Heise commented on FLINK-19585:
-------------------------------------

This is hard to debug because the logs reveal very little. I added `TestLogger` in a hotfix commit; let's dig in if the failure shows up again.

Initial suspicions:
 * The job is stopped with a savepoint, but a final checkpoint is still triggered (possibly queued in the coordinator) after the tasks have already completed.
 * The job is stopped with a savepoint, but the tasks are somehow stopped before the savepoint actually starts (an ongoing unaligned checkpoint?).
 * The job has not actually started yet (the test seems to rely on a sleep, which is of course not very foolproof).
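
For the third suspicion, one possible direction is to replace the fixed sleep with a bounded poll that only proceeds once the job reports all tasks as running. The following is a minimal sketch in plain Java, not the actual test code: the class `WaitUntil`, the method `waitUntil`, and the `allTasksRunning` condition are hypothetical names for illustration, and the real check against the cluster would have to be supplied by the test.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

final class WaitUntil {

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition held, false if the timeout was reached first.
     */
    static boolean waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval)
            throws InterruptedException {
        final long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(pollInterval.toMillis());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical stand-in for "all tasks are RUNNING": becomes true after about 200 ms.
        final long readyAt = System.nanoTime() + Duration.ofMillis(200).toNanos();
        BooleanSupplier allTasksRunning = () -> System.nanoTime() - readyAt >= 0;

        boolean ready = waitUntil(allTasksRunning, Duration.ofSeconds(5), Duration.ofMillis(20));
        System.out.println(ready ? "tasks running, safe to trigger savepoint" : "timed out");
    }
}
```

With a helper like this, the test would fail with a clear timeout instead of triggering the savepoint against a job that is not yet fully deployed. It would not address the first two suspicions, which concern races inside the coordinator.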

> UnalignedCheckpointCompatibilityITCase.test:97->runAndTakeSavepoint: "Not all required tasks are currently running."
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-19585
>                 URL: https://issues.apache.org/jira/browse/FLINK-19585
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.0
>            Reporter: Robert Metzger
>            Priority: Critical
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=7419&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=f508e270-48d6-5f1e-3138-42a17e0714f0
> {code}
> 2020-10-12T10:27:51.7667213Z [ERROR] Tests run: 4, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 13.146 s <<< FAILURE! - in org.apache.flink.test.checkpointing.UnalignedCheckpointCompatibilityITCase
> 2020-10-12T10:27:51.7675454Z [ERROR] test[type: SAVEPOINT, startAligned: false](org.apache.flink.test.checkpointing.UnalignedCheckpointCompatibilityITCase)  Time elapsed: 2.168 s  <<< ERROR!
> 2020-10-12T10:27:51.7676759Z java.util.concurrent.ExecutionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Not all required tasks are currently running.
> 2020-10-12T10:27:51.7686572Z 	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> 2020-10-12T10:27:51.7688239Z 	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> 2020-10-12T10:27:51.7689543Z 	at org.apache.flink.test.checkpointing.UnalignedCheckpointCompatibilityITCase.runAndTakeSavepoint(UnalignedCheckpointCompatibilityITCase.java:113)
> 2020-10-12T10:27:51.7690681Z 	at org.apache.flink.test.checkpointing.UnalignedCheckpointCompatibilityITCase.test(UnalignedCheckpointCompatibilityITCase.java:97)
> 2020-10-12T10:27:51.7691513Z 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-10-12T10:27:51.7692182Z 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-10-12T10:27:51.7692964Z 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-10-12T10:27:51.7693655Z 	at java.lang.reflect.Method.invoke(Method.java:498)
> 2020-10-12T10:27:51.7694489Z 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-10-12T10:27:51.7707103Z 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-10-12T10:27:51.7729199Z 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-10-12T10:27:51.7730097Z 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-10-12T10:27:51.7730833Z 	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-10-12T10:27:51.7731500Z 	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-10-12T10:27:51.7732086Z 	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-10-12T10:27:51.7732781Z 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-10-12T10:27:51.7733563Z 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-10-12T10:27:51.7734735Z 	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-10-12T10:27:51.7735400Z 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-10-12T10:27:51.7736075Z 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-10-12T10:27:51.7736757Z 	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-10-12T10:27:51.7737432Z 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-10-12T10:27:51.7738081Z 	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-10-12T10:27:51.7739008Z 	at org.junit.runners.Suite.runChild(Suite.java:128)
> 2020-10-12T10:27:51.7739583Z 	at org.junit.runners.Suite.runChild(Suite.java:27)
> 2020-10-12T10:27:51.7740173Z 	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-10-12T10:27:51.7740800Z 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-10-12T10:27:51.7741470Z 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-10-12T10:27:51.7742150Z 	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-10-12T10:27:51.7742808Z 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-10-12T10:27:51.7743457Z 	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-10-12T10:27:51.7768250Z 	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> 2020-10-12T10:27:51.7769287Z 	at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> 2020-10-12T10:27:51.7770227Z 	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> 2020-10-12T10:27:51.7771168Z 	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> 2020-10-12T10:27:51.7772013Z 	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> 2020-10-12T10:27:51.7772894Z 	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> 2020-10-12T10:27:51.7773673Z 	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> 2020-10-12T10:27:51.7774734Z 	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> 2020-10-12T10:27:51.7775697Z Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointException: Not all required tasks are currently running.
> 2020-10-12T10:27:51.7776658Z 	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> 2020-10-12T10:27:51.7777468Z 	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> 2020-10-12T10:27:51.7778379Z 	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
> 2020-10-12T10:27:51.7779152Z 	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> 2020-10-12T10:27:51.7779888Z 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-10-12T10:27:51.7780806Z 	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-10-12T10:27:51.7781692Z 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$null$0(CheckpointCoordinator.java:467)
> 2020-10-12T10:27:51.7782539Z 	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> 2020-10-12T10:27:51.7783358Z 	at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
> 2020-10-12T10:27:51.7784089Z 	at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
> 2020-10-12T10:27:51.7785057Z 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$triggerSavepointInternal$1(CheckpointCoordinator.java:463)
> 2020-10-12T10:27:51.7785854Z 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 2020-10-12T10:27:51.7786452Z 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 2020-10-12T10:27:51.7787161Z 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> 2020-10-12T10:27:51.7788496Z 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> 2020-10-12T10:27:51.7789333Z 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 2020-10-12T10:27:51.7790043Z 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 2020-10-12T10:27:51.7790770Z 	at java.lang.Thread.run(Thread.java:748)
> 2020-10-12T10:27:51.7791415Z Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Not all required tasks are currently running.
> 2020-10-12T10:27:51.7792516Z 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.getTriggerExecutions(CheckpointCoordinator.java:1724)
> 2020-10-12T10:27:51.7793448Z 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.startTriggeringCheckpoint(CheckpointCoordinator.java:510)
> 2020-10-12T10:27:51.7794766Z 	at java.util.Optional.ifPresent(Optional.java:159)
> 2020-10-12T10:27:51.7795546Z 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:500)
> 2020-10-12T10:27:51.7796558Z 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.lambda$triggerSavepointInternal$1(CheckpointCoordinator.java:458)
> 2020-10-12T10:27:51.7797253Z 	... 7 more
> {code}


