You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Chesnay Schepler (Jira)" <ji...@apache.org> on 2020/01/16 14:02:00 UTC

[jira] [Comment Edited] (FLINK-15150) ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange failed on Travis

    [ https://issues.apache.org/jira/browse/FLINK-15150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016965#comment-17016965 ] 

Chesnay Schepler edited comment on FLINK-15150 at 1/16/20 2:01 PM:
-------------------------------------------------------------------

What seems to happen is that, while JobMaster is in the process of allocating slots, the test is calling {{Dispatcher#shutDownCluster}}. The dispatcher starts shutting down components, but the JobMaster is _not_ shut down first.
This results in a job failure since the slotmanager shuts down during the allocation, which fails the allocation, which fails the task, which fails the job.
Job restarts are forbidden by the restart strategy, so the job ends up in a terminal state, and subsequent dispatchers don't recover the job.

MInimized log with some additional log messages added:
{code}
...
18660 [dispatcher-5] INFO  o.a.f.r.dispatcher.StandaloneDispatcher  - Submitting job 411378b9f7181f9ba431960b08b701d7 (Blocking test job).
18660 [dispatcher-5] DEBUG o.a.f.r.jobmanager.ZooKeeperJobGraphStore  - Adding job graph 411378b9f7181f9ba431960b08b701d7 to flink/default/jobgraphs/411378b9f7181f9ba431960b08b701d7.
18670 [dispatcher-5] INFO  o.a.f.r.jobmanager.ZooKeeperJobGraphStore  - Added JobGraph(jobId: 411378b9f7181f9ba431960b08b701d7) to ZooKeeper.
18670 [dispatcher-2] INFO  o.a.f.r.rpc.akka.AkkaRpcService  - Starting RPC endpoint for o.a.f.r.jobmaster.JobMaster at akka://flink/user/jobmanager_4 .
...
18700 [main-EventThread] INFO  o.a.f.r.jobmaster.JobManagerRunnerImpl  - JobManager runner for job Blocking test job (411378b9f7181f9ba431960b08b701d7) was granted leadership with session id 3ac17f36-f3d9-4b63-ae1e-18572db0cb23 at akka://flink/user/jobmanager_4.
18700 [main-EventThread] DEBUG o.a.f.r.highavailability.zookeeper.ZooKeeperRunningJobsRegistry  - Setting scheduling state to RUNNING for job 411378b9f7181f9ba431960b08b701d7.
...
18720 [dispatcher-4] INFO  o.a.f.r.executiongraph.ExecutionGraph  - Job Blocking test job (411378b9f7181f9ba431960b08b701d7) switched from state CREATED to RUNNING.
18720 [main] INFO  org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase  - Initiated cluster shutdown.
...
18736 [dispatcher-2] INFO  o.a.f.r.jobmaster.JobMaster  - JobManager successfully registered at ResourceManager, leader id: a64a7f6db3146cf378a7d898cf6044d1.
18736 [dispatcher-2] INFO  o.a.f.r.jobmaster.slotpool.SlotPoolImpl  - Requesting new slot [SlotRequestId{7a9512bdaf255cc687212211142abd4b}] and profile ResourceProfile{UNKNOWN} from resource manager.
18736 [dispatcher-3] INFO  o.a.f.r.resourcemanager.StandaloneResourceManager  - Request slot with profile ResourceProfile{UNKNOWN} for job 411378b9f7181f9ba431960b08b701d7 with allocation id 0e5171488810f0cc1c3265e9b278bb17.
...
18740 [dispatcher-3] INFO  o.a.f.r.resourcemanager.slotmanager.SlotManagerImpl  - Suspending the SlotManager.
18740 [dispatcher-5] INFO  o.a.f.r.jobmaster.slotpool.SlotPoolImpl  - Failing pending slot request [SlotRequestId{7a9512bdaf255cc687212211142abd4b}]: The slot manager is being suspended.
...
18750 [dispatcher-5] INFO  o.a.f.r.executiongraph.ExecutionGraph  - blocking operator (1/1) (927b48379633675aad810f12af2470a6) switched from SCHEDULED to FAILED.
java.util.concurrent.CompletionException: o.a.f.r.resourcemanager.slotmanager.SlotManagerException: The slot manager is being suspended.
...
18750 [dispatcher-5] INFO  o.a.f.r.executiongraph.ExecutionGraph  - Job Blocking test job (411378b9f7181f9ba431960b08b701d7) switched from state RUNNING to FAILING.
o.a.f.r.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
Caused by: java.util.concurrent.CompletionException: o.a.f.r.resourcemanager.slotmanager.SlotManagerException: The slot manager is being suspended.
...
18760 [dispatcher-5] INFO  o.a.f.r.executiongraph.ExecutionGraph  - Job Blocking test job (411378b9f7181f9ba431960b08b701d7) switched from state FAILING to FAILED.
o.a.f.r.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
Caused by: java.util.concurrent.CompletionException: o.a.f.r.resourcemanager.slotmanager.SlotManagerException: The slot manager is being suspended.
...
18765 [jobmanager-future-thread-2] DEBUG o.a.f.r.jobmaster.JobManagerRunnerImpl  - Job 411378b9f7181f9ba431960b08b701d7 reached globally terminal state FAILED.
18765 [jobmanager-future-thread-2] DEBUG o.a.f.r.jobmaster.JobManagerRunnerImpl  - Unregistering job from 411378b9f7181f9ba431960b08b701d7 from high-availability.
18765 [jobmanager-future-thread-2] DEBUG o.a.f.r.highavailability.zookeeper.ZooKeeperRunningJobsRegistry  - Setting scheduling state to DONE for job 411378b9f7181f9ba431960b08b701d7.
...
18770 [mini-cluster-io-thread-2] INFO  o.a.f.r.jobmanager.ZooKeeperJobGraphStore  - Recovered JobGraph(jobId: 411378b9f7181f9ba431960b08b701d7).
18770 [mini-cluster-io-thread-2] INFO  o.a.f.r.dispatcher.runner.SessionDispatcherLeaderProcess  - Successfully recovered 1 persisted job graphs.
...
23748 [main-EventThread] INFO  o.a.f.r.jobmaster.JobManagerRunnerImpl  - Granted leader ship but job 411378b9f7181f9ba431960b08b701d7 has been finished.
{code}

From what I can tell the shutdown logic should be adjusted to first shutdown the JobMaster before shutting down any service the JM may be relying on.
Unfortunately the shutdown logic is one of the most obfuscated things I have seen in a while, and I can't make heads-or-tails of how it is even supposed to work.

[~trohrmann] WDYT?


was (Author: zentol):
What seems to happen is that, while JobMaster is in the process of allocating slots, the test is calling {{Dispatcher#shutDownCluster}}. The dispatcher starts shutting down components, but the JobMaster is _not_ shut down first.
This results in a job failure since the slotpool is shutdown during the allocation, which fails the allocation, which fails the task, which fails the job.
Job restarts are forbidden by the restart strategy, so the job ends up in a terminal state, and subsequent dispatchers don't recover the job.

MInimized log with some additional log messages added:
{code}
...
18660 [dispatcher-5] INFO  o.a.f.r.dispatcher.StandaloneDispatcher  - Submitting job 411378b9f7181f9ba431960b08b701d7 (Blocking test job).
18660 [dispatcher-5] DEBUG o.a.f.r.jobmanager.ZooKeeperJobGraphStore  - Adding job graph 411378b9f7181f9ba431960b08b701d7 to flink/default/jobgraphs/411378b9f7181f9ba431960b08b701d7.
18670 [dispatcher-5] INFO  o.a.f.r.jobmanager.ZooKeeperJobGraphStore  - Added JobGraph(jobId: 411378b9f7181f9ba431960b08b701d7) to ZooKeeper.
18670 [dispatcher-2] INFO  o.a.f.r.rpc.akka.AkkaRpcService  - Starting RPC endpoint for o.a.f.r.jobmaster.JobMaster at akka://flink/user/jobmanager_4 .
...
18700 [main-EventThread] INFO  o.a.f.r.jobmaster.JobManagerRunnerImpl  - JobManager runner for job Blocking test job (411378b9f7181f9ba431960b08b701d7) was granted leadership with session id 3ac17f36-f3d9-4b63-ae1e-18572db0cb23 at akka://flink/user/jobmanager_4.
18700 [main-EventThread] DEBUG o.a.f.r.highavailability.zookeeper.ZooKeeperRunningJobsRegistry  - Setting scheduling state to RUNNING for job 411378b9f7181f9ba431960b08b701d7.
...
18720 [dispatcher-4] INFO  o.a.f.r.executiongraph.ExecutionGraph  - Job Blocking test job (411378b9f7181f9ba431960b08b701d7) switched from state CREATED to RUNNING.
18720 [main] INFO  org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase  - Initiated cluster shutdown.
...
18736 [dispatcher-2] INFO  o.a.f.r.jobmaster.JobMaster  - JobManager successfully registered at ResourceManager, leader id: a64a7f6db3146cf378a7d898cf6044d1.
18736 [dispatcher-2] INFO  o.a.f.r.jobmaster.slotpool.SlotPoolImpl  - Requesting new slot [SlotRequestId{7a9512bdaf255cc687212211142abd4b}] and profile ResourceProfile{UNKNOWN} from resource manager.
18736 [dispatcher-3] INFO  o.a.f.r.resourcemanager.StandaloneResourceManager  - Request slot with profile ResourceProfile{UNKNOWN} for job 411378b9f7181f9ba431960b08b701d7 with allocation id 0e5171488810f0cc1c3265e9b278bb17.
...
18740 [dispatcher-3] INFO  o.a.f.r.resourcemanager.slotmanager.SlotManagerImpl  - Suspending the SlotManager.
18740 [dispatcher-5] INFO  o.a.f.r.jobmaster.slotpool.SlotPoolImpl  - Failing pending slot request [SlotRequestId{7a9512bdaf255cc687212211142abd4b}]: The slot manager is being suspended.
...
18750 [dispatcher-5] INFO  o.a.f.r.executiongraph.ExecutionGraph  - blocking operator (1/1) (927b48379633675aad810f12af2470a6) switched from SCHEDULED to FAILED.
java.util.concurrent.CompletionException: o.a.f.r.resourcemanager.slotmanager.SlotManagerException: The slot manager is being suspended.
...
18750 [dispatcher-5] INFO  o.a.f.r.executiongraph.ExecutionGraph  - Job Blocking test job (411378b9f7181f9ba431960b08b701d7) switched from state RUNNING to FAILING.
o.a.f.r.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
Caused by: java.util.concurrent.CompletionException: o.a.f.r.resourcemanager.slotmanager.SlotManagerException: The slot manager is being suspended.
...
18760 [dispatcher-5] INFO  o.a.f.r.executiongraph.ExecutionGraph  - Job Blocking test job (411378b9f7181f9ba431960b08b701d7) switched from state FAILING to FAILED.
o.a.f.r.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
Caused by: java.util.concurrent.CompletionException: o.a.f.r.resourcemanager.slotmanager.SlotManagerException: The slot manager is being suspended.
...
18765 [jobmanager-future-thread-2] DEBUG o.a.f.r.jobmaster.JobManagerRunnerImpl  - Job 411378b9f7181f9ba431960b08b701d7 reached globally terminal state FAILED.
18765 [jobmanager-future-thread-2] DEBUG o.a.f.r.jobmaster.JobManagerRunnerImpl  - Unregistering job from 411378b9f7181f9ba431960b08b701d7 from high-availability.
18765 [jobmanager-future-thread-2] DEBUG o.a.f.r.highavailability.zookeeper.ZooKeeperRunningJobsRegistry  - Setting scheduling state to DONE for job 411378b9f7181f9ba431960b08b701d7.
...
18770 [mini-cluster-io-thread-2] INFO  o.a.f.r.jobmanager.ZooKeeperJobGraphStore  - Recovered JobGraph(jobId: 411378b9f7181f9ba431960b08b701d7).
18770 [mini-cluster-io-thread-2] INFO  o.a.f.r.dispatcher.runner.SessionDispatcherLeaderProcess  - Successfully recovered 1 persisted job graphs.
...
23748 [main-EventThread] INFO  o.a.f.r.jobmaster.JobManagerRunnerImpl  - Granted leader ship but job 411378b9f7181f9ba431960b08b701d7 has been finished.
{code}

From what I can tell the shutdown logic should be adjusted to first shutdown the JobMaster before shutting down any service the JM may be relying on.
Unfortunately the shutdown logic is one of the most obfuscated things I have seen in a while, and I can't make heads-or-tails of how it is even supposed to work.

[~trohrmann] WDYT?

> ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange failed on Travis
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-15150
>                 URL: https://issues.apache.org/jira/browse/FLINK-15150
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.10.0
>            Reporter: Congxian Qiu(klion26)
>            Assignee: Chesnay Schepler
>            Priority: Critical
>             Fix For: 1.10.0
>
>
>  
> 06:37:18.423 [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 14.014 s <<< FAILURE! - in org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase
> 375406:37:18.423 [ERROR] testJobExecutionOnClusterWithLeaderChange(org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase) Time elapsed: 14.001 s <<< ERROR!
> 3755java.util.concurrent.ExecutionException: org.apache.flink.util.FlinkException: JobMaster has been shut down.
> 3756 at org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.lambda$testJobExecutionOnClusterWithLeaderChange$1(ZooKeeperLeaderElectionITCase.java:131)
> 3757 at org.apache.flink.test.runtime.leaderelection.ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange(ZooKeeperLeaderElectionITCase.java:131)
> 3758Caused by: org.apache.flink.util.FlinkException: JobMaster has been shut down.
> 3759
>  
> [https://travis-ci.com/flink-ci/flink/jobs/264972218]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)