Posted to user-zh@flink.apache.org by yidan zhao <hi...@gmail.com> on 2021/11/01 06:25:15 UTC

After a standalone cluster restarts and automatically recovers its jobs, a failed JobMaster for one job brings down the JM process

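As the subject says. I have run into this problem before; at that time my email asked about the cluster restarting continuously.
This time it is the same problem: the cluster keeps restarting, but after some analysis the cause appears to be what the subject describes. Here is the relevant log excerpt: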

2021-11-01 14:05:36,954 INFO  [78-cluster-io-thread-1]
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:181)
- Recovered JobGraph(jobId: dfced635fd8c224222a9cbaaf1c5054f).
2021-11-01 14:05:36,954 INFO  [78-cluster-io-thread-1]
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:125)
- Successfully recovered 1 persisted job graphs.
2021-11-01 14:05:36,962 INFO  [78-cluster-io-thread-1]
org.apache.flink.runtime.rpc.akka.AkkaRpcService.startServer(AkkaRpcService.java:232)
- Starting RPC endpoint for
org.apache.flink.runtime.dispatcher.StandaloneDispatcher at
akka://flink/user/rpc/dispatcher_1 .
2021-11-01 14:05:44,810 INFO  [94-flink-akka.actor.default-dispatcher-30]
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.start(DefaultLeaderElectionService.java:93)
- Starting DefaultLeaderElectionService with
ZooKeeperLeaderElectionDriver{leaderPath='/leader/dfced635fd8c224222a9cbaaf1c5054f/job_manager_lock'}.
2021-11-01 14:05:44,836 ERROR [94-flink-akka.actor.default-dispatcher-30]
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.onFatalError(ClusterEntrypoint.java:454)
- Fatal error occurred in the cluster entrypoint.
org.apache.flink.util.FlinkException: JobMaster for job
dfced635fd8c224222a9cbaaf1c5054f failed.
        at
org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:873)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]

As shown above, the JobGraph was recovered, leader election was started (it looks like the JobMaster's leader election service), and then the JobMaster failed.


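Given the above, I would like to know why a failed JobMaster causes the standalone JM process to fail.
The JM process is shared by all jobs, so even if one previously running job cannot be recovered after startup, the whole process should not need to go down because of it.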
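
The log excerpt in this message was hard-wrapped by the mail client, so each record spans several lines. As a reading aid, here is a minimal Python sketch that rejoins the wrapped lines into one record each, assuming every record starts with a timestamp in the format seen above. The helper name `join_wrapped_records` and the shortened sample are illustrative only and are not part of Flink.

```python
import re

# Assumption: a new record starts with a timestamp such as "2021-11-01 14:05:36,954".
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}(?:\s|$)")


def join_wrapped_records(text):
    """Merge continuation lines into the log record that precedes them."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information here
        if RECORD_START.match(line) or not records:
            records.append(line)  # start of a new record
        else:
            records[-1] += " " + line  # wrapped continuation of the last record
    return records


# Shortened sample copied from the excerpt above.
sample = """\
2021-11-01 14:05:36,954 INFO  [78-cluster-io-thread-1]
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:181)
- Recovered JobGraph(jobId: dfced635fd8c224222a9cbaaf1c5054f).
2021-11-01 14:05:44,836 ERROR [94-flink-akka.actor.default-dispatcher-30]
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.onFatalError(ClusterEntrypoint.java:454)
- Fatal error occurred in the cluster entrypoint.
org.apache.flink.util.FlinkException: JobMaster for job
dfced635fd8c224222a9cbaaf1c5054f failed.
"""

for record in join_wrapped_records(sample):
    date, time, level = record.split()[:3]
    # Print the time, the level, and the start of the text after " - ".
    print(time, level, record.split(" - ", 1)[-1][:60])
```

With the records rejoined, the order of events (JobGraph recovered, leader election started, fatal error) can be read off one line per record.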

Re: After a standalone cluster restarts and automatically recovers its jobs, a failed JobMaster for one job brings down the JM process

Posted by yidan zhao <hi...@gmail.com>.
Here is a more complete log:
....
2021-11-01 14:15:15,849 INFO  [78-cluster-io-thread-1]
org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:181)
- Recovered JobGraph(jobId: dfced635fd8c224222a9cbaaf1c5054f).
2021-11-01 14:15:15,849 INFO  [78-cluster-io-thread-1]
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:125)
- Successfully recovered 1 persisted job graphs.
2021-11-01 14:15:15,856 INFO  [78-cluster-io-thread-1]
org.apache.flink.runtime.rpc.akka.AkkaRpcService.startServer(AkkaRpcService.java:232)
- Starting RPC endpoint for
org.apache.flink.runtime.dispatcher.StandaloneDispatcher at
akka://flink/user/rpc/dispatcher_1 .
2021-11-01 14:15:22,867 INFO  [30-flink-akka.actor.default-dispatcher-3]
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.start(DefaultLeaderElectionService.java:93)
- Starting DefaultLeaderElectionService with
ZooKeeperLeaderElectionDriver{leaderPath='/leader/dfced635fd8c224222a9cbaaf1c5054f/job_manager_lock'}.

2021-11-01 14:15:22,892 ERROR [30-flink-akka.actor.default-dispatcher-3]
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.onFatalError(ClusterEntrypoint.java:454)
- Fatal error occurred in the cluster entrypoint.
org.apache.flink.util.FlinkException: JobMaster for job
dfced635fd8c224222a9cbaaf1c5054f failed.
at
org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:873)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:459)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$3(Dispatcher.java:418)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822)
~[?:1.8.0_152]
at
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797)
~[?:1.8.0_152]
at
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
~[?:1.8.0_152]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
Caused by: org.apache.flink.runtime.jobmaster.JobNotFinishedException: The
job (dfced635fd8c224222a9cbaaf1c5054f) has not been finished.
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.jobAlreadyDone(JobMasterServiceLeadershipRunner.java:288)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.verifyJobSchedulingStatusAndCreateJobMasterServiceProcess(JobMasterServiceLeadershipRunner.java:276)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$null$8(JobMasterServiceLeadershipRunner.java:262)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
org.apache.flink.util.function.ThrowingRunnable.lambda$unchecked$0(ThrowingRunnable.java:49)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.runIfValidLeader(JobMasterServiceLeadershipRunner.java:496)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$startJobMasterServiceProcessAsync$9(JobMasterServiceLeadershipRunner.java:258)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705)
~[?:1.8.0_152]
at
java.util.concurrent.CompletableFuture.uniRunStage(CompletableFuture.java:717)
~[?:1.8.0_152]
at
java.util.concurrent.CompletableFuture.thenRun(CompletableFuture.java:2010)
~[?:1.8.0_152]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.startJobMasterServiceProcessAsync(JobMasterServiceLeadershipRunner.java:256)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.lambda$grantLeadership$7(JobMasterServiceLeadershipRunner.java:249)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.runIfStateRunning(JobMasterServiceLeadershipRunner.java:464)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.grantLeadership(JobMasterServiceLeadershipRunner.java:248)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.onGrantLeadership(DefaultLeaderElectionService.java:211)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver.isLeader(ZooKeeperLeaderElectionDriver.java:166)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:693)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$9.apply(LeaderLatch.java:689)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.setLeadership(LeaderLatch.java:688)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.checkLeadership(LeaderLatch.java:567)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch.access$700(LeaderLatch.java:65)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.recipes.leader.LeaderLatch$7.processResult(LeaderLatch.java:618)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.sendToBackgroundCallback(CuratorFrameworkImpl.java:883)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:653)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.WatcherRemovalFacade.processBackgroundOperation(WatcherRemovalFacade.java:152)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
at
org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.GetChildrenBuilderImpl$2.processResult(GetChildrenBuilderImpl.java:187)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:601)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
at
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0]
2021-11-01 14:15:22,896 INFO  [27-StandaloneSessionClusterEntrypoint
shutdown hook]
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.shutDownAsync(ClusterEntrypoint.java:481)
- Shutting StandaloneSessionClusterEntrypoint down with application status
UNKNOWN. Diagnostics Cluster entrypoint has been closed externally..
2021-11-01 14:15:22,897 INFO  [27-StandaloneSessionClusterEntrypoint
shutdown hook]
org.apache.flink.runtime.rest.RestServerEndpoint.closeAsync(RestServerEndpoint.java:309)
- Shutting down rest endpoint.
2021-11-01 14:15:22,923 INFO  [52-BlobServer shutdown hook]
org.apache.flink.runtime.blob.BlobServer.close(BlobServer.java:345) -
Stopped BLOB server at 0.0.0.0:41066
2021-11-01 14:15:22,937 INFO  [100-ForkJoinPool.commonPool-worker-22]
org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.lambda$shutDownInternal$5(WebMonitorEndpoint.java:964)
- Removing cache directory
/tmp/flink-web-85060404-ac4d-44ff-8ffe-bc2235ff0acf/flink-web-ui
2021-11-01 14:15:22,937 INFO  [100-ForkJoinPool.commonPool-worker-22]
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:101)
- Stopping DefaultLeaderElectionService.
2021-11-01 14:15:22,938 INFO  [100-ForkJoinPool.commonPool-worker-22]
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver.close(ZooKeeperLeaderElectionDriver.java:132)
- Closing
ZooKeeperLeaderElectionDriver{leaderPath='/leader/rest_server_lock'}
2021-11-01 14:15:22,943 INFO  [100-ForkJoinPool.commonPool-worker-22]
org.apache.flink.runtime.rest.RestServerEndpoint.lambda$closeAsync$1(RestServerEndpoint.java:317)
- Shut down complete.
2021-11-01 14:15:22,943 INFO  [100-ForkJoinPool.commonPool-worker-22]
org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent.closeAsyncInternal(DispatcherResourceManagerComponent.java:162)
- Closing components.
2021-11-01 14:15:22,943 INFO  [100-ForkJoinPool.commonPool-worker-22]
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService.stop(DefaultLeaderRetrievalService.java:106)
- Stopping DefaultLeaderRetrievalService.
2021-11-01 14:15:22,943 INFO  [100-ForkJoinPool.commonPool-worker-22]
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver.close(ZooKeeperLeaderRetrievalDriver.java:108)
- Closing
ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/dispatcher_lock'}.
2021-11-01 14:15:22,943 INFO  [100-ForkJoinPool.commonPool-worker-22]
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService.stop(DefaultLeaderRetrievalService.java:106)
- Stopping DefaultLeaderRetrievalService.
2021-11-01 14:15:22,943 INFO  [100-ForkJoinPool.commonPool-worker-22]
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver.close(ZooKeeperLeaderRetrievalDriver.java:108)
- Closing
ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/resource_manager_lock'}.
2021-11-01 14:15:22,943 INFO  [100-ForkJoinPool.commonPool-worker-22]
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:101)
- Stopping DefaultLeaderElectionService.
2021-11-01 14:15:22,943 INFO  [100-ForkJoinPool.commonPool-worker-22]
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver.close(ZooKeeperLeaderElectionDriver.java:132)
- Closing
ZooKeeperLeaderElectionDriver{leaderPath='/leader/dispatcher_lock'}
2021-11-01 14:15:22,944 INFO  [100-ForkJoinPool.commonPool-worker-22]
org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.closeInternal(AbstractDispatcherLeaderProcess.java:134)
- Stopping SessionDispatcherLeaderProcess.
2021-11-01 14:15:22,945 INFO  [29-flink-akka.actor.default-dispatcher-2]
org.apache.flink.runtime.util.OperaInstanceMigrateManager.stopMigrateCheck(OperaInstanceMigrateManager.java:179)
- Start to stop Migrate check...
2021-11-01 14:15:22,945 INFO  [29-flink-akka.actor.default-dispatcher-2]
org.apache.flink.runtime.util.OperaInstanceMigrateManager.stopMigrateCheck(OperaInstanceMigrateManager.java:184)
- Start to stop jmHeartbeat report...
2021-11-01 14:15:22,946 INFO  [29-flink-akka.actor.default-dispatcher-2]
org.apache.flink.runtime.util.OperaInstanceMigrateManager.stopMigrateCheck(OperaInstanceMigrateManager.java:189)
- Shutdown executorService...
2021-11-01 14:15:22,946 INFO  [29-flink-akka.actor.default-dispatcher-2]
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager.close(DeclarativeSlotManager.java:240)
- Closing the slot manager.
2021-11-01 14:15:22,947 INFO  [29-flink-akka.actor.default-dispatcher-2]
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager.suspend(DeclarativeSlotManager.java:212)
- Suspending the slot manager.
2021-11-01 14:15:22,950 INFO  [29-flink-akka.actor.default-dispatcher-2]
org.apache.flink.runtime.leaderelection.DefaultLeaderElectionService.stop(DefaultLeaderElectionService.java:101)
- Stopping DefaultLeaderElectionService.
2021-11-01 14:15:22,950 INFO  [29-flink-akka.actor.default-dispatcher-2]
org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver.close(ZooKeeperLeaderElectionDriver.java:132)
- Closing
ZooKeeperLeaderElectionDriver{leaderPath='/leader/resource_manager_lock'}
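
In the full log above, the outer `FlinkException` ("JobMaster for job ... failed.") wraps a `JobNotFinishedException` thrown from `JobMasterServiceLeadershipRunner.jobAlreadyDone`. Because the trace is long and hard-wrapped, that root cause is easy to miss. The following is a minimal Python sketch for pulling the exception chain out of such a trace. The function name `exception_chain` and the regular expression are assumptions made for illustration; the pattern is a heuristic tuned to the wrapping seen here, not a general Java stack-trace parser.

```python
import re

# Heuristic: a Java class name ending in Exception or Error, then ": message",
# where the message runs until the first stack frame (" at some.Class.method(").
EXCEPTION = re.compile(
    r"([\w.$]+(?:Exception|Error)): (.*?)(?=\s+at\s+[\w.$<>]+\(|$)"
)


def exception_chain(text):
    """Return [(class_name, message), ...] from the outermost exception to the root cause."""
    # Undo the mail client's hard wrapping by flattening everything onto one line.
    flat = " ".join(line.strip() for line in text.splitlines() if line.strip())
    return EXCEPTION.findall(flat)


# Shortened sample copied from the log above.
sample = """\
org.apache.flink.util.FlinkException: JobMaster for job
dfced635fd8c224222a9cbaaf1c5054f failed.
at
org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:873)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
Caused by: org.apache.flink.runtime.jobmaster.JobNotFinishedException: The
job (dfced635fd8c224222a9cbaaf1c5054f) has not been finished.
at
org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner.jobAlreadyDone(JobMasterServiceLeadershipRunner.java:288)
~[flink-dist_2.11-1.13.0.1-sc-SNAPSHOT.jar:1.13.0.1-sc-SNAPSHOT]
"""

for depth, (cls, message) in enumerate(exception_chain(sample)):
    # Indent each cause one level deeper than the exception that wraps it.
    print("  " * depth + cls.rsplit(".", 1)[-1] + ": " + message)
```

The last entry of the returned list is the innermost cause, which is usually the most useful line to quote when asking about a failure like this one.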

