You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Jun Qin (Jira)" <ji...@apache.org> on 2021/10/14 09:04:00 UTC

[jira] [Commented] (FLINK-24543) Zookeeper connection issue causes inconsistent state in Flink

    [ https://issues.apache.org/jira/browse/FLINK-24543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428710#comment-17428710 ] 

Jun Qin commented on FLINK-24543:
---------------------------------

One thought after discussed with [~trohrmann]: the reason for the KeeperException$NodeExistsException may be due to the zookeeper client's retries as it might not get the response from the zookeeper server in time. This should be handled correctly by curator transactions but it seems not the case. 

> Zookeeper connection issue causes inconsistent state in Flink
> -------------------------------------------------------------
>
>                 Key: FLINK-24543
>                 URL: https://issues.apache.org/jira/browse/FLINK-24543
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.13.2
>            Reporter: Jun Qin
>            Priority: Major
>
> Env: Flink 1.13.2 with Zookeeper HA
> Here is what happened:
> {code:bash}
> # checkpoint 1116 was triggered
> 2021-10-09 00:16:49,555 [Checkpoint Timer] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 1116 (type=CHECKPOINT) @ 1633738609538 for job a8a4fb85b681a897ba118db64333c9e5.
> # a few seconds later, zookeeper connection suspended, it turned out to be a disk issue at zookeeper side caused slow fsync and commit)
> 2021-10-09 00:16:58,563 [Curator-ConnectionStateManager-0] WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> 2021-10-09 00:16:58,563 [Curator-ConnectionStateManager-0] WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver [] - Connection to ZooKeeper suspended. The contender LeaderContender: DefaultDispatcherRunner no longer participates in the leader election.
> # job was switching to suspended
> 2021-10-09 00:16:58,564 [flink-akka.actor.default-dispatcher-61] INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Disconnect job manager b79b79fe513fb5f47e7bf447b7d9448c@akka.tcp://flink@flink-...-jobmanager:50010/user/rpc/jobmanager_3 for job a8a4fb85b681a897ba118db64333c9e5 from the resource manager.
> 2021-10-09 00:16:58,565 [flink-akka.actor.default-dispatcher-92] INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager b79b79fe513fb5f47e7bf447b7d9448c@akka.tcp://flink@flink-...-jobmanager:50010/user/rpc/jobmanager_3 for job a8a4fb85b681a897ba118db64333c9e5.
> 2021-10-09 00:16:58,565 [flink-akka.actor.default-dispatcher-90] INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Stopping the JobMaster for job Flink ...(a8a4fb85b681a897ba118db64333c9e5).
> 2021-10-09 00:16:58,567 [flink-akka.actor.default-dispatcher-90] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job Flink ... (a8a4fb85b681a897ba118db64333c9e5) switched from state RUNNING to SUSPENDED.
> 2021-10-09 00:16:58,570 [flink-akka.actor.default-dispatcher-86] INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Closing ZookeeperLeaderRetrievalDriver{retrievalPath='/leader/a8a4fb85b681a897ba118db64333c9e5/job_manager_lock'}.
> 2021-10-09 00:16:58,667 [flink-akka.actor.default-dispatcher-92] INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job a8a4fb85b681a897ba118db64333c9e5 reached terminal state SUSPENDED.
> # zookeeper connector restored
> 2021-10-09 00:17:08,225 [Curator-ConnectionStateManager-0] INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver [] - Connection to ZooKeeper was reconnected. Leader election can be restarted.
> # received checkpoint acknowledgement, trying to finalize it, then failed to add to zookeeper due to KeeperException$NodeExistsException
> 2021-10-09 00:17:14,382 [flink-akka.actor.default-dispatcher-90] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: ... (1/5) (09d25852e3e206d6b7fe0d6bc965870f) switched from RUNNING to CANCELING.
> 2021-10-09 00:17:14,382 [jobmanager-future-thread-1] WARN  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Error while processing AcknowledgeCheckpoint message
> org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete the pending checkpoint 1116. Failure reason: Failure to finalize checkpoint.
> 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1227) 
> 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:1072) 
> 	at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$acknowledgeCheckpoint$1(ExecutionGraphHandler.java:89) 
> 	at org.apache.flink.runtime.scheduler.ExecutionGraphHandler.lambda$processCheckpointCoordinatorMessage$3(ExecutionGraphHandler.java:119) 
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
> 	at java.lang.Thread.run(Thread.java:834) [?:?]
> Caused by: org.apache.flink.runtime.persistence.StateHandleStore$AlreadyExistException: ZooKeeper node /0000000000000001116 already exists.
> 	at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.lambda$addAndLock$0(ZooKeeperStateHandleStore.java:179) 
> 	at java.util.Optional.map(Optional.java:265) ~[?:?]
> 	at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.addAndLock(ZooKeeperStateHandleStore.java:177) 
> 	at org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.addCheckpoint(DefaultCompletedCheckpointStore.java:182) 
> 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1209) 
> 	... 9 more
> Caused by: org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
> 	at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:122) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0-stream1]
> 	at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:1015) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0-stream1]
> 	at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:919) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0-stream1]
> 	at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorTransactionImpl.doOperation(CuratorTransactionImpl.java:197) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0-stream1]
> 	at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorTransactionImpl.access$000(CuratorTransactionImpl.java:37) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0-stream1]
> 	at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:130) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0-stream1]
> 	at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorTransactionImpl$2.call(CuratorTransactionImpl.java:126) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0-stream1]
> 	at org.apache.flink.shaded.curator4.org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0-stream1]
> 	at org.apache.flink.shaded.curator4.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0-stream1]
> 	at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorTransactionImpl.commit(CuratorTransactionImpl.java:123) ~[flink-shaded-zookeeper-3.4.14.jar:3.4.14-13.0-stream1]
> 	at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.writeStoreHandleTransactionally(ZooKeeperStateHandleStore.java:204) 
> 	at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.addAndLock(ZooKeeperStateHandleStore.java:165) 
> 	at org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.addCheckpoint(DefaultCompletedCheckpointStore.java:182) 
> 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1209) 
> ... 9 more
> # checkpoint coordinator was stopping
> 2021-10-09 00:17:14,385 [flink-akka.actor.default-dispatcher-90] INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Stopping checkpoint coordinator for job a8a4fb85b681a897ba118db64333c9e5.
> 2021-10-09 00:17:14,401 [flink-akka.actor.default-dispatcher-90] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job a8a4fb85b681a897ba118db64333c9e5 has been suspended.
> # clean up
> 2021-10-09 00:17:14,403 [AkkaRpcService-Supervisor-Termination-Future-Executor-thread-1] INFO  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionDriver [] - Closing ZooKeeperLeaderElectionDriver{leaderPath='/leader/a8a4fb85b681a897ba118db64333c9e5/job_manager_lock'}
> 2021-10-09 00:17:14,404 [cluster-io-thread-2] INFO  org.apache.flink.runtime.jobmanager.DefaultJobGraphStore     [] - Released job graph a8a4fb85b681a897ba118db64333c9e5 from ZooKeeperStateHandleStore{namespace='flink/flink-.../jobgraphs'}.
> # however, during recovery, checkpoint 1116 was found in zookeeper, but the metadata file /mnt/flink/.flink/ha/flink-.../completedCheckpoint42683d1121c7 was cleaned up due to the KeeperException$NodeExistsException happened before
> 2021-10-09 00:18:18,678 [jobmanager-future-thread-1] INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Recovering checkpoints from ZooKeeperStateHandleStore{namespace='flink/flink-.../checkpoints/a8a4fb85b681a897ba118db64333c9e5'}.
> 2021-10-09 00:18:18,686 [jobmanager-future-thread-1] INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Found 4 checkpoints in ZooKeeperStateHandleStore{namespace='flink/flink-.../checkpoints/a8a4fb85b681a897ba118db64333c9e5'}.
> 2021-10-09 00:18:18,686 [jobmanager-future-thread-1] INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 4 checkpoints from storage.
> 2021-10-09 00:18:18,686 [jobmanager-future-thread-1] INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 1113.
> 2021-10-09 00:18:18,689 [jobmanager-future-thread-1] INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 1114.
> 2021-10-09 00:18:18,691 [jobmanager-future-thread-1] INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 1115.
> 2021-10-09 00:18:18,693 [jobmanager-future-thread-1] INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 1116.
> 2021-10-09 00:18:18,700 [flink-akka.actor.default-dispatcher-18] ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error occurred in the cluster entrypoint.
> org.apache.flink.util.FlinkException: JobMaster for job a8a4fb85b681a897ba118db64333c9e5 failed.
> 	at org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:873) 
> 	at org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:459) 
> 	at org.apache.flink.runtime.dispatcher.Dispatcher.handleJobManagerRunnerResult(Dispatcher.java:436) 
> 	at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$3(Dispatcher.java:415) 
> 	at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) ~[?:?]
> 	at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) ~[?:?]
> 	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) ~[?:?]
> 	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) 
> 	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) 
> 	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) 
> 	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) 
> 	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 
> 	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 
> 	at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) 
> 	at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) 
> 	at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 
> 	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) 
> 	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) 
> 	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) 
> 	at akka.actor.Actor.aroundReceive(Actor.scala:517) 
> 	at akka.actor.Actor.aroundReceive$(Actor.scala:515) 
> 	at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) 
> 	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) 
> 	at akka.actor.ActorCell.invoke(ActorCell.scala:561) 
> 	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) 
> 	at akka.dispatch.Mailbox.run(Mailbox.scala:225) 
> 	at akka.dispatch.Mailbox.exec(Mailbox.scala:235) 
> 	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
> 	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
> 	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
> 	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 
> Caused by: org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
> 	at org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97) 
> 	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
> 	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
> 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
> 	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705) ~[?:?]
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
> 	at java.lang.Thread.run(Thread.java:834) ~[?:?]
> Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve checkpoint 1116 from state handle under /0000000000000001116. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
> 	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) ~[?:?]
> 	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) ~[?:?]
> 	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702) ~[?:?]
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
> 	at java.lang.Thread.run(Thread.java:834) ~[?:?]
> Caused by: java.lang.RuntimeException: org.apache.flink.util.FlinkException: Could not retrieve checkpoint 1116 from state handle under /0000000000000001116. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
> 	at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:316) 
> 	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:114) 
> 	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
> 	at java.lang.Thread.run(Thread.java:834) ~[?:?]
> Caused by: org.apache.flink.util.FlinkException: Could not retrieve checkpoint 1116 from state handle under /0000000000000001116. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
> 	at org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.retrieveCompletedCheckpoint(DefaultCompletedCheckpointStore.java:309) 
> 	at org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.recover(DefaultCompletedCheckpointStore.java:151) 
> 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedStateInternal(CheckpointCoordinator.java:1513) 
> 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreInitialCheckpointIfPresent(CheckpointCoordinator.java:1476) 
> 	at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:134) 
> 	at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:342) 
> 	at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:190) 
> 	at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:122) 
> 	at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:132) 
> 	at org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:110) 
> 	at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:340) 
> 	at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:317) 
> 	at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:107) 
> 	at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:95) 
> 	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112) 
> 	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
> 	at java.lang.Thread.run(Thread.java:834) ~[?:?]
> Caused by: java.io.FileNotFoundException: /mnt/flink/.flink/ha/flink-.../completedCheckpoint42683d1121c7 (No such file or directory)
> 	at java.io.FileInputStream.open0(Native Method) ~[?:?]
> 	at java.io.FileInputStream.open(FileInputStream.java:219) ~[?:?]
> 	at java.io.FileInputStream.<init>(FileInputStream.java:157) ~[?:?]
> 	at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50) 
> 	at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134) 
> 	at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68) 
> 	at org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:66) 
> 	at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:58) 
> 	at org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.retrieveCompletedCheckpoint(DefaultCompletedCheckpointStore.java:298) 
> 	at org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.recover(DefaultCompletedCheckpointStore.java:151) 
> 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedStateInternal(CheckpointCoordinator.java:1513) 
> 	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreInitialCheckpointIfPresent(CheckpointCoordinator.java:1476) 
> 	at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:134) 
> 	at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:342) 
> 	at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:190) 
> 	at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:122) 
> 	at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:132) 
> 	at org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:110) 
> 	at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:340) 
> 	at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:317) 
> 	at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:107) 
> 	at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:95) 
> 	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112) 
> 	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
> 	at java.lang.Thread.run(Thread.java:834) ~[?:?]
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)