You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Till Rohrmann (Jira)" <ji...@apache.org> on 2020/07/02 09:16:00 UTC
[jira] [Commented] (FLINK-12382) HA + ResourceManager exception: Fencing token not set

    [ https://issues.apache.org/jira/browse/FLINK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150106#comment-17150106 ] 

Till Rohrmann commented on FLINK-12382:
---------------------------------------

Hi [~gael], the attached log files don't seem to be complete which makes it impossible to understand what was going on. In order to further debug the problem the complete logs and also the logs of the {{TaskManagers}} would be helpful. 

Moreover, it would be helpful to exactly know the environment in which you are running Flink. Recently, there has been a related issue which you can find here FLINK-18367.

> HA + ResourceManager exception: Fencing token not set
> -----------------------------------------------------
>
>                 Key: FLINK-12382
>                 URL: https://issues.apache.org/jira/browse/FLINK-12382
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.8.0
>         Environment: Same all all previous bugs filed by myself, today, but this time with HA with zetcd.
>            Reporter: Henrik
>            Priority: Major
>         Attachments: jobmanager_log, jobmanager_stdout
>
>
> I'm testing zetcd + session jobs in k8s, and testing what happens when I kill both the job-cluster and task-manager at the same time, but maintain ZK/zetcd up and running.
> Then I get this stacktrace, that's completely non-actionable for me, and also resolves itself. I expect a number of retries, and if this exception is part of the protocol signalling to retry, then it should not be printed as a log entry.
> This might be related to an older bug: [https://jira.apache.org/jira/browse/FLINK-7734]
> {code:java}
> [tm] 2019-05-01 11:32:01,641 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor            - Registration at ResourceManager failed due to an error
> [tm] java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, HardwareDescription, Time))) sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing token is null.
> [tm]     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> [tm]     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> [tm]     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> [tm]     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> [tm]     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> [tm]     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> [tm]     at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815)
> [tm]     at akka.dispatch.OnComplete.internal(Future.scala:258)
> [tm]     at akka.dispatch.OnComplete.internal(Future.scala:256)
> [tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> [tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> [tm]     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> [tm]     at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
> [tm]     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> [tm]     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> [tm]     at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
> [tm]     at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97)
> [tm]     at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982)
> [tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm]     at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446)
> [tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [tm]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [tm]     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [tm]     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [tm]     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [tm] Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, HardwareDescription, Time))) sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing token is null.
> [tm]     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
> [tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [tm]     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
> [tm]     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> [tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> [tm]     ... 9 more
> [tm] 2019-05-01 11:32:01,650 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Pausing and re-attempting registration in 10000 ms
> [tm] 2019-05-01 11:32:03,070 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - The heartbeat of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out.
> [tm] 2019-05-01 11:32:03,070 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Close JobManager connection for job 00000000000000000000000000000000.
> [tm] 2019-05-01 11:32:03,070 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally Source: Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) (a302013f150f292067cd498100dc6692).
> [tm] 2019-05-01 11:32:03,071 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) (a302013f150f292067cd498100dc6692) switched from RUNNING to FAILED.
> [tm] org.apache.flink.util.FlinkException: JobManager responsible for 00000000000000000000000000000000 lost the leadership.
> [tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182)
> [tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136)
> [tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624)
> [tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
> [tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
> [tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [tm]     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> [tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> [tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [tm]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [tm]     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [tm]     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [tm]     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out.
> [tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
> [tm]     ... 15 more
> [tm] 2019-05-01 11:32:03,071 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code Source: Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) (a302013f150f292067cd498100dc6692).
> [tm] 2019-05-01 11:32:03,085 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally user_sessions -> (Sink: sink_example_sessions, Filter, Filter) (1/1) (dbb8434fb24a04b8890520d4e59bbd25).
> [tm] 2019-05-01 11:32:03,085 INFO  org.apache.flink.runtime.taskmanager.Task                     - user_sessions -> (Sink: sink_example_sessions, Filter, Filter) (1/1) (dbb8434fb24a04b8890520d4e59bbd25) switched from RUNNING to FAILED.
> [tm] org.apache.flink.util.FlinkException: JobManager responsible for 00000000000000000000000000000000 lost the leadership.
> [tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182)
> [tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136)
> [tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624)
> [tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
> [tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
> [tm]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [tm]     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> [tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> [tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [tm]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [tm]     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [tm]     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [tm]     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out.
> [tm]     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
> [tm]     ... 15 more
> {code}
> tm stands for taskmanager in this deployment.
> EDIT: this also happens if you just temporarily disable network routing; it never recovers on its own despite having HA configured! In this case, it's the job manager that keeps crashing.
> {code:java}
> [job] 2019-05-01 13:03:32,299 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Unhandled exception.
> [job] org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message LocalFencedMessage(aa2545e3e2ca903b1a0f331235954917, LocalRpcInvocation(requestMultipleJobDetails(Time))) sent to akka.tcp://flink@analytics-job:6123/user/dispatcher because the fencing token is null.
> [job]     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
> [job]     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [job]     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
> [job]     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> [job]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [job]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> [job]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [job]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [job]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [job]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [job]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [job]     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [job]     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [job]     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [job]     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [job] 2019-05-01 13:03:33,308 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler   - Unhandled exception.{code}
>  Then killing both the TM and JM/RM doesn't work. I let it linger for 5 minutes in the broken state, and then:
> {code:java}
> //snip
> Fatal error occurred while executing the TaskManager. Shutting it down.
> // snip
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)