Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2018/10/02 12:12:00 UTC
[jira] [Comment Edited] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader

    [ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635360#comment-16635360 ] 

Till Rohrmann edited comment on FLINK-10475 at 10/2/18 12:11 PM:
-----------------------------------------------------------------

Hi [~Jamalarm], this sounds as if ZooKeeper did not notice that one of the JMs had been killed. Thus, it could simply be a ZooKeeper setup problem.

In order to further debug the problem, it would be helpful to get the logs of the JobManagers.

The error messages originate from the REST handlers and are not a critical problem.
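
As a rough way to rule out the ZooKeeper side, one could ask each ZooKeeper server for its role using the "srvr" four-letter command. The following is only a sketch under assumptions: the function names and any host names passed in are hypothetical, the servers are assumed to listen on the client port you pass, and four-letter commands are assumed to be enabled (newer ZooKeeper versions require whitelisting them via 4lw.commands.whitelist).

```python
import socket


def four_letter_word(host, port, command, timeout=3.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_reply):
    """Extract the 'Mode:' field (e.g. leader, follower) from a 'srvr' reply."""
    for line in srvr_reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None  # e.g. the server is not currently serving requests


def looks_healthy(modes):
    """True if there is exactly one leader and every other server is a follower."""
    return modes.count("leader") == 1 and all(
        mode in ("leader", "follower") for mode in modes
    )


def check_ensemble(servers):
    """Query every (host, port) pair; unreachable servers are reported as None."""
    modes = []
    for host, port in servers:
        try:
            modes.append(parse_mode(four_letter_word(host, port, "srvr")))
        except OSError:
            modes.append(None)
    return modes
```

For a three-node quorum, calling check_ensemble with the three (host, port) pairs and passing the result to looks_healthy gives a quick first indication. If a server is unreachable or the quorum has no leader, the JobManagers' leader election cannot work reliably either. This does not replace looking at the JobManager logs.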


> Standalone HA - Leader election is not triggered on loss of leader
> ------------------------------------------------------------------
>
>                 Key: FLINK-10475
>                 URL: https://issues.apache.org/jira/browse/FLINK-10475
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.5.4
>            Reporter: Thomas Wozniakowski
>            Priority: Blocker
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4. Happy to see that the issue of jobgraphs hanging around forever has been resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got stuck.
> The logs of the remaining job managers were full of this:
> {quote}
> 2018-10-01 15:35:44,558 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not retrieve the redirect address.
> java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
> 	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> 	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> 	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> 	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> 	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> 	at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
> 	at akka.dispatch.OnComplete.internal(Future.scala:258)
> 	at akka.dispatch.OnComplete.internal(Future.scala:256)
> 	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> 	at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> 	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 	at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
> 	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> 	at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> 	at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
> 	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
> 	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 	at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
> 	at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
> 	at java.lang.Thread.run(Thread.java:745)
> {quote}
> Please give me a shout if I can provide any more useful information


