You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Stephan Ewen (JIRA)" <ji...@apache.org> on 2018/02/28 10:25:00 UTC

[jira] [Created] (FLINK-8803) Mini Cluster Shutdown with HA unstable, causing test failures

Stephan Ewen created FLINK-8803:
-----------------------------------

             Summary: Mini Cluster Shutdown with HA unstable, causing test failures
                 Key: FLINK-8803
                 URL: https://issues.apache.org/jira/browse/FLINK-8803
             Project: Flink
          Issue Type: Bug
          Components: Tests
            Reporter: Stephan Ewen


When the Mini Cluster is created for HA tests with ZooKeeper, the shutdown is unstable.

It looks like ZooKeeper may be shut down before the JobManager is shut down, causing the shutdown procedure of the JobManager (specifically {{ZooKeeperSubmittedJobGraphStore.removeJobGraph}}) to block until tests time out.

Full log: https://api.travis-ci.org/v3/job/346853707/log.txt

Note that no ZK threads are alive any more, seems ZK is shut down already.

Relevant Stack Traces:

{code}

"main" #1 prio=5 os_prio=0 tid=0x00007f973800a800 nid=0x43b4 waiting on condition [0x00007f973eb0b000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x000000008966cf18> (a scala.concurrent.impl.Promise$CompletionLatch)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
	at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:212)
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:222)
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:157)
	at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
	at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
	at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
	at scala.concurrent.Await$.ready(package.scala:169)
	at org.apache.flink.runtime.minicluster.FlinkMiniCluster.startInternalShutdown(FlinkMiniCluster.scala:469)
	at org.apache.flink.runtime.minicluster.FlinkMiniCluster.stop(FlinkMiniCluster.scala:435)
	at org.apache.flink.runtime.minicluster.FlinkMiniCluster.closeAsync(FlinkMiniCluster.scala:719)
	at org.apache.flink.test.util.MiniClusterResource.after(MiniClusterResource.java:104)
	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:50)
...
{code}

{code}
"flink-akka.actor.default-dispatcher-2" #1012 prio=5 os_prio=0 tid=0x00007f97394fa800 nid=0x3328 waiting on condition [0x00007f971db29000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x0000000087f82a70> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
	at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
	at org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient.internalBlockUntilConnectedOrTimedOut(CuratorZookeeperClient.java:336)
	at org.apache.flink.shaded.curator.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
	at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.DeleteBuilderImpl.pathInForeground(DeleteBuilderImpl.java:241)
	at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:225)
	at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.DeleteBuilderImpl.forPath(DeleteBuilderImpl.java:35)
	at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.release(ZooKeeperStateHandleStore.java:478)
	at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.releaseAndTryRemove(ZooKeeperStateHandleStore.java:435)
	at org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore.releaseAndTryRemove(ZooKeeperStateHandleStore.java:405)
	at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.removeJobGraph(ZooKeeperSubmittedJobGraphStore.java:266)
	- locked <0x00000000807f4258> (a java.lang.Object)
	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$1.apply$mcV$sp(JobManager.scala:1727)
	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$1.apply(JobManager.scala:1723)
	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$1.apply(JobManager.scala:1723)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)