You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Robert Metzger (JIRA)" <ji...@apache.org> on 2015/02/17 17:00:13 UTC

[jira] [Reopened] (FLINK-1556) JobClient does not wait until a job failed completely if submission exception

     [ https://issues.apache.org/jira/browse/FLINK-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Metzger reopened FLINK-1556:
-----------------------------------
      Assignee: Till Rohrmann

The issue hasn't been implemented/tested properly.

I'm reopening it.

I was submitting a job with a wrong directory configured. I got the following output:
{code}
./flink run -v -p 152 -c com.github.projectflink.avro.GenerateLineitems ../../testjob/flink-jobs/target/flink-jobs-0.1-SNAPSHOT.jar -p 144 -o hdfs:///user/robert/datasets/tpch100/
16:55:04,748 WARN  org.apache.hadoop.util.NativeCodeLoader                       - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found a yarn properties file (.yarn-properties) file, using "cloud-31.dima.tu-berlin.de:33806" to connect to the JobManager
Submission of job with ID 11bc00bb2a27105221a4137048e3a763 was unsuccessful, because File or directory already exists. Existing files and directories are not overwritten in NO_OVERWRITE mode. Use OVERWRITE mode to overwrite existing files and directories..
{code}
But the CliFrontend didn't stop. (I waited for more than a minute).

jstack:

{code}
$ jstack 25475
2015-02-17 16:56:43
Full thread dump Java HotSpot(TM) 64-Bit Server VM (24.76-b04 mixed mode):

"Attach Listener" daemon prio=10 tid=0x00000000013c9000 nid=0x63dc waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Hashed wheel timer #1" daemon prio=10 tid=0x0000000001375000 nid=0x63a9 waiting on condition [0x00007f3582a96000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:483)
	at org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:392)
	at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
	at java.lang.Thread.run(Thread.java:745)

"New I/O server boss #6" daemon prio=10 tid=0x00007f3584123000 nid=0x63a8 runnable [0x00007f3582b97000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
	- locked <0x000000070ac52878> (a sun.nio.ch.Util$2)
	- locked <0x000000070ac52868> (a java.util.Collections$UnmodifiableSet)
	- locked <0x000000070ac52750> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
	at org.jboss.netty.channel.socket.nio.NioServerBoss.select(NioServerBoss.java:163)
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
	at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
	at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
	at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

"New I/O worker #5" daemon prio=10 tid=0x00007f3584124800 nid=0x63a7 runnable [0x00007f3582c98000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
	- locked <0x000000070ac35e20> (a sun.nio.ch.Util$2)
	- locked <0x000000070ac35e10> (a java.util.Collections$UnmodifiableSet)
	- locked <0x000000070ac35cf8> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
	at org.jboss.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:68)
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:415)
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
	at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
	at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
	at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

"New I/O worker #4" daemon prio=10 tid=0x00007f35843f0800 nid=0x63a6 runnable [0x00007f3582d99000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
	- locked <0x000000070ac25598> (a sun.nio.ch.Util$2)
	- locked <0x000000070ac25588> (a java.util.Collections$UnmodifiableSet)
	- locked <0x000000070ac25470> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
	at org.jboss.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:68)
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:415)
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
	at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
	at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
	at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

"New I/O boss #3" daemon prio=10 tid=0x00007f3584530000 nid=0x63a5 runnable [0x00007f3582e9a000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
	- locked <0x000000070abf0888> (a sun.nio.ch.Util$2)
	- locked <0x000000070abf0878> (a java.util.Collections$UnmodifiableSet)
	- locked <0x000000070abf0760> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
	at org.jboss.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:68)
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:415)
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
	at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
	at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
	at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

"New I/O worker #2" daemon prio=10 tid=0x00007f3584187000 nid=0x63a4 runnable [0x00007f3582f9b000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
	- locked <0x000000070ab400b8> (a sun.nio.ch.Util$2)
	- locked <0x000000070ab400a8> (a java.util.Collections$UnmodifiableSet)
	- locked <0x000000070ab3ff90> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
	at org.jboss.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:68)
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:415)
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
	at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
	at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
	at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

"New I/O worker #1" daemon prio=10 tid=0x00000000013e1000 nid=0x63a3 runnable [0x00007f358309b000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
	- locked <0x000000070ab0d8d0> (a sun.nio.ch.Util$2)
	- locked <0x000000070ab0d868> (a java.util.Collections$UnmodifiableSet)
	- locked <0x000000070ab0d738> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
	at org.jboss.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:68)
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:415)
	at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
	at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
	at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
	at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

"flink-akka.remote.default-remote-dispatcher-6" daemon prio=10 tid=0x00007f358c6ef800 nid=0x63a2 waiting on condition [0x00007f358319d000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x000000070a54f350> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
	at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

"flink-akka.remote.default-remote-dispatcher-5" daemon prio=10 tid=0x00007f3584184000 nid=0x63a1 waiting on condition [0x00007f358329e000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x000000070a54f350> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
	at scala.concurrent.forkjoin.ForkJoinPool.idleAwaitWork(ForkJoinPool.java:2135)
	at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2067)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

"flink-akka.actor.default-dispatcher-4" daemon prio=10 tid=0x00000000011bd000 nid=0x63a0 waiting on condition [0x00007f358339f000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x0000000709ac8bd8> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
	at scala.concurrent.forkjoin.ForkJoinPool.idleAwaitWork(ForkJoinPool.java:2135)
	at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2067)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

"flink-akka.actor.default-dispatcher-3" daemon prio=10 tid=0x00007f358c6b9000 nid=0x639f waiting on condition [0x00007f35834a0000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x0000000709ac8bd8> (a akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
	at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

"flink-scheduler-1" daemon prio=10 tid=0x00007f358c61e800 nid=0x639d sleeping[0x00007f3588322000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at akka.actor.LightArrayRevolverScheduler.waitNanos(Scheduler.scala:226)
	at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:405)
	at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
	at java.lang.Thread.run(Thread.java:745)

"Service Thread" daemon prio=10 tid=0x00007f358c07d800 nid=0x6398 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007f358c07b000 nid=0x6397 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007f358c078000 nid=0x6396 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007f358c075800 nid=0x6395 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007f358c04c000 nid=0x6394 in Object.wait() [0x00007f35906eb000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x0000000703e84858> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
	- locked <0x0000000703e84858> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
	at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

"Reference Handler" daemon prio=10 tid=0x00007f358c04a000 nid=0x6393 in Object.wait() [0x00007f35907ec000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x0000000703e84470> (a java.lang.ref.Reference$Lock)
	at java.lang.Object.wait(Object.java:503)
	at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
	- locked <0x0000000703e84470> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x0000000000fb5800 nid=0x6384 waiting on condition [0x00007f35b5f4f000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x000000070b19afb0> (a scala.concurrent.impl.Promise$CompletionLatch)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
	at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
	at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
	at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
	at scala.concurrent.Await$.result(package.scala:107)
	at org.apache.flink.runtime.client.JobClient$.submitJobAndWait(JobClient.scala:226)
	at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.scala)
	at org.apache.flink.client.program.Client.run(Client.java:342)
	at org.apache.flink.client.program.Client.run(Client.java:310)
	at org.apache.flink.client.program.Client.run(Client.java:304)
	at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55)
	at com.github.projectflink.avro.GenerateLineitems.main(GenerateLineitems.java:54)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437)
	at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353)
	at org.apache.flink.client.program.Client.run(Client.java:254)
	at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:371)
	at org.apache.flink.client.CliFrontend.run(CliFrontend.java:344)
	at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1087)
	at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1114)

"VM Thread" prio=10 tid=0x00007f358c046000 nid=0x6392 runnable 

{code}


> JobClient does not wait until a job failed completely if submission exception
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-1556
>                 URL: https://issues.apache.org/jira/browse/FLINK-1556
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>
> If an exception occurs during job submission the {{JobClient}} received a {{SubmissionFailure}}. Upon receiving this message, the {{JobClient}} terminates itself and returns the error to the {{Client}}. This indicates to the user that the job has been completely failed which is not necessarily true. 
> If the user directly after such a failure submits another job, then it might be the case that not all slots of the formerly failed job are returned. This can lead to a {{NoRessourceAvailableException}}.
> We can solve this problem by waiting for the completion of the job failure in the {{JobClient}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)