You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Rinat <r....@cleverdata.ru> on 2017/09/04 11:29:38 UTC

Flink Job Deployment (Not enough resources)

Hi everyone, I’ve got the following problem, when I’m trying to submit new job and if cluster has not enough resources, job submission fails with the following exception
But in YARN job hangs and wait’s for requested resources. When resources become available, job successfully runs.

What can I do to be sure that job startup is completed successfully or completely failed  ?

Thx.

The program finished with the following exception:\n\njava.lang.RuntimeException: Unable to tell application master to stop once the specified job has been finised\n\tat org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:177)\n\tat org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:201)\n\tat org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)\n\tat org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:76)\n\tat org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)\n\tat org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)\n\tat org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)\n\tat org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)\n\tat org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)\n\tat org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)\n\tat org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)\n\tat java.security.AccessController.doPrivileged(Native Method)\n\tat javax.security.auth.Subject.doAs(Subject.java:422)\n\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)\n\tat org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)\n\tat org.apache.flink.client.CliFrontend.main(CliFrontend.java:1129)\nCaused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.\n\tat org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:789)\n\tat org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:171)\n\t... 15 more\nCaused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.\n\tat org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:79)\n\tat org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:784)\n\t... 16 more\nCaused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]\n\tat scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)\n\tat scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)\n\tat scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)\n\tat scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)\n\tat scala.concurrent.Await$.result(package.scala:190)\n\tat scala.concurrent.Await.result(package.scala)\n\tat org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:77)\n\t... 17 more", "stderr_lines": ["", "------------------------------------------------------------", " The program finished with the following exception:", "", "java.lang.RuntimeException: Unable to tell application master to stop once the specified job has been finised", "\tat org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:177)", "\tat org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:201)", "\tat org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)", "\tat org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:76)", "\tat org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)", "\tat org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)", "\tat org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)", "\tat org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)", "\tat org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)", "\tat org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)", "\tat org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)", "\tat java.security.AccessController.doPrivileged(Native Method)", "\tat javax.security.auth.Subject.doAs(Subject.java:422)", "\tat org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)", "\tat org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)", "\tat org.apache.flink.client.CliFrontend.main(CliFrontend.java:1129)", "Caused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.", "\tat org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:789)", "\tat org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:171)", "\t... 15 more", "Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.", "\tat org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:79)", "\tat org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:784)", "\t... 16 more", "Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]", "\tat scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)", "\tat scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)", "\tat scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)", "\tat scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)"

Re: Flink Job Deployment (Not enough resources)

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Rinat,

No, Flink does not have a switch to immediately cancel a job if it cannot
allocate enough resources.
Maybe YARN has a configuration parameter to define a timeout after which a
job is canceled if no resource become available.

2017-09-04 13:29 GMT+02:00 Rinat <r....@cleverdata.ru>:

> Hi everyone, I’ve got the following problem, when I’m trying to submit new
> job and if cluster has not enough resources, job submission fails with the
> following exception
> But in *YARN *job hangs and wait’s for requested resources. When
> resources become available, job successfully runs.
>
> What can I do to be sure that job startup is completed successfully or
> completely failed  ?
>
> Thx.
>
> *The program finished with the following
> exception:\n\njava.lang.RuntimeException: Unable to tell application master
> to stop once the specified job has been finised*\n\tat
> org.apache.flink.yarn.YarnClusterClient.stopAfterJob(
> YarnClusterClient.java:177)\n\tat org.apache.flink.yarn.
> YarnClusterClient.submitJob(YarnClusterClient.java:201)\n\tat
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)\n\tat
> org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(
> DetachedEnvironment.java:76)\n\tat org.apache.flink.client.
> program.ClusterClient.run(ClusterClient.java:387)\n\tat
> org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)\n\tat
> org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)\n\tat
> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)\n\tat
> org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)\n\tat
> org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)\n\tat
> org.apache.flink.runtime.security.HadoopSecurityContext$1.run(
> HadoopSecurityContext.java:43)\n\tat java.security.
> AccessController.doPrivileged(Native Method)\n\tat
> javax.security.auth.Subject.doAs(Subject.java:422)\n\tat
> org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1656)\n\tat org.apache.flink.runtime.security.
> HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)\n\tat
> org.apache.flink.client.CliFrontend.main(CliFrontend.java:1129)\nCaused
> by: org.apache.flink.util.FlinkException: Could not connect to the
> leading JobManager. Please check that the JobManager is running.\n\tat
> org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:789)\n\tat
> org.apache.flink.yarn.YarnClusterClient.stopAfterJob(
> YarnClusterClient.java:171)\n\t... 15 more\nCaused by:
> org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could
> not retrieve the leader gateway.\n\tat org.apache.flink.runtime.util.
> LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:79)\n\tat
> org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:784)\n\t...
> 16 more\nCaused by: java.util.concurrent.TimeoutException: Futures timed
> out after [10000 milliseconds]\n\tat scala.concurrent.impl.Promise$
> DefaultPromise.ready(Promise.scala:219)\n\tat
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)\n\tat
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)\n\tat
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)\n\tat
> scala.concurrent.Await$.result(package.scala:190)\n\tat
> scala.concurrent.Await.result(package.scala)\n\tat
> org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(
> LeaderRetrievalUtils.java:77)\n\t... 17 more", "stderr_lines": ["",
> "------------------------------------------------------------", " The
> program finished with the following exception:", "",
> "java.lang.RuntimeException: Unable to tell application master to stop once
> the specified job has been finised", "\tat org.apache.flink.yarn.
> YarnClusterClient.stopAfterJob(YarnClusterClient.java:177)", "\tat
> org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:201)",
> "\tat org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:442)",
> "\tat org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:76)",
> "\tat org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)",
> "\tat org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:838)",
> "\tat org.apache.flink.client.CliFrontend.run(CliFrontend.java:259)",
> "\tat org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1086)",
> "\tat org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1133)",
> "\tat org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1130)",
> "\tat org.apache.flink.runtime.security.HadoopSecurityContext$1.run(
> HadoopSecurityContext.java:43)", "\tat java.security.
> AccessController.doPrivileged(Native Method)", "\tat
> javax.security.auth.Subject.doAs(Subject.java:422)", "\tat
> org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1656)", "\tat org.apache.flink.runtime.security.
> HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)", "\tat
> org.apache.flink.client.CliFrontend.main(CliFrontend.java:1129)", "Caused
> by: org.apache.flink.util.FlinkException: Could not connect to the
> leading JobManager. Please check that the JobManager is running.", "\tat
> org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:789)",
> "\tat org.apache.flink.yarn.YarnClusterClient.stopAfterJob(YarnClusterClient.java:171)",
> "\t... 15 more", "Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException:
> Could not retrieve the leader gateway.", "\tat
> org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(
> LeaderRetrievalUtils.java:79)", "\tat org.apache.flink.client.
> program.ClusterClient.getJobManagerGateway(ClusterClient.java:784)",
> "\t... 16 more", "Caused by: java.util.concurrent.TimeoutException:
> Futures timed out after [10000 milliseconds]", "\tat
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)",
> "\tat scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)",
> "\tat scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)",
> "\tat scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(
> BlockContext.scala:53)"
>