Posted to user@spark.apache.org by Link Qian <fa...@outlook.com> on 2017/06/23 09:58:49 UTC

Container exited with a non-zero exit code 1

Hello,


I submitted a Spark job to a YARN cluster with the spark-submit command. The environment is CDH 5.4 with Spark 1.3.0, on a cluster of 6 compute nodes with 64 GB of memory each. YARN allows a maximum of 16 GB of memory per container. The job requests 6 executors with 8 GB of memory each, plus 8 GB for the driver. However, I always get the errors below, even after submitting the job several times. Any help?
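
For context, the submit command looks roughly like the following (the --class value and application jar are just placeholders here, not the real names):

    spark-submit \
      --master yarn-cluster \
      --num-executors 6 \
      --executor-memory 8g \
      --driver-memory 8g \
      --class com.example.MyJob \
      my-job.jar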


------------ Error logs of the Application Master for the job ------------


17/06/22 15:18:44 INFO yarn.YarnAllocator: Completed container container_1498115278902_0001_02_000013 (state: COMPLETE, exit status: 1)
17/06/22 15:18:44 INFO yarn.YarnAllocator: Container marked as failed: container_1498115278902_0001_02_000013. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1498115278902_0001_02_000013
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    at org.apache.hadoop.util.Shell.run(Shell.java:455)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1



-------- Here are the YARN application logs for the job --------
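
These are the aggregated container logs. Assuming YARN log aggregation is enabled on the cluster, they can typically be pulled with something like:

    yarn logs -applicationId application_1498115278902_0001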

LogLength:2611
Log Contents:
17/06/22 15:18:09 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
17/06/22 15:18:10 INFO spark.SecurityManager: Changing view acls to: yarn,root
17/06/22 15:18:10 INFO spark.SecurityManager: Changing modify acls to: yarn,root
17/06/22 15:18:10 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, root); users with modify permissions: Set(yarn, root)
17/06/22 15:18:10 INFO slf4j.Slf4jLogger: Slf4jLogger started
17/06/22 15:18:10 INFO Remoting: Starting remoting
17/06/22 15:18:10 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@dn006:45701]
17/06/22 15:18:10 INFO Remoting: Remoting now listens on addresses: [akka.tcp://driverPropsFetcher@dn006:45701]
17/06/22 15:18:10 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 45701.
17/06/22 15:18:40 WARN security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1684)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:139)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:235)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:155)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    ... 4 more


---- A snippet of the ResourceManager (RM) log for the job ----

2017-06-22 15:18:41,586 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1498115278902_0001_02_000014 of capacity <memory:6656, vCores:4> on host dn006:8041, which currently has 0 containers, <memory:0, vCores:0> used and <memory:8192, vCores:32> available, release resources=true
2017-06-22 15:18:41,586 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1498115278902_0001_000002 released container container_1498115278902_0001_02_000014 on node: host: dn006:8041 #containers=0 available=8192 used=0 with event: FINISHED
2017-06-22 15:18:41,677 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1498115278902_0001_02_000012 Container Transitioned from RUNNING to COMPLETED
2017-06-22 15:18:41,678 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: Completed container: container_1498115278902_0001_02_000012 in state: COMPLETED event:FINISHED
2017-06-22 15:18:41,678 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root    OPERATION=AM Released Container    TARGET=SchedulerApp    RESULT=SUCCESS    APPID=application_1498115278902_0001    CONTAINERID=container_1498115278902_0001_02_000012
2017-06-22 15:18:41,678 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Released container container_1498115278902_0001_02_000012 of capacity <memory:6656, vCores:4> on host dn003:8041, which currently has 0 containers, <memory:0, vCores:0> used and <memory:8192, vCores:32> available, release resources=true
2017-06-22 15:18:41,678 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application attempt appattempt_1498115278902_0001_000002 released container container_1498115278902_0001_02_000012 on node: host: dn003:8041 #containers=0 available=8192 used=0 with event: FINISHED
2017-06-22 15:18:41,678 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1498115278902_0001_02_000010 Container Transitioned from RUNNING to COMPLETED
2017-06-22 15:18:41,678 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt: Completed container: container_1498115278902_0001_02_000010 in state: COMPLETED event:FINISHED


Thanks in advance.

Link Qian


Container exited with a non-zero exit code 1

Posted by Link Qian <fa...@outlook.com>.
Any suggestions from the Spark dev group?

________________________________
From: Link Qian <fa...@outlook.com>
Sent: Friday, June 23, 2017 9:58 AM
To: user@spark.apache.org
Subject: Container exited with a non-zero exit code 1

