Posted to issues@spark.apache.org by "Arun Ahuja (JIRA)" <ji...@apache.org> on 2014/11/20 17:19:35 UTC
[jira] [Comment Edited] (SPARK-3837) Warn when YARN is killing containers for exceeding memory limits

    [ https://issues.apache.org/jira/browse/SPARK-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219544#comment-14219544 ] 

Arun Ahuja edited comment on SPARK-3837 at 11/20/14 4:19 PM:
-------------------------------------------------------------

There might still be a related issue here. In the YARN NodeManager logs I see:

{noformat}
2:46:21.815 AM  WARN    org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
Container [pid=1088,containerID=container_1416279928169_0045_01_000010] is running beyond physical memory limits. Current usage: 31.1 GB of 31 GB physical memory used; 33.1 GB of 65.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1416279928169_0045_01_000010 :
        |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
                |- 1088 7191 1088 1088 (bash) 0 0 110804992 339 /bin/bash -c /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms28672m -Xmx28672m  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Djava.io.tmpdir=/data/02/mapred/local/yarn/nm/usercache/ahujaa01/appcache/application_1416279928169_0045/container_1416279928169_0045_01_000010/tmp '-Dspark.akka.timeout=10000' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1416279928169_0045/container_1416279928169_0045_01_000010 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkDriver@ csmaz10-16 :47605/user/CoarseGrainedScheduler 82 csmaz08-11 4 application_1416279928169_0045 1> /var/log/hadoop-yarn/container/application_1416279928169_0045/container_1416279928169_0045_01_000010/stdout 2> /var/log/hadoop-yarn/container/application_1416279928169_0045/container_1416279928169_0045_01_000010/stderr
                        |- 1096 1088 1088 1088 (java) 506520 17351 35377422336 8141313 /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms28672m -Xmx28672m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Djava.io.tmpdir=/data/02/mapred/local/yarn/nm/usercache/ahujaa01/appcache/application_1416279928169_0045/container_1416279928169_0045_01_000010/tmp -Dspark.akka.timeout=10000 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1416279928169_0045/container_1416279928169_0045_01_000010 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkDriver@csmaz10:47605/user/CoarseGrainedScheduler 82 csmaz08-11 4 application_1416279928169_0045
{noformat}
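
The figures in this dump can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes a 4 KiB page size (the log does not state it) and reads the "GB" in the NodeManager message as GiB. The variable names are mine, not anything from YARN or Spark.

```python
# Cross-check the NodeManager numbers from the process-tree dump above.
# Assumption: 4096-byte pages (typical on x86-64 Linux, not stated in the log).
PAGE_SIZE_BYTES = 4096
GIB = 1024 ** 3

# RSSMEM_USAGE(PAGES) and VMEM_USAGE(BYTES) columns for the two processes.
rss_pages = {"bash": 339, "java": 8141313}
vmem_bytes = {"bash": 110804992, "java": 35377422336}

container_limit_gib = 31   # from "31.1 GB of 31 GB physical memory used"
heap_mib = 28672           # from -Xms28672m -Xmx28672m

rss_gib = sum(rss_pages.values()) * PAGE_SIZE_BYTES / GIB
vmem_gib = sum(vmem_bytes.values()) / GIB
headroom_mib = container_limit_gib * 1024 - heap_mib
overshoot_mib = (rss_gib - container_limit_gib) * 1024

print(f"physical: {rss_gib:.2f} GiB, virtual: {vmem_gib:.2f} GiB")
print(f"non-heap headroom inside the container: {headroom_mib} MiB")
print(f"overshoot past the limit: {overshoot_mib:.0f} MiB")
```

Under those assumptions the JVM heap alone is 28 GiB, which leaves 3072 MiB of the 31 GB container for everything off-heap, and the process tree is less than 100 MiB over the limit when it is killed.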

But when I run
{noformat}
yarn logs --applicationId application_1416279928169_0045 | grep Container | grep failed

or

yarn logs --applicationId application_1416279928169_0045 | grep Container | grep killed

{noformat}

There are no results. Also, the errors being reported are fetch failures from the external shuffle service:
{noformat}
org.apache.spark.shuffle.FetchFailedException: Failed to connect to  csmau08-11 :7337
        at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
        at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
        at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
        at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:89)
        at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
{noformat}


The node that the fetch failed from and the node whose container was killed by YARN are both csmaz08-11.

That executor does log the following, however:

{noformat}
14/11/20 02:45:18 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
{noformat}

Following the thread a bit, csmaz08-11 had a FetchFailure from csmau08-01, which also has the following in its NM logs:
{noformat}
Container [pid=490,containerID=container_1416279928169_0045_01_000024] is running beyond physical memory limits. Current usage: 31.1 GB of 31 GB physical memory used; 32.6 GB of 65.1 GB virtual memory used. Killing container.
{noformat}
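
Since `yarn logs` shows nothing, one way to find every affected container is to scan the NodeManager logs directly for this message. A minimal sketch, assuming the message wording is exactly as quoted in this comment (other Hadoop versions may phrase it differently); `find_memory_kills` is a hypothetical helper, not part of any YARN tooling.

```python
import re

# Pattern for the ContainersMonitorImpl kill message as quoted in this thread.
# Assumption: the wording and the "GB" unit match the lines shown above.
KILL_RE = re.compile(
    r"Container \[pid=(?P<pid>\d+),containerID=(?P<container>[^\]\s]+)\] "
    r"is running beyond physical memory limits\. "
    r"Current usage: (?P<used>[\d.]+) GB of (?P<limit>[\d.]+) GB physical memory used"
)


def find_memory_kills(lines):
    """Return one dict per container killed for exceeding physical memory."""
    kills = []
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            kills.append({
                "pid": int(match["pid"]),
                "container": match["container"],
                "used_gb": float(match["used"]),
                "limit_gb": float(match["limit"]),
            })
    return kills


# The two kill messages from this comment, plus one unrelated line.
sample = [
    "Container [pid=1088,containerID=container_1416279928169_0045_01_000010] "
    "is running beyond physical memory limits. Current usage: 31.1 GB of 31 GB "
    "physical memory used; 33.1 GB of 65.1 GB virtual memory used. Killing container.",
    "Container [pid=490,containerID=container_1416279928169_0045_01_000024] "
    "is running beyond physical memory limits. Current usage: 31.1 GB of 31 GB "
    "physical memory used; 32.6 GB of 65.1 GB virtual memory used. Killing container.",
    "14/11/20 02:45:18 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM",
]

for kill in find_memory_kills(sample):
    print(kill["container"], kill["used_gb"], "of", kill["limit_gb"], "GB")
```

In practice the `lines` argument would be the open NodeManager log file on each host rather than an in-memory list.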



> Warn when YARN is killing containers for exceeding memory limits
> ----------------------------------------------------------------
>
>                 Key: SPARK-3837
>                 URL: https://issues.apache.org/jira/browse/SPARK-3837
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 1.1.0
>            Reporter: Sandy Ryza
>
> YARN now lets application masters know when it kills their containers for exceeding memory limits.  Spark should log something when this happens.
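
What the issue asks for could look roughly like the sketch below. It is language-neutral pseudologic written in Python, not Spark's actual code. The two exit-status values are what I understand Hadoop's `ContainerExitStatus` to use for memory kills; treat them as an assumption and check them against your Hadoop version. `memory_kill_warning` is a hypothetical name.

```python
# Exit statuses that, as far as I know, Hadoop's ContainerExitStatus uses
# for memory kills. Assumption: verify against your Hadoop version.
KILLED_EXCEEDED_VMEM = -103
KILLED_EXCEEDED_PMEM = -104


def memory_kill_warning(container_id, exit_status, diagnostics=""):
    """Return a warning string for memory kills, or None for any other exit."""
    reasons = {
        KILLED_EXCEEDED_PMEM: "physical",
        KILLED_EXCEEDED_VMEM: "virtual",
    }
    kind = reasons.get(exit_status)
    if kind is None:
        return None
    return (f"Container {container_id} killed by YARN for exceeding {kind} "
            f"memory limits. {diagnostics}").strip()


# Example: the application master sees a completed container with a
# memory-kill status and logs a warning instead of staying silent.
message = memory_kill_warning(
    "container_1416279928169_0045_01_000010",
    KILLED_EXCEEDED_PMEM,
    "31.1 GB of 31 GB physical memory used.",
)
if message is not None:
    print("WARN", message)
```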


