Posted to issues@spark.apache.org by "Arun Ahuja (JIRA)" <ji...@apache.org> on 2014/11/20 17:19:35 UTC
[jira] [Comment Edited] (SPARK-3837) Warn when YARN is killing
containers for exceeding memory limits
[ https://issues.apache.org/jira/browse/SPARK-3837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14219544#comment-14219544 ]
Arun Ahuja edited comment on SPARK-3837 at 11/20/14 4:19 PM:
-------------------------------------------------------------
There might still be a related issue here. In the YARN NodeManager logs I see:
{noformat}
2:46:21.815 AM WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
Container [pid=1088,containerID=container_1416279928169_0045_01_000010] is running beyond physical memory limits. Current usage: 31.1 GB of 31 GB physical memory used; 33.1 GB of 65.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1416279928169_0045_01_000010 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 1088 7191 1088 1088 (bash) 0 0 110804992 339 /bin/bash -c /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms28672m -Xmx28672m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Djava.io.tmpdir=/data/02/mapred/local/yarn/nm/usercache/ahujaa01/appcache/application_1416279928169_0045/container_1416279928169_0045_01_000010/tmp '-Dspark.akka.timeout=10000' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1416279928169_0045/container_1416279928169_0045_01_000010 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkDriver@ csmaz10-16 :47605/user/CoarseGrainedScheduler 82 csmaz08-11 4 application_1416279928169_0045 1> /var/log/hadoop-yarn/container/application_1416279928169_0045/container_1416279928169_0045_01_000010/stdout 2> /var/log/hadoop-yarn/container/application_1416279928169_0045/container_1416279928169_0045_01_000010/stderr
|- 1096 1088 1088 1088 (java) 506520 17351 35377422336 8141313 /usr/java/jdk1.7.0_45-cloudera/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms28672m -Xmx28672m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Djava.io.tmpdir=/data/02/mapred/local/yarn/nm/usercache/ahujaa01/appcache/application_1416279928169_0045/container_1416279928169_0045_01_000010/tmp -Dspark.akka.timeout=10000 -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1416279928169_0045/container_1416279928169_0045_01_000010 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkDriver@csmaz10:47605/user/CoarseGrainedScheduler 82 csmaz08-11 4 application_1416279928169_0045
{noformat}
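As a rough cross-check of the figures in this dump, the sketch below converts the RSSMEM_USAGE(PAGES) column to GB and compares it with the -Xmx setting. It assumes a 4 KiB page size (the log does not state it) and that the "GB" in the NodeManager message means 2^30 bytes; the variable and function names are illustrative only.

```python
# Assumption: 4 KiB pages (typical on Linux x86-64); not stated in the log.
PAGE_SIZE_BYTES = 4096
GIB = 1024 ** 3

# RSSMEM_USAGE(PAGES) values copied from the process-tree dump above.
rss_pages = {"bash (pid 1088)": 339, "java (pid 1096)": 8141313}

heap_mb = 28672            # from -Xms28672m -Xmx28672m
container_limit_gb = 31.0  # from "31.1 GB of 31 GB physical memory used"


def pages_to_gib(pages, page_size=PAGE_SIZE_BYTES):
    """Convert a resident-set size given in pages to GiB."""
    return pages * page_size / GIB


total_rss_gib = sum(pages_to_gib(p) for p in rss_pages.values())
heap_gib = heap_mb / 1024
# Lower bound on resident memory outside the Java heap, since the resident
# part of the heap cannot exceed -Xmx.
non_heap_gib = total_rss_gib - heap_gib
# Room the container limit leaves on top of the maximum heap.
headroom_gib = container_limit_gb - heap_gib

print(f"total RSS       : {total_rss_gib:.2f} GB")
print(f"max heap        : {heap_gib:.2f} GB")
print(f"non-heap (min)  : {non_heap_gib:.2f} GB")
print(f"limit minus heap: {headroom_gib:.2f} GB")
```

Under those assumptions the resident set rounds to the reported 31.1 GB, and at least slightly more than 3 GB of it lies outside the 28 GB heap, which is more than the 3 GB the container limit leaves on top of -Xmx.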
But when I run
{noformat}
yarn logs --applicationId application_1416279928169_0045 | grep Container | grep failed
or
yarn logs --applicationId application_1416279928169_0045 | grep Container | grep killed
{noformat}
There are no results. Also, the errors being reported are fetch failures from the external shuffle service:
{noformat}
org.apache.spark.shuffle.FetchFailedException: Failed to connect to csmau08-11 :7337
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:89)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:44)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{noformat}
The node that the fetch failed from and the node whose container was killed by YARN are both csmaz08-11.
That executor does log the following, however:
{noformat}
14/11/20 02:45:18 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
{noformat}
Following the thread a bit, csmaz08-11 had a FetchFailure from csmau08-01, which also has the following in its NodeManager logs:
{noformat}
Container [pid=490,containerID=container_1416279928169_0045_01_000024] is running beyond physical memory limits. Current usage: 31.1 GB of 31 GB physical memory used; 32.6 GB of 65.1 GB virtual memory used. Killing container.
{noformat}
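Since grepping the `yarn logs` output turns up nothing, the kill messages have to be located in the NodeManager logs themselves. A minimal sketch of scanning NodeManager log lines for them follows. The function name `find_memory_kills` is hypothetical, and the pattern is derived only from the two messages quoted in this comment, so it handles only sizes reported in GB and may not match other Hadoop versions.

```python
import re

# Pattern built from the NodeManager messages quoted above (GB units only).
KILL_LINE = re.compile(
    r"Container \[pid=(?P<pid>\d+),containerID=(?P<container_id>container_\w+)\] "
    r"is running beyond physical memory limits\. "
    r"Current usage: (?P<used>[\d.]+) GB of (?P<limit>[\d.]+) GB physical memory used"
)


def find_memory_kills(lines):
    """Return one dict per line reporting a physical-memory kill."""
    kills = []
    for line in lines:
        match = KILL_LINE.search(line)
        if match:
            kills.append({
                "pid": int(match.group("pid")),
                "container_id": match.group("container_id"),
                "used_gb": float(match.group("used")),
                "limit_gb": float(match.group("limit")),
            })
    return kills


# The two messages from the NodeManager logs quoted in this comment.
sample = [
    "Container [pid=1088,containerID=container_1416279928169_0045_01_000010] "
    "is running beyond physical memory limits. Current usage: 31.1 GB of 31 GB "
    "physical memory used; 33.1 GB of 65.1 GB virtual memory used. Killing container.",
    "Container [pid=490,containerID=container_1416279928169_0045_01_000024] "
    "is running beyond physical memory limits. Current usage: 31.1 GB of 31 GB "
    "physical memory used; 32.6 GB of 65.1 GB virtual memory used. Killing container.",
]

kills = find_memory_kills(sample)
for kill in kills:
    print(kill["container_id"], kill["used_gb"], "GB used of", kill["limit_gb"], "GB")
```

In practice the lines would come from the NodeManager log file on each host rather than from a hard-coded list.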
> Warn when YARN is killing containers for exceeding memory limits
> ----------------------------------------------------------------
>
> Key: SPARK-3837
> URL: https://issues.apache.org/jira/browse/SPARK-3837
> Project: Spark
> Issue Type: Improvement
> Components: YARN
> Affects Versions: 1.1.0
> Reporter: Sandy Ryza
>
> YARN now lets application masters know when it kills their containers for exceeding memory limits. Spark should log something when this happens.
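The kind of warning requested here could be sketched as below. This is an illustration in Python, not Spark's actual implementation: the function name and message wording are hypothetical. The values -103 and -104 are taken to be the `KILLED_EXCEEDED_VMEM` and `KILLED_EXCEEDED_PMEM` constants of Hadoop's `ContainerExitStatus`, and `spark.yarn.executor.memoryOverhead` is assumed to be the relevant setting for this Spark version.

```python
# Assumed to mirror org.apache.hadoop.yarn.api.records.ContainerExitStatus.
KILLED_EXCEEDED_VMEM = -103
KILLED_EXCEEDED_PMEM = -104

MEMORY_KILL_REASONS = {
    KILLED_EXCEEDED_VMEM: "virtual",
    KILLED_EXCEEDED_PMEM: "physical",
}


def memory_kill_warning(container_id, exit_status):
    """Return a warning string if the container was killed for exceeding
    a memory limit, otherwise None. Hypothetical helper for illustration."""
    kind = MEMORY_KILL_REASONS.get(exit_status)
    if kind is None:
        return None
    return (
        f"Container {container_id} killed by YARN for exceeding {kind} memory "
        "limits. Consider raising spark.yarn.executor.memoryOverhead."
    )


warning = memory_kill_warning("container_1416279928169_0045_01_000010", -104)
print(warning)
```

An application master that receives one of these exit statuses in a completed-container report could log such a message at WARN level, so the cause is visible without reading the NodeManager logs.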