Posted to user@spark.apache.org by pranavkrs <pr...@yahoo.com> on 2015/03/09 22:22:29 UTC

yarn + spark deployment issues (high memory consumption and task hung)

YARN + Spark:
I am running my Spark job (on YARN) on a 6-data-node cluster with 512 GB each. I
have had a tough time configuring it, because the job hangs in one or more
tasks on some executor for an indefinite time. The stage can be as
simple as an RDD count, and the bottleneck is not always at the same point.
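
To give a sense of the shape of the job, it is essentially a chain of
transformations on an input RDD followed by a simple action, roughly like this
(the names and paths below are made up, not my actual code):

val input = sc.textFile("hdfs:///data/input")   // sc is the SparkContext; hypothetical input path
val parsed = input.map(_.split(","))            // a transformation
val filtered = parsed.filter(_.nonEmpty)        // another transformation
println(filtered.count())                       // even an action this simple can hang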

So there must be something goofy in my configuration that is causing the
deadlock in one of the stages. I run multiple transformations on the input
RDD, and I see the log message below, where an executor consumed ~36 GB in
less than an hour. After a 2-3 hour run the executor goes OOM, the container
gets killed, and a new one gets created that continues to work fine until the
issue repeats. I configured the allowed executor failures to a high number, so
the application itself never fails.

2015-03-09 14:11:17,261 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl:
Memory usage of ProcessTree 1800 for container-id
container_1425683313223_0026_01_000002: 35.7 GB of 85 GB physical memory
used; 91.3 GB of 178.5 GB virtual memory used

./spark-submit \
--conf spark.storage.memoryFraction=0.6 \
--conf spark.eventLog.overwrite=true \
--conf spark.driver.maxResultSize=5g \
--conf spark.yarn.executor.memoryOverhead=5120 \
--conf spark.akka.frameSize=512 \
--conf spark.eventLog.enabled=true \
--master yarn-cluster \
--num-executors 6 \
--executor-memory 80G \
--driver-memory 40G \
--executor-cores 20 \
--class <main class> /tmp/main-all.jar
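
(For what it's worth, the 80G of executor memory plus the 5120 MB
memoryOverhead is exactly the 85 GB physical-memory limit that the
ContainersMonitorImpl line above reports for the container.)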

Here are the questions whose answers would help me a great deal:

1> Is it common for executors to fill up this fast? I am not explicitly
calling RDD.persist or unpersist. I tried doing so in the past (sketched
below), but it didn't yield any improvement. Is it common for containers to
get killed and new ones to get spawned during a Spark job run?
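
Roughly what I had tried (simplified, with made-up names; "input" stands for
my source RDD):

import org.apache.spark.storage.StorageLevel

val transformed = input.map(_.split(","))           // placeholder transformation
transformed.persist(StorageLevel.MEMORY_AND_DISK)   // cache it before reusing it in several actions
val total = transformed.count()
// ... a few more actions over "transformed" ...
transformed.unpersist()                             // release the cached blocks afterwards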

2> Whenever a stage is hung processing a task, on the YARN/Spark UI I
sometimes see "CANNOT FIND ADDRESS" in the executor column; other times the
executor is listed, but its task size shows 0 and all the tasks on that
executor remain in the RUNNING state. How can we debug this? Having trace
logging enabled also didn't yield any good evidence of what is going wrong.

3> I read about the RDD cleanup process, but I still don't completely
understand how these RDDs get purged on their own. I set
spark.storage.memoryFraction to 0.6, which is quite substantial, but RDD sizes
vary depending on their content. I no longer need an RDD once I have completed
all the transformations on it; how can I make sure it gets purged so my
executors don't run into an OOM situation?
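
For concreteness, is something like this the right pattern (the names here
are placeholders, not my actual code)?

import org.apache.spark.storage.StorageLevel

val intermediate = input.map(_.split(",")).persist(StorageLevel.MEMORY_AND_DISK)
val result = intermediate.filter(_.length > 3).map(_.mkString("|"))
result.saveAsTextFile("hdfs:///some/output")   // materialize what I actually need
intermediate.unpersist()                       // then explicitly drop the intermediate RDD so it can't pile up on the executors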

Thank you,
Regards





