Posted to user@spark.apache.org by Sameer Choudhary <sa...@gmail.com> on 2016/11/20 22:02:14 UTC

Fwd: Yarn resource utilization with Spark pipe()

Hi,

I am working on a Spark 1.6.2 application on a YARN-managed EMR cluster that
uses the RDD pipe() method to process my data. I start a lightweight daemon
process that launches a process for each task via pipes. This is to ensure
that I don't run into https://issues.apache.org/jira/browse/SPARK-671.
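
For context, here is a rough sketch of how the pipe stage is wired up (the
script name, paths, and daemon protocol below are placeholders, not my exact
code):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("pipe-daemon-example")
    sc = SparkContext(conf=conf)

    # Placeholder input path; the real job reads from S3.
    records = sc.textFile("s3://my-bucket/input/")

    # pipe() streams each partition's lines to the external command's stdin
    # and reads the command's stdout back as the output partition.
    # daemon_client.sh is a placeholder wrapper that asks the long-running
    # daemon on the node to fork the actual worker process, instead of the
    # executor JVM forking it directly. The script is shipped to the
    # executors ahead of time (e.g. with --files).
    processed = records.pipe("./daemon_client.sh")

    processed.saveAsTextFile("s3://my-bucket/output/")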

I'm running into Spark job failures caused by task failures across the
cluster. Here are the questions that I think would help in understanding
the issue:
- How does resource allocation in PySpark work? How do YARN and Spark
track the memory consumed by Python processes launched on the worker nodes?
- As an example, let's say Spark starts n tasks on a worker node. These n
tasks start n processes via pipe. Memory for the executors is already
reserved at application launch. As the piped processes run, their memory
footprint grows and eventually there is not enough memory on the box. How
will YARN and Spark behave in this case? Will the executors be killed, or
will my processes be killed, eventually failing the task? I think this
could lead to cascading task failures across the cluster as retry attempts
also fail, eventually terminating the Spark job. Is there a way to avoid
this? (A sketch of the memory settings I believe are involved follows
after these questions.)
- When we set the number of executors in SparkConf, are they distributed
evenly across the nodes? One approach to get around this problem would be
to limit the number of executors that YARN can launch on each host, so
that we manage the memory for the piped processes outside of YARN (see the
sizing sketch below). Is there a way to do this?
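
In case it helps frame the memory questions above, here is a minimal sketch
of the settings I understand to be involved (values are made up). My current
understanding is that spark.executor.memory only covers the executor's JVM
heap, while spark.yarn.executor.memoryOverhead is the extra headroom YARN
grants the container for everything else, Python workers and child processes
included; the NodeManager kills the whole container if the container's
process tree exceeds heap plus overhead. What I'm unsure about is whether
processes forked by my separate daemon fall inside that process tree at all.

    from pyspark import SparkConf

    # Illustrative numbers only, not a recommendation.
    conf = (SparkConf()
            .setAppName("pipe-daemon-example")
            # JVM heap per executor; YARN reserves this plus the overhead
            # below as a single container.
            .set("spark.executor.memory", "4g")
            # Off-heap headroom (in MB) per container. Python workers and
            # any child processes of the executor have to fit in here, or
            # the NodeManager kills the container on its memory check.
            .set("spark.yarn.executor.memoryOverhead", "2048")
            .set("spark.executor.cores", "2")
            .set("spark.executor.instances", "10"))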
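
For the last question, as far as I know Spark does not expose a direct
"max executors per host" setting, so the workaround I had in mind is to
size each container so that only a fixed number fit under the NodeManager's
capacity, leaving the rest of the node's physical memory outside YARN for
the piped processes. Roughly (all numbers hypothetical):

    # Hypothetical node: 61 GB of physical RAM, of which only 32 GB is
    # handed to YARN via yarn.nodemanager.resource.memory-mb; the remaining
    # ~29 GB stays outside YARN's accounting for the piped processes.
    yarn_node_mb = 32 * 1024
    executors_per_node = 2          # target cap per host

    # Each container = executor heap + memoryOverhead. Sizing the container
    # to fill 1/executors_per_node of the NodeManager capacity means YARN
    # can fit at most that many executors on the host, assuming memory (not
    # vcores) is the binding resource.
    container_mb = yarn_node_mb // executors_per_node    # 16384
    overhead_mb = 2048              # spark.yarn.executor.memoryOverhead
    heap_mb = container_mb - overhead_mb
    # => spark.executor.memory=14336m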

Thanks,
Sameer