Posted to issues@spark.apache.org by "Karthik Palaniappan (JIRA)" <ji...@apache.org> on 2017/03/14 01:20:41 UTC

[jira] [Commented] (SPARK-19941) Spark should not schedule tasks on executors on decommissioning YARN nodes

    [ https://issues.apache.org/jira/browse/SPARK-19941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15923363#comment-15923363 ] 

Karthik Palaniappan commented on SPARK-19941:
---------------------------------------------

To repro:

Set up Spark on YARN (Hadoop 2.8). Configure YARN with node include and exclude files (yarn.resourcemanager.nodes.include-path and yarn.resourcemanager.nodes.exclude-path). Start with all nodes included and no nodes excluded.
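
For example, in yarn-site.xml (the file paths below are placeholders, not values from this report):

```
<property>
  <name>yarn.resourcemanager.nodes.include-path</name>
  <value>/etc/hadoop/conf/nodes.include</value> <!-- placeholder path -->
</property>
<property>
  <name>yarn.resourcemanager.nodes.exclude-path</name>
  <value>/etc/hadoop/conf/nodes.exclude</value> <!-- placeholder path -->
</property>
```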

Run a Spark job. I used:

```
spark-submit --class org.apache.spark.examples.SparkPi spark-examples.jar 100000
```

While the job is running, add all nodes to the exclude file and run `yarn rmadmin -refreshNodes -g 3600 -client`.

Expected: Spark stops scheduling tasks on the decommissioning executors; they exit after being idle for 60s, and the job hangs (since every node is excluded).
Actual: Spark continues to schedule tasks and the job completes successfully. The nodes are only decommissioned after the job finishes.

A less dramatic repro is to decommission only a subset of the nodes and check that no further tasks are scheduled on executors on those hosts.

> Spark should not schedule tasks on executors on decommissioning YARN nodes
> --------------------------------------------------------------------------
>
>                 Key: SPARK-19941
>                 URL: https://issues.apache.org/jira/browse/SPARK-19941
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, YARN
>    Affects Versions: 2.2.0
>         Environment: Hadoop 2.8.0-rc1
>            Reporter: Karthik Palaniappan
>
> Hadoop 2.8 added a mechanism to gracefully decommission Node Managers in YARN: https://issues.apache.org/jira/browse/YARN-914
> Essentially you can mark nodes to be decommissioned, and let them a) finish work in progress and b) finish serving shuffle data. But no new work will be scheduled on the node.
> Spark should respect when NMs are set to decommissioned, and similarly decommission executors on those nodes by not scheduling any more tasks on them.
> It looks like in the future YARN may inform the app master when containers will be killed: https://issues.apache.org/jira/browse/YARN-3784. However, I don't think Spark should schedule based on a timeout. We should gracefully decommission the executor as fast as possible (which is the spirit of YARN-914). The app master can query the RM for NM statuses (if it doesn't already have them) and stop scheduling on executors on NMs that are decommissioning.
> Stretch feature: The timeout may be useful in determining whether running further tasks on the executor is even helpful. Spark may be able to tell that shuffle data will not be consumed by the time the node is decommissioned, so it is not worth computing. The executor can be killed immediately.
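
A minimal sketch of the check described in the quoted description, assuming Hadoop 2.8 APIs (AllocateResponse#getUpdatedNodes and NodeState.DECOMMISSIONING are real; the excludeHost callback is a hypothetical stand-in for whatever Spark-side hook would stop scheduling on a host):

```
// Sketch only -- not Spark's actual code. Collects hosts the RM reports as
// DECOMMISSIONING on the AM's allocate() heartbeat. `excludeHost` is a
// hypothetical hook; the real integration point would be the Spark scheduler.
import scala.collection.JavaConverters._
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse
import org.apache.hadoop.yarn.api.records.NodeState

object DecommissionWatcher {
  /** Hosts reported as DECOMMISSIONING in this heartbeat's updated-node reports. */
  def decommissioningHosts(response: AllocateResponse): Set[String] =
    response.getUpdatedNodes.asScala
      .filter(_.getNodeState == NodeState.DECOMMISSIONING)
      .map(_.getNodeId.getHost)
      .toSet

  /** Call once per allocate() response; excludeHost stops scheduling on a host. */
  def onHeartbeat(response: AllocateResponse)(excludeHost: String => Unit): Unit =
    decommissioningHosts(response).foreach(excludeHost)
}
```

The updated-node reports arrive on the regular allocate() heartbeat, so the app master would not need any extra RM calls to notice nodes entering the DECOMMISSIONING state.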



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org