Posted to issues@spark.apache.org by "Juan Rodríguez Hortalá (JIRA)" <ji...@apache.org> on 2017/10/24 17:17:00 UTC

[jira] [Commented] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled

    [ https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217299#comment-16217299 ] 

Juan Rodríguez Hortalá commented on SPARK-22148:
------------------------------------------------

Hi, 

I've been working on this issue, and I would like to get your feedback on the following approach. The idea is that instead of failing in `TaskSetManager.abortIfCompletelyBlacklisted` when a task cannot be scheduled on any executor but dynamic allocation is enabled, we register the task with `ExecutorAllocationManager`. `ExecutorAllocationManager` then requests additional executors for these "unschedulable tasks" by increasing the value returned by `ExecutorAllocationManager.maxNumExecutorsNeeded`. This counts these tasks twice, but that is intentional: the current executors have no slot that can run them, so we actually want new executors that can.

To avoid a deadlock caused by tasks remaining unschedulable forever, we store the timestamp at which a task was registered as unschedulable, and in `ExecutorAllocationManager.schedule` we abort the application if any task has been unschedulable for longer than a configurable threshold. This gives dynamic allocation a chance to acquire executors that can run the tasks, without making the application wait forever.
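To make the bookkeeping concrete, here is a rough sketch of the tracking logic described above. This is illustrative only, not the attached patch: the class name `UnschedulableTaskTracker` and its methods are hypothetical, standing in for the state that would live inside `ExecutorAllocationManager`.

```scala
// Hypothetical sketch of the unschedulable-task bookkeeping (not Spark API).
// Tracks tasks that cannot run on any current executor, inflates the executor
// target for them, and decides when to give up and abort.
class UnschedulableTaskTracker(timeoutMs: Long) {

  // taskId -> timestamp (ms) when the task was first seen as unschedulable
  private val firstSeen = scala.collection.mutable.Map[Long, Long]()

  // Called by the scheduler instead of aborting immediately, when dynamic
  // allocation is enabled and all current executors are blacklisted.
  def registerUnschedulable(taskId: Long, nowMs: Long): Unit =
    firstSeen.getOrElseUpdate(taskId, nowMs)

  // Called once the task is finally scheduled (e.g. on a newly added executor).
  def clear(taskId: Long): Unit = firstSeen.remove(taskId)

  // Extra demand to add to maxNumExecutorsNeeded: each unschedulable task is
  // counted once more, since no existing slot can run it.
  def extraExecutorsNeeded: Int = firstSeen.size

  // Checked periodically from the allocation manager's schedule loop: abort
  // only if some task has been unschedulable longer than the timeout.
  def shouldAbort(nowMs: Long): Boolean =
    firstSeen.values.exists(ts => nowMs - ts > timeoutMs)
}
```

The key design point is that registration records only the first timestamp, so repeated scheduling failures for the same task do not reset the abort timer.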

Attached is a patch with a draft for this approach. Looking forward to your feedback on this. 

> TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22148
>                 URL: https://issues.apache.org/jira/browse/SPARK-22148
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Juan Rodríguez Hortalá
>         Attachments: SPARK-22148_WIP.diff
>
>
> Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet, and the whole Spark job, with `task X (partition Y) cannot run anywhere due to node and executor blacklist. Blacklisting behavior can be configured via spark.blacklist.*.` when all the available executors are blacklisted for a pending task or TaskSet. This makes sense for static allocation, where the set of executors is fixed for the duration of the application, but it can lead to unnecessary job failures when dynamic allocation is enabled. For example, in a Spark application running a single job at a time, when a node fails at the end of a stage attempt, all other executors will complete their tasks, but the tasks that were running on the executors of the failing node will still be pending. Spark will keep waiting for those tasks for 2 minutes by default (spark.network.timeout) until the heartbeat timeout is triggered, and will then blacklist those executors for that stage. By that point, the other executors will already have been released after being idle for 1 minute by default (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't started yet and so there are no more tasks available (assuming the default of spark.speculation = false). Spark then fails the job because the only executors available are blacklisted for that stage.
> An alternative is to request more executors from the cluster manager in this situation. The request could be retried a configurable number of times, with a configurable wait between attempts, so that if the cluster manager fails to provide a suitable executor the job is aborted as in the previous case.
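The retry alternative described at the end of the issue could be sketched roughly as follows. Everything here is illustrative: `requestExecutorsWithRetry`, its parameters, and the injected `requestOne` callback are hypothetical names, not Spark configuration or API.

```scala
// Illustrative sketch of the retry loop from the issue description: ask the
// cluster manager for a usable executor up to maxAttempts times, waiting
// waitMs between attempts, then give up so the caller can abort the job.
// The sleep function is injectable so the loop is testable without waiting.
def requestExecutorsWithRetry(
    requestOne: () => Boolean, // true if a suitable executor was obtained
    maxAttempts: Int,
    waitMs: Long,
    sleep: Long => Unit = ms => Thread.sleep(ms)): Boolean = {
  var attempt = 0
  while (attempt < maxAttempts) {
    if (requestOne()) return true
    attempt += 1
    if (attempt < maxAttempts) sleep(waitMs)
  }
  false // caller aborts the job, as it would under static allocation
}
```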



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org