Posted to dev@spark.apache.org by Juan Rodríguez Hortalá <ju...@gmail.com> on 2017/10/24 17:18:26 UTC

(SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled

Hi,

I've been working on this issue, and I would like to get your feedback on
the following approach. Instead of aborting in
`TaskSetManager.abortIfCompletelyBlacklisted` when a task cannot be
scheduled on any executor but dynamic allocation is enabled, we register
the task with `ExecutorAllocationManager`. `ExecutorAllocationManager`
then requests additional executors for these "unschedulable tasks" by
increasing the value returned by
`ExecutorAllocationManager.maxNumExecutorsNeeded`. This counts these tasks
twice, which is intentional: the current executors have no usable slot for
them, so we actually want new executors that are able to run them.

To avoid a deadlock when tasks stay unschedulable forever, we store the
timestamp at which each task was registered as unschedulable, and in
`ExecutorAllocationManager.schedule` we abort the application if any task
has been unschedulable for longer than a configurable age threshold. This
gives dynamic allocation a chance to obtain executors that can run the
tasks, without making the application wait forever.

A draft patch for this approach is attached to the JIRA. Looking forward
to your feedback.