Posted to issues@spark.apache.org by "Imran Rashid (JIRA)" <ji...@apache.org> on 2016/12/02 17:20:58 UTC

[jira] [Commented] (SPARK-15815) Hang while enable blacklistExecutor and DynamicExecutorAllocator

    [ https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15715719#comment-15715719 ] 

Imran Rashid commented on SPARK-15815:
--------------------------------------

[~SuYan] I've been mulling this over for a while, and I think my earlier proposal is a good one.  We'd need two changes:

1. When unschedulability is detected, kill an executor that is blacklisted for the unschedulable task and request another one.
2. When we detect unschedulability due to blacklisting, instead of immediately aborting the taskset, we should start a countdown (say 5 min).  If the taskset is still unschedulable when the countdown expires, then we abort it.  (Both changes are sketched below.)
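
Here is a minimal sketch of how the two pieces could fit together, assuming a helper inside the scheduler.  All the names ({{UnschedulableTaskSetMonitor}}, the callbacks, etc.) are made up for illustration and do not correspond to actual Spark internals:

{code}
import scala.collection.mutable

// Hypothetical sketch of changes (1) and (2); the real logic would
// live somewhere inside TaskSchedulerImpl.
class UnschedulableTaskSetMonitor(
    killExecutor: String => Unit,        // ask cluster manager to kill
    requestExecutor: () => Unit,         // ask for a replacement
    abortTaskSet: String => Unit,
    timeoutMs: Long = 5 * 60 * 1000L) {  // the 5 minute countdown

  // taskSetId -> wall-clock deadline for becoming schedulable again
  private val deadlines = mutable.Map[String, Long]()

  // Change 1: on detecting an unschedulable task, kill an executor that
  // is blacklisted for it and request a replacement.
  // Change 2: start the abort countdown instead of failing immediately.
  def onUnschedulable(taskSetId: String, blacklistedExec: String): Unit = {
    killExecutor(blacklistedExec)
    requestExecutor()
    deadlines.getOrElseUpdate(taskSetId, System.currentTimeMillis() + timeoutMs)
  }

  // Called when a task from the set actually launches: the set is
  // schedulable again, so cancel the countdown.
  def onTaskLaunched(taskSetId: String): Unit = {
    deadlines.remove(taskSetId)
  }

  // Invoked periodically; aborts any task set whose countdown expired.
  def checkDeadlines(): Unit = {
    val now = System.currentTimeMillis()
    val expired = deadlines.collect { case (ts, d) if now >= d => ts }
    expired.foreach { ts =>
      abortTaskSet(ts)
      deadlines.remove(ts)
    }
  }
}
{code}

The key point is that killing the blacklisted executor and starting the countdown happen together, so the cluster manager has the full countdown window to deliver a replacement.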

In the case you outline above, this should have the desired effect.  When DA has you down to just one executor for the last task and that executor gets blacklisted, you'd kill the executor and simultaneously start the countdown.  Hopefully the cluster manager gives you another executor before the countdown is up, and then your job continues happily.

Two other situations worth considering: (a) the cluster manager gives us another executor on a bad node.  Tasks fail on this new executor, which again gets blacklisted.  I think this is OK.  The countdown would get reset when we schedule the task on the new executor, even though the task will fail.  Then when the new executor gets blacklisted, the countdown would simply start over, and we'd kill that executor and request another, as in (1).

(b) the cluster manager fails to give you another executor before the timeout is up.  We could either abort the job, or just let the app hang indefinitely (e.g., ignore the countdown in the specific case that there aren't any executors).  In fact, the [code already lets the app wait indefinitely if there are no executors|https://github.com/apache/spark/blob/48778976e0566d9c93a8c900825def82c6b81fd6/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L594].
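
For (b), the countdown check could just mirror that existing behavior: only abort when the deadline has passed *and* we still have executors (all of them blacklisted for the pending task).  A sketch, with illustrative names:

{code}
object BlacklistCountdown {
  // Sketch only: abort when the countdown has expired but executors
  // exist (i.e. every live executor is blacklisted for the task).
  // If there are no executors at all, fall through and keep waiting,
  // matching the existing TaskSetManager behavior linked above.
  def maybeAbort(
      taskSetId: String,
      hasExecutors: Boolean,
      deadlinePassed: Boolean,
      abort: String => Unit): Unit = {
    if (deadlinePassed && hasExecutors) {
      abort(taskSetId)
    }
    // else: no executors -> wait indefinitely for the cluster manager
  }
}
{code}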

Note that SPARK-16554, automated killing of blacklisted executors, is related, but it is insufficient to handle (1) above.  SPARK-16554 will only kill an executor that is blacklisted for the entire application, whereas here we need to kill an executor that is blacklisted for even just one task.
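
To make the distinction concrete, here's a toy predicate for which executors the kill in (1) needs to consider; the two sets are stand-ins for the real blacklist-tracker state, not Spark APIs:

{code}
object BlacklistScope {
  def shouldKillToUnblock(
      execId: String,
      appBlacklist: Set[String],        // blacklisted for the whole app
      taskSetBlacklist: Set[String]     // blacklisted for just this task set
  ): Boolean = {
    appBlacklist.contains(execId) ||    // the only case SPARK-16554 handles
      taskSetBlacklist.contains(execId) // the narrower case we need here
  }
}
{code}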

An alternative to actively killing the executor would be to somehow inform the {{ExecutorAllocationManager}} that we have a task which *cannot* be scheduled on the existing executors, so that it requests a new executor and leaves the old one alone.  However, that makes the implementation significantly more complex.  Though it would be more efficient, I think we should keep things simple and live with a bit of inefficiency in this case?
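
If someone did want to explore that route, the hook might look something like this.  To be clear, no such method exists on {{ExecutorAllocationManager}} today; this is purely hypothetical:

{code}
// Hypothetical hook -- not an existing Spark API.
trait AllocationHints {
  /** Report that task `taskIndex` of `stageId` is blacklisted on every
    * live executor, so the allocation target should be raised by one
    * even though the pending-task count alone wouldn't justify it. */
  def notifyUnschedulableTask(stageId: Int, taskIndex: Int): Unit
}
{code}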

Thoughts?  Any interest in taking a stab at implementing this?

> Hang while enable blacklistExecutor and DynamicExecutorAllocator 
> -----------------------------------------------------------------
>
>                 Key: SPARK-15815
>                 URL: https://issues.apache.org/jira/browse/SPARK-15815
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 1.6.1
>            Reporter: SuYan
>            Priority: Minor
>
> Enable executor blacklisting with a blacklist time larger than 120s, and enable dynamic allocation with minExecutors = 0.
> 1. Assume only 1 task is left running, on executor A, and all other executors have timed out.
> 2. The task fails, so it will not be scheduled on executor A again while the blacklist time is in effect.
> 3. The {{ExecutorAllocationManager}} always requests targetNumExecutor = 1.  Since we already have executor A, oldTargetNumExecutor == targetNumExecutor == 1, so it will never add more executors, even after executor A times out.  It endlessly requests delta = 0 executors.
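
The arithmetic that gets stuck in (3) is easy to see with concrete numbers; this is just an illustration of the loop described above, not the actual {{ExecutorAllocationManager}} code:

{code}
// Illustration only -- not the real ExecutorAllocationManager logic.
val targetNumExecutor = 1      // one pending task => a target of 1
val oldTargetNumExecutor = 1   // executor A still counts toward the target,
                               // even though the task is blacklisted on it
val delta = targetNumExecutor - oldTargetNumExecutor   // always 0
// each round the manager requests delta more executors: 0, forever => hang
{code}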



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org