Posted to issues@spark.apache.org by "Imran Rashid (JIRA)" <ji...@apache.org> on 2016/07/14 17:07:20 UTC

[jira] [Commented] (SPARK-15815) Hang when blacklistExecutor and DynamicExecutorAllocator are enabled

    [ https://issues.apache.org/jira/browse/SPARK-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377283#comment-15377283 ] 

Imran Rashid commented on SPARK-15815:
--------------------------------------

[~SuYan] I've been thinking about this more, and I agree this is definitely a problem.  Neither SPARK-15865 nor SPARK-8425 will really solve the situation here.  SPARK-15865 does eliminate the indefinite hang, but you're right, it's a little silly to abort the job completely.

However, I can't think of a good solution.  One possibility: when we detect unschedulability, and dynamic allocation is on, we could actively kill an executor and start an unschedulability timeout (e.g. 1 minute).  If we still detect unschedulability after the timeout is up, then fail the taskset.  That way we hopefully give another executor time to come up and run the task.
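
Very roughly, I'm picturing something like the sketch below (pseudo-code just to make the idea concrete -- handleCompletelyBlacklisted, killBlacklistedExecutor, unschedulableSince etc. are made-up names, not existing scheduler APIs):

    // Hypothetical state in the scheduler (names made up for this sketch):
    var unschedulableSince: Option[Long] = None        // when we first saw the taskset stuck
    val unschedulableTimeoutMs = 60 * 1000L            // e.g. 1 minute

    // Hypothetical hook, called when a resource-offer round finds a task set whose
    // remaining tasks are blacklisted on every live executor.
    def handleCompletelyBlacklisted(taskSet: TaskSetManager): Unit = {
      if (dynamicAllocationEnabled) {
        killBlacklistedExecutor(taskSet)               // let the cluster manager replace it
        val now = clock.getTimeMillis()
        unschedulableSince match {
          case None =>
            unschedulableSince = Some(now)             // start the timeout
          case Some(start) if now - start > unschedulableTimeoutMs =>
            // Still stuck after the timeout: no replacement came up, give up on the task set.
            taskSet.abort("All remaining tasks are blacklisted on every executor and " +
              "no new executor came up within the timeout")
          case _ => // within the timeout, keep waiting for a new executor to register
        }
      } else {
        // Without dynamic allocation there is nothing to wait for.
        taskSet.abort("All remaining tasks are blacklisted on every executor")
      }
    }

unschedulableSince would also need to be reset as soon as the task set becomes schedulable again (e.g. a fresh executor registers), otherwise a later occurrence would abort too eagerly.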

This gets tricky in lots of situations, though -- what if the new executor that comes up is also bad (e.g., because of a bad disk, but there haven't been enough failures to trigger the node blacklist yet)?  What if you hit this situation with more than one executor?  Maybe this can be worked around, but I'd need to spend some time thinking through the scenarios.  Overall, the complexity makes me worried we'd be missing some cases.

You could also encounter something similar even without dynamic allocation, say if you just run your job with 1 executor.  You would fail the job rather than try to get a replacement executor, but maybe that isn't so bad since you're running with just one executor ...

ping [~kayousterhout] [~markhamstra] [~matei] for thoughts as well.

> Hang when blacklistExecutor and DynamicExecutorAllocator are enabled
> --------------------------------------------------------------------
>
>                 Key: SPARK-15815
>                 URL: https://issues.apache.org/jira/browse/SPARK-15815
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 1.6.1
>            Reporter: SuYan
>            Priority: Minor
>
> Enable the executor blacklist with a blacklist time larger than 120s, and enable dynamic allocation with minExecutors = 0.
> 1. Assume only 1 task is left running, on Executor A, and all other executors have already timed out and been removed as idle.
> 2. The task fails, so it will not be scheduled on Executor A again because of the blacklist time.
> 3. ExecutorAllocationManager keeps requesting targetNumExecutor = 1 executors.  Because we already have Executor A, oldTargetNumExecutor == targetNumExecutor == 1, so it will never add more executors... even if Executor A times out.  It ends up endlessly requesting a delta of 0 executors (see the sketch below).
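> In other words, the allocation loop does roughly the following (a simplified sketch, not the actual ExecutorAllocationManager code; the field and method names are stand-ins, and the config property names/values are only illustrative):
>
>   // Illustrative conf for the scenario above:
>   //   spark.dynamicAllocation.enabled=true
>   //   spark.dynamicAllocation.minExecutors=0
>   //   spark.scheduler.executorTaskBlacklistTime=120000
>   val maxNeeded = 1                                 // one pending task => "need" 1 executor
>   val oldTargetNumExecutor = targetNumExecutor      // currently 1 (Executor A is still alive)
>   targetNumExecutor = math.max(math.min(maxNeeded, maxExecutors), minExecutors)   // still 1
>   val delta = targetNumExecutor - oldTargetNumExecutor
>   if (delta > 0) requestExecutors(delta)            // delta == 1 - 1 == 0, so this never fires
>   // Executor A still counts toward the target, but the only remaining task is
>   // blacklisted on it, so nothing can ever run and the job hangs.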



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org