Posted to reviews@spark.apache.org by squito <gi...@git.apache.org> on 2016/05/20 21:20:21 UTC

[GitHub] spark pull request: [WIP] [SPARK-8426] Enhance Blacklist mechanism...

GitHub user squito opened a pull request:

    https://github.com/apache/spark/pull/13234

    [WIP] [SPARK-8426] Enhance Blacklist mechanism for fault-tolerance

    ## What changes were proposed in this pull request?
    
    Update of https://github.com/apache/spark/pull/8760 by @mwws.  The current blacklist mechanism only considers one task at a time -- this PR expands that in two ways:
    1. When we determine an executor is bad, we blacklist *all* tasks from that executor, both within the task set and in subsequent task sets.
    2. When many executors on a node appear to be bad, we blacklist the entire node.
    
    ## How was this patch tested?
    
    Unit tests via Jenkins.
    I also ran the additional tests proposed [here](https://github.com/apache/spark/pull/8559), which include blacklist tests.
    
    TODO:
    [ ] performance tests
    [ ] more internal comments (in particular on concurrency)
    [ ] manual testing on a cluster

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/squito/spark blacklist-SPARK-8426

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13234.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13234
    
----
commit 975a2a3c2b810f6b462eb46813075aac4928c0ae
Author: mwws <we...@intel.com>
Date:   2015-12-29T06:01:17Z

    enhance blacklist mechanism
    
    1. create new BlacklistTracker and BlacklistStrategy interface to
    support complex use cases for the blacklist mechanism.
    2. make Yarn allocator aware of node blacklist information
    3. three strategies implemented for convenience; users can also
    define their own strategy
    SingleTaskStrategy: keeps the default behavior from before this change.
    AdvanceSingleTaskStrategy: enhances SingleTaskStrategy by supporting
    stage-level node blacklist
    ExecutorAndNodeStrategy: different taskSets can share blacklist
    information.

commit 51d3c88720faffd6a1fb6910b999cdce0d446bcf
Author: mwws <we...@intel.com>
Date:   2016-01-13T05:43:46Z

    change import order to meet new scala style check rule

commit 7e52311bcf4b5528d127d1d0a16bade7c039517e
Author: mwws <we...@intel.com>
Date:   2016-02-23T05:28:56Z

    simplify code and fix typo
    
    1. fix compile error after rebase to latest codebase.
    2. simplify configuration.
    3. fix typo.
    4. enhance comment and unit test.
    5. remove unused import.
    6. remove ExecutorAndNode strategy.

commit b600604a0920054cf3b33bff047d84cbd302fb3c
Author: Imran Rashid <ir...@cloudera.com>
Date:   2016-05-10T17:49:05Z

    style

commit 45525a118db078f80b3e0e74abe7d7f2e04a7883
Author: Imran Rashid <ir...@cloudera.com>
Date:   2016-05-10T19:27:39Z

    small refactoring

commit f6bb6de673cae7058c26d2f124d3de0d2eb5b06b
Author: Imran Rashid <ir...@cloudera.com>
Date:   2016-05-20T21:09:13Z

    Merge branch 'master' into blacklist-SPARK-8426

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [WIP] [SPARK-8426] Enhance Blacklist mechanism...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13234#issuecomment-220741240
  
    Merged build finished. Test PASSed.


[GitHub] spark pull request: [WIP] [SPARK-8426] Enhance Blacklist mechanism...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the pull request:

    https://github.com/apache/spark/pull/13234#issuecomment-220730341
  
    For the performance tests, I've collected data here: https://github.com/squito/spark/pull/5 (for lack of a better place).  The brief summary here: the advanced strategy is indeed much slower, but I don't know why yet.


[GitHub] spark pull request: [WIP] [SPARK-8426] Enhance Blacklist mechanism...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13234#issuecomment-221797436
  
    Merged build finished. Test FAILed.


[GitHub] spark pull request: [WIP] [SPARK-8426] Enhance Blacklist mechanism...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13234#issuecomment-220741242
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59031/


[GitHub] spark pull request: [WIP] [SPARK-8426] Enhance Blacklist mechanism...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13234#issuecomment-221783307
  
    **[Test build #59344 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59344/consoleFull)** for PR 13234 at commit [`8f2534b`](https://github.com/apache/spark/commit/8f2534b1d4d90f1ed42c695a77f5a2fa588d3428).


[GitHub] spark pull request: [WIP] [SPARK-8426] Enhance Blacklist mechanism...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13234#issuecomment-220741021
  
    **[Test build #59031 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59031/consoleFull)** for PR 13234 at commit [`f6bb6de`](https://github.com/apache/spark/commit/f6bb6de673cae7058c26d2f124d3de0d2eb5b06b).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #13234: [WIP] [SPARK-8426] Enhance Blacklist mechanism for fault...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/13234
  
    (closing till this is in a better state to avoid triggering tests)


[GitHub] spark pull request: [WIP] [SPARK-8426] Enhance Blacklist mechanism...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13234#discussion_r64698507
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
    @@ -249,10 +249,16 @@ private[spark] class TaskSchedulerImpl(
           availableCpus: Array[Int],
           tasks: Seq[ArrayBuffer[TaskDescription]]) : Boolean = {
         var launchedTask = false
    +    // TODO unit test, and also add executor-stage filtering as well
    +    // This is an optimization -- the taskSet might contain a very long list of pending tasks.
    +    // Rather than wasting time checking the offer against each task, and then realizing the
    +    // executor is blacklisted, just filter out the bad executor immediately.
    +    val nodeBlacklist = taskSet.blacklistTracker.map{_.nodeBlacklistForStage(taskSet.stageId)}
    +      .getOrElse(Set())
    --- End diff --
    
    Before this change, there is an `O(n^2)` cost (where `n` is the number of pending tasks) when you've got one bad executor.  The tasks assigned to the bad executor fail, but then we get another resource offer for the same bad executor.  So we find another task for it, that task fails too, and the process continues until the bad executor has gone through all of the pending tasks.  Each time we respond to the resource offer, we need to (a) iterate through the list of tasks to find one that is *not* blacklisted and (b) then remove it from the task list.  Those are both `O(1)` operations when there isn't any blacklisting -- we just pop the last task off the stack.  But as our bad executor makes its way through the tasks, it has to go deeper into the list each time, and both searching the list and then removing an element from it become expensive.  Since the k-th offer has to skip roughly k already-blacklisted tasks, the total work over `n` offers is on the order of `n^2`.
    
    After we've gone through *all* of the tasks for the bad executor once, we then wait for resource offers from good executors.  However, even once we start scheduling on the good executors, scheduling as a whole is still much slower, because we still pay an `O(n)` cost at each call to `resourceOffer`.  The offer still includes the (now idle) bad executor, and we have to iterate through the entire list of pending tasks to decide that nope, there aren't any tasks we can schedule on that node.
    
    In my performance tests with a 3k task job, this leads to about a 10x slowdown, but obviously this depends a lot on the number of tasks.  But that is the really scary thing -- it's not a function of how many bad nodes you have, but of how many tasks you are trying to run.  So on a large cluster, where a bad node is more likely, and lots of tasks are more likely, the slowdown will be much worse.
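
The scan cost described above can be made concrete with a small counting model. It is a deliberate simplification and not Spark's `TaskSetManager`: the class `PendingTasksModel` is hypothetical, only the bad executor receives offers, failed tasks stay in the pending list, and the cost of removing an element from the list is ignored. It only counts how many pending-task entries get inspected, and says nothing about wall-clock time.

```scala
// Hypothetical cost model: counts inspected entries, nothing more.
final class PendingTasksModel(numTasks: Int) {
  // failedOnBadExecutor(i) is true once task i has failed on the bad executor,
  // i.e. the task is blacklisted for that executor but still pending overall.
  private val failedOnBadExecutor = Array.fill(numTasks)(false)

  /** Handles one offer from the bad executor; returns how many entries were inspected. */
  def offerFromBadExecutor(): Int = {
    var inspected = 0
    var i = numTasks - 1 // start at the top of the "stack"
    var found = false
    while (i >= 0 && !found) {
      inspected += 1
      if (!failedOnBadExecutor(i)) {
        // The task is launched on the bad executor, fails, and stays pending.
        failedOnBadExecutor(i) = true
        found = true
      }
      i -= 1
    }
    inspected
  }
}
```

In this model the k-th offer inspects k entries, so going through 3,000 tasks once costs 3000 * 3001 / 2 = 4,501,500 inspections, and every later offer from the idle bad executor scans all 3,000 entries without finding anything to run.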
    
    Note that as implemented in this version of the patch, this slowdown is only avoided when we blacklist the entire node.  But we should add blacklisting for an executor as well, to avoid the slowdown in that case also.
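
A rough sketch of the early filter, extended to executors as the last paragraph suggests. The `Offer` case class and `usableOffers` are hypothetical stand-ins for illustration; in the diff the node set comes from `nodeBlacklistForStage`, while the executor set here is an assumption about how the follow-up might look.

```scala
// Hypothetical stand-in for a resource offer; Spark's own offer type may differ.
final case class Offer(executorId: String, host: String, cores: Int)

object OfferFilterSketch {
  /**
   * Drops offers from blacklisted nodes or executors before any per-task
   * matching. Each check is a set lookup, so the cost depends on the number
   * of offers and not on the number of pending tasks.
   */
  def usableOffers(
      offers: Seq[Offer],
      nodeBlacklist: Set[String],
      executorBlacklist: Set[String]): Seq[Offer] =
    offers.filterNot { o =>
      nodeBlacklist.contains(o.host) || executorBlacklist.contains(o.executorId)
    }
}
```

Filtering up front means an idle bad executor no longer triggers a full scan of the pending tasks on every offer; the scheduler only walks the task list for offers that could actually be used.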


[GitHub] spark pull request: [WIP] [SPARK-8426] Enhance Blacklist mechanism...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/13234#issuecomment-221797440
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59344/


[GitHub] spark pull request #13234: [WIP] [SPARK-8426] Enhance Blacklist mechanism fo...

Posted by squito <gi...@git.apache.org>.
Github user squito closed the pull request at:

    https://github.com/apache/spark/pull/13234


[GitHub] spark pull request: [WIP] [SPARK-8426] Enhance Blacklist mechanism...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13234#issuecomment-221797211
  
    **[Test build #59344 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59344/consoleFull)** for PR 13234 at commit [`8f2534b`](https://github.com/apache/spark/commit/8f2534b1d4d90f1ed42c695a77f5a2fa588d3428).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [WIP] [SPARK-8426] Enhance Blacklist mechanism...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/13234#issuecomment-220721708
  
    **[Test build #59031 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59031/consoleFull)** for PR 13234 at commit [`f6bb6de`](https://github.com/apache/spark/commit/f6bb6de673cae7058c26d2f124d3de0d2eb5b06b).

