You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by Ngone51 <gi...@git.apache.org> on 2018/04/06 16:02:07 UTC

[GitHub] spark pull request #20998: [SPARK-23888][CORE] speculative task should not r...

GitHub user Ngone51 opened a pull request:

    https://github.com/apache/spark/pull/20998

    [SPARK-23888][CORE] speculative task should not run on a given host where another attempt is already running on

    ## What changes were proposed in this pull request?
    
    There's a bug in:
    ```
    /** Check whether a task is currently running an attempt on a given host */
     private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = {
       taskAttempts(taskIndex).exists(_.host == host)
     }
    ```
    This will ignore hosts which only have finished attempts, so we should check whether the attempt is currently running on the given host. 
    
    And it is possible for a speculative task to run on a host where another attempt failed here before.
    Assume we have only two machines: host1, host2.  We first run task0.0 on host1. Then, due to  a long time waiting for task0.0, we launch a speculative task0.1 on host2. And, task0.1 finally failed on host1, but it can not re-run since there's already  a copy running on host2. After another long time waiting, we launch a new  speculative task0.2. And, now, we can run task0.2 on host1 again, since there's no more running attempt on host1.
    
    ## How was this patch tested?
    
    Added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Ngone51/spark SPARK-23888

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20998.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20998
    
----
commit 3584a09410d72207d8a00dcbf385b2e276925fc8
Author: wuyi <ng...@...>
Date:   2018-04-06T15:34:18Z

    add task attempt running check in hasAttemptOnHost

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] correct the comment of hasAttemptOnH...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    **[Test build #89460 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89460/testReport)** for PR 20998 at commit [`0c6f305`](https://github.com/apache/spark/commit/0c6f3058a5c0af4a6e9cd1a90d43230805305df5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Adding isRunning can cause a single 'bad' node (from task pov - not necessarily only bad hardware: just that task fails on node) can keep tasks to fail repeatedly causing app to exit.
    
    Particularly with blacklist'ing, I am not very sure how the interactions will play out .. @squito might have more comments.
    In general, this is not a benign change imo and can have non trivial side effects.
    
    In the specific usecase of only two machines, it is an unfortunate side effect.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    @squito My concern is, in large workloads, some nodes simply become bad for some tasks (transient env or hardware issues, colocating containers, etc) while being fine for others; speculative tasks should alleviate performance concerns and not increase chances of application failure due to locality preference affinity. For cluster sizes which are very small, speculative execution is less relevant than for those which are large - and here we are tuning for the former.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20998: [SPARK-23888][CORE] speculative task should not r...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20998#discussion_r180443917
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
    @@ -880,6 +880,59 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
         assert(manager.resourceOffer("execB", "host2", ANY).get.index === 3)
       }
     
    +  test("speculative task should not run on a given host where another attempt " +
    +    "is already running on") {
    +    sc = new SparkContext("local", "test")
    +    sched = new FakeTaskScheduler(
    +      sc, ("execA", "host1"), ("execB", "host2"))
    +    val taskSet = FakeTask.createTaskSet(1,
    +      Seq(TaskLocation("host1", "execA"), TaskLocation("host2", "execB")))
    +    val clock = new ManualClock
    +    val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES, clock = clock)
    +
    +    // let task0.0 run on host1
    +    assert(manager.resourceOffer("execA", "host1", PROCESS_LOCAL).get.index == 0)
    +    val info1 = manager.taskAttempts(0)(0)
    +    assert(info1.running === true)
    +    assert(info1.host === "host1")
    +
    +    // long time elapse, and task0.0 is still running,
    +    // so we launch a speculative task0.1 on host2
    +    clock.advance(1000)
    +    manager.speculatableTasks += 0
    +    assert(manager.resourceOffer("execB", "host2", PROCESS_LOCAL).get.index === 0)
    +    val info2 = manager.taskAttempts(0)(0)
    +    assert(info2.running === true)
    +    assert(info2.host === "host2")
    +    assert(manager.speculatableTasks.size === 0)
    +
    +    // now, task0 has two copies running on host1, host2 separately,
    +    // so we can not launch a speculative task on any hosts.
    +    manager.speculatableTasks += 0
    +    assert(manager.resourceOffer("execA", "host1", PROCESS_LOCAL) === None)
    +    assert(manager.resourceOffer("execB", "host2", PROCESS_LOCAL) === None)
    +    assert(manager.speculatableTasks.size === 1)
    +
    +    // after a long long time, task0.0 failed, and task0.0 can not re-run since
    +    // there's already a running copy.
    +    clock.advance(1000)
    +    info1.finishTime = clock.getTimeMillis()
    --- End diff --
    
    it would be better here for you to call `manager.handleFailedTask`, to more accurately simulate the real behavior, and also makes the purpose of a test a little more clear.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] correct the comment of hasAttemptOnH...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    LGTM


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20998: [SPARK-23888][CORE] speculative task should not r...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20998#discussion_r180784912
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
    @@ -880,8 +880,8 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
         assert(manager.resourceOffer("execB", "host2", ANY).get.index === 3)
       }
     
    -  test("speculative task should not run on a given host where another attempt " +
    -    "is already running on") {
    +  test("SPARK-23888: speculative task should not run on a given host " +
    +    "where another attempt is already running on") {
    --- End diff --
    
    I'd reword this to be a bit more specific to what you're trying to test: 
    
    speculative task cannot run on host with another running attempt, but can run on a host with a failed attempt.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    @Ngone51 can you instead leave the behavior as is, and just update the comment?
    
    Sorry that its going to be a small change in the end, and all the extra work the bad comments led you to do, but still appreciate you noticing this and fixing.  a good PR with a quality test too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89164/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] correct the comment of hasAttemptOnH...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20998: [SPARK-23888][CORE] correct the comment of hasAtt...

Posted by Ngone51 <gi...@git.apache.org>.
Github user Ngone51 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20998#discussion_r183408481
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
    @@ -287,7 +287,7 @@ private[spark] class TaskSetManager(
         None
       }
     
    -  /** Check whether a task is currently running an attempt on a given host */
    +  /** Check whether a task once run an attempt on a given host */
    --- End diff --
    
    Yes. Thank you.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] correct the comment of hasAttemptOnH...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    **[Test build #89728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89728/testReport)** for PR 20998 at commit [`e44d80b`](https://github.com/apache/spark/commit/e44d80b70985f9be23ccafd0868e29c946ab4a5a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89026/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    **[Test build #89228 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89228/testReport)** for PR 20998 at commit [`5901728`](https://github.com/apache/spark/commit/5901728d6be8ad33e39d56006a2bc8cc02cfff38).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    **[Test build #89026 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89026/testReport)** for PR 20998 at commit [`3584a09`](https://github.com/apache/spark/commit/3584a09410d72207d8a00dcbf385b2e276925fc8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] correct the comment of hasAttemptOnH...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    **[Test build #89728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89728/testReport)** for PR 20998 at commit [`e44d80b`](https://github.com/apache/spark/commit/e44d80b70985f9be23ccafd0868e29c946ab4a5a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by Ngone51 <gi...@git.apache.org>.
Github user Ngone51 commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Hi, @squito . Thank for review and comments.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] correct the comment of hasAttemptOnH...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    merged to master, thanks @Ngone51  .  I also updated the commit msg some before committing, I thought it best to focus on the eventual change, figured it wasn't worth bugging you for another update cycle.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    **[Test build #89026 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89026/testReport)** for PR 20998 at commit [`3584a09`](https://github.com/apache/spark/commit/3584a09410d72207d8a00dcbf385b2e276925fc8).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20998: [SPARK-23888][CORE] speculative task should not r...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20998#discussion_r180439559
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
    @@ -880,6 +880,59 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
         assert(manager.resourceOffer("execB", "host2", ANY).get.index === 3)
       }
     
    +  test("speculative task should not run on a given host where another attempt " +
    +    "is already running on") {
    +    sc = new SparkContext("local", "test")
    +    sched = new FakeTaskScheduler(
    +      sc, ("execA", "host1"), ("execB", "host2"))
    +    val taskSet = FakeTask.createTaskSet(1,
    +      Seq(TaskLocation("host1", "execA"), TaskLocation("host2", "execB")))
    +    val clock = new ManualClock
    +    val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES, clock = clock)
    +
    +    // let task0.0 run on host1
    +    assert(manager.resourceOffer("execA", "host1", PROCESS_LOCAL).get.index == 0)
    +    val info1 = manager.taskAttempts(0)(0)
    +    assert(info1.running === true)
    +    assert(info1.host === "host1")
    +
    +    // long time elapse, and task0.0 is still running,
    +    // so we launch a speculative task0.1 on host2
    +    clock.advance(1000)
    +    manager.speculatableTasks += 0
    +    assert(manager.resourceOffer("execB", "host2", PROCESS_LOCAL).get.index === 0)
    +    val info2 = manager.taskAttempts(0)(0)
    +    assert(info2.running === true)
    +    assert(info2.host === "host2")
    +    assert(manager.speculatableTasks.size === 0)
    +
    +    // now, task0 has two copies running on host1, host2 separately,
    +    // so we can not launch a speculative task on any hosts.
    +    manager.speculatableTasks += 0
    +    assert(manager.resourceOffer("execA", "host1", PROCESS_LOCAL) === None)
    +    assert(manager.resourceOffer("execB", "host2", PROCESS_LOCAL) === None)
    +    assert(manager.speculatableTasks.size === 1)
    +
    +    // after a long long time, task0.0 failed, and task0.0 can not re-run since
    +    // there's already a running copy.
    +    clock.advance(1000)
    +    info1.finishTime = clock.getTimeMillis()
    +    assert(info1.running === false)
    --- End diff --
    
    `assert(!info1.running)`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20998: [SPARK-23888][CORE] speculative task should not r...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20998#discussion_r180439612
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
    @@ -880,6 +880,59 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
         assert(manager.resourceOffer("execB", "host2", ANY).get.index === 3)
       }
     
    +  test("speculative task should not run on a given host where another attempt " +
    +    "is already running on") {
    +    sc = new SparkContext("local", "test")
    +    sched = new FakeTaskScheduler(
    +      sc, ("execA", "host1"), ("execB", "host2"))
    +    val taskSet = FakeTask.createTaskSet(1,
    +      Seq(TaskLocation("host1", "execA"), TaskLocation("host2", "execB")))
    +    val clock = new ManualClock
    +    val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES, clock = clock)
    +
    +    // let task0.0 run on host1
    +    assert(manager.resourceOffer("execA", "host1", PROCESS_LOCAL).get.index == 0)
    +    val info1 = manager.taskAttempts(0)(0)
    +    assert(info1.running === true)
    +    assert(info1.host === "host1")
    +
    +    // long time elapse, and task0.0 is still running,
    +    // so we launch a speculative task0.1 on host2
    +    clock.advance(1000)
    +    manager.speculatableTasks += 0
    +    assert(manager.resourceOffer("execB", "host2", PROCESS_LOCAL).get.index === 0)
    +    val info2 = manager.taskAttempts(0)(0)
    +    assert(info2.running === true)
    +    assert(info2.host === "host2")
    +    assert(manager.speculatableTasks.size === 0)
    +
    +    // now, task0 has two copies running on host1, host2 separately,
    +    // so we can not launch a speculative task on any hosts.
    +    manager.speculatableTasks += 0
    +    assert(manager.resourceOffer("execA", "host1", PROCESS_LOCAL) === None)
    +    assert(manager.resourceOffer("execB", "host2", PROCESS_LOCAL) === None)
    +    assert(manager.speculatableTasks.size === 1)
    +
    +    // after a long long time, task0.0 failed, and task0.0 can not re-run since
    +    // there's already a running copy.
    +    clock.advance(1000)
    +    info1.finishTime = clock.getTimeMillis()
    +    assert(info1.running === false)
    +
    +    // time goes on, and task0.1 is still running
    +    clock.advance(1000)
    +    // so we try to launch a new speculative task
    +    // we can not run it on host2, because task0.1 is already running on
    +    assert(manager.resourceOffer("execB", "host2", PROCESS_LOCAL) === None)
    +    // we successfully launch a speculative task0.2 on host1, since there's
    +    // no more running copy of task0
    +    assert(manager.resourceOffer("execA", "host1", PROCESS_LOCAL).get.index === 0)
    +    val info3 = manager.taskAttempts(0)(0)
    +    assert(info3.running === true)
    --- End diff --
    
    `assert(info3.running)`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    I'm not even really concerned about the case for two hosts -- I agree its fine if we do something sub-optimal.  I'm more concerned about code-clarity and the behavior in general.  It seems cleaner to me if speculation doesn't worry about where its failed before, and those exclusions are left to the blacklist.
    
    But it sounds like you're saying the prior behavior was really desirable -- you think its better if speculation always excludes hosts that task has ever failed on?  I'm happy to defer to your opinion on this, I haven't really stressed speculative execution yet.  Then lets just change that comment in the code to be consistent.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by Ngone51 <gi...@git.apache.org>.
Github user Ngone51 commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    ping @pwendell @kayousterhout . pls help review, thanks :)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] correct the comment of hasAttemptOnH...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89460/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20998: [SPARK-23888][CORE] speculative task should not r...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20998#discussion_r180784374
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
    @@ -880,6 +880,59 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
         assert(manager.resourceOffer("execB", "host2", ANY).get.index === 3)
       }
     
    +  test("speculative task should not run on a given host where another attempt " +
    +    "is already running on") {
    +    sc = new SparkContext("local", "test")
    +    sched = new FakeTaskScheduler(
    +      sc, ("execA", "host1"), ("execB", "host2"))
    +    val taskSet = FakeTask.createTaskSet(1,
    +      Seq(TaskLocation("host1", "execA"), TaskLocation("host2", "execB")))
    +    val clock = new ManualClock
    +    val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES, clock = clock)
    +
    +    // let task0.0 run on host1
    +    assert(manager.resourceOffer("execA", "host1", PROCESS_LOCAL).get.index == 0)
    +    val info1 = manager.taskAttempts(0)(0)
    +    assert(info1.running === true)
    +    assert(info1.host === "host1")
    +
    +    // long time elapse, and task0.0 is still running,
    +    // so we launch a speculative task0.1 on host2
    +    clock.advance(1000)
    +    manager.speculatableTasks += 0
    +    assert(manager.resourceOffer("execB", "host2", PROCESS_LOCAL).get.index === 0)
    +    val info2 = manager.taskAttempts(0)(0)
    +    assert(info2.running === true)
    +    assert(info2.host === "host2")
    +    assert(manager.speculatableTasks.size === 0)
    +
    +    // now, task0 has two copies running on host1, host2 separately,
    +    // so we can not launch a speculative task on any hosts.
    +    manager.speculatableTasks += 0
    +    assert(manager.resourceOffer("execA", "host1", PROCESS_LOCAL) === None)
    +    assert(manager.resourceOffer("execB", "host2", PROCESS_LOCAL) === None)
    +    assert(manager.speculatableTasks.size === 1)
    +
    +    // after a long long time, task0.0 failed, and task0.0 can not re-run since
    +    // there's already a running copy.
    +    clock.advance(1000)
    +    info1.finishTime = clock.getTimeMillis()
    --- End diff --
    
    you shouldn't need to set `info.finishTime` anymore, that should be taken care of by `manager.handleFailedTask`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    **[Test build #89164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89164/testReport)** for PR 20998 at commit [`2ed9584`](https://github.com/apache/spark/commit/2ed958418ff182bca0a3af1bf35999130312e78f).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by Ngone51 <gi...@git.apache.org>.
Github user Ngone51 commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Hi, @mridulm, thank for your comment. Actually, I have the same worry with you. May be we can make this change as a second choice for `hasAttemptOnHost `, in case of there's really no other hosts to select.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by Ngone51 <gi...@git.apache.org>.
Github user Ngone51 commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Will do, and it's okay. 
    My view limited in the source code yet, but you guys have more practical experience. So I learned from your points. It's beneficial.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Jenkins, ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    **[Test build #89228 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89228/testReport)** for PR 20998 at commit [`5901728`](https://github.com/apache/spark/commit/5901728d6be8ad33e39d56006a2bc8cc02cfff38).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    **[Test build #89164 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89164/testReport)** for PR 20998 at commit [`2ed9584`](https://github.com/apache/spark/commit/2ed958418ff182bca0a3af1bf35999130312e78f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89228/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20998: [SPARK-23888][CORE] correct the comment of hasAtt...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20998


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    @squito I completely agree that the comment is inaccurate.
    Note that this is for a specific taskset, so impact is limited to that taskset (w.r.t using executors for spec exec)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] correct the comment of hasAttemptOnH...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    **[Test build #89460 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89460/testReport)** for PR 20998 at commit [`0c6f305`](https://github.com/apache/spark/commit/0c6f3058a5c0af4a6e9cd1a90d43230805305df5).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] correct the comment of hasAttemptOnH...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89728/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] correct the comment of hasAttemptOnH...

Posted by Ngone51 <gi...@git.apache.org>.
Github user Ngone51 commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Agree and thank you @squito .
    
    And thanks for all of you. @felixcheung @mridulm @jiangxb1987 @srowen 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by felixcheung <gi...@git.apache.org>.
Github user felixcheung commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    sounds fair, but shouldn't this be up to the scheduler backend? multiple tasks/attempts can run simultaneously on the same physical host?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by Ngone51 <gi...@git.apache.org>.
Github user Ngone51 commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Hi, @felixcheung , thank for trigger a task and your comments.
     > shouldn't this be up to the scheduler backend? 
    
    Actually, it is `TaskSchedulerImpl` who holds a thread to check whether there are any speculative tasks need to be scheduled periodically. If any, then, call `backend.reviveOffers` to offer resources  . But, it's `TaskSetManager` who decides whether we need to launch a speculative task for a certain task.
    
    > multiple tasks/attempts can run simultaneously on the same physical host?
    
    I think multiple task attempts(actually, speculative tasks) can run on the sam physical host, but not simultaneously, as long as there's no running attempt on it. In PR description, I illustrate a case which a speculative task chose to run on a host, where a previous task attempts have been run on, but failed finally. I think if the task's failure is not relevant to the host, 'run on the same host' can be acceptable. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20998: [SPARK-23888][CORE] speculative task should not r...

Posted by Ngone51 <gi...@git.apache.org>.
Github user Ngone51 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20998#discussion_r180937626
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
    @@ -880,8 +880,8 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
         assert(manager.resourceOffer("execB", "host2", ANY).get.index === 3)
       }
     
    -  test("speculative task should not run on a given host where another attempt " +
    -    "is already running on") {
    +  test("SPARK-23888: speculative task should not run on a given host " +
    +    "where another attempt is already running on") {
    --- End diff --
    
    Sure. Also, do we need to reword PR and jira title? @squito 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20998: [SPARK-23888][CORE] correct the comment of hasAtt...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20998#discussion_r183406175
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
    @@ -287,7 +287,7 @@ private[spark] class TaskSetManager(
         None
       }
     
    -  /** Check whether a task is currently running an attempt on a given host */
    +  /** Check whether a task once run an attempt on a given host */
    --- End diff --
    
    Should this be "once ran"?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #20998: [SPARK-23888][CORE] speculative task should not r...

Posted by Ngone51 <gi...@git.apache.org>.
Github user Ngone51 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20998#discussion_r180614873
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
    @@ -880,6 +880,59 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
         assert(manager.resourceOffer("execB", "host2", ANY).get.index === 3)
       }
     
    +  test("speculative task should not run on a given host where another attempt " +
    +    "is already running on") {
    +    sc = new SparkContext("local", "test")
    +    sched = new FakeTaskScheduler(
    +      sc, ("execA", "host1"), ("execB", "host2"))
    +    val taskSet = FakeTask.createTaskSet(1,
    +      Seq(TaskLocation("host1", "execA"), TaskLocation("host2", "execB")))
    +    val clock = new ManualClock
    +    val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES, clock = clock)
    +
    +    // let task0.0 run on host1
    +    assert(manager.resourceOffer("execA", "host1", PROCESS_LOCAL).get.index == 0)
    +    val info1 = manager.taskAttempts(0)(0)
    +    assert(info1.running === true)
    +    assert(info1.host === "host1")
    +
    +    // long time elapse, and task0.0 is still running,
    +    // so we launch a speculative task0.1 on host2
    +    clock.advance(1000)
    +    manager.speculatableTasks += 0
    +    assert(manager.resourceOffer("execB", "host2", PROCESS_LOCAL).get.index === 0)
    +    val info2 = manager.taskAttempts(0)(0)
    +    assert(info2.running === true)
    +    assert(info2.host === "host2")
    +    assert(manager.speculatableTasks.size === 0)
    +
    +    // now, task0 has two copies running on host1, host2 separately,
    +    // so we can not launch a speculative task on any hosts.
    +    manager.speculatableTasks += 0
    +    assert(manager.resourceOffer("execA", "host1", PROCESS_LOCAL) === None)
    +    assert(manager.resourceOffer("execB", "host2", PROCESS_LOCAL) === None)
    +    assert(manager.speculatableTasks.size === 1)
    +
    +    // after a long long time, task0.0 failed, and task0.0 can not re-run since
    +    // there's already a running copy.
    +    clock.advance(1000)
    +    info1.finishTime = clock.getTimeMillis()
    --- End diff --
    
    nice suggestion.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] correct the comment of hasAttemptOnH...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20998: [SPARK-23888][CORE] speculative task should not run on a...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/20998
  
    @mridulm more thoughts?  I think this is the right change but I will leave open for a bit to get more input


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org