Posted to reviews@spark.apache.org by tgravescs <gi...@git.apache.org> on 2018/08/02 17:39:52 UTC

[GitHub] spark pull request #21976: [SPARK-24909] Spark scheduler can hang when fetch...

GitHub user tgravescs opened a pull request:

    https://github.com/apache/spark/pull/21976

    [SPARK-24909] Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts
    
    ## What changes were proposed in this pull request?
    This PR reverts the change from SPARK-19263 so that the scheduler always does shuffleStage.pendingPartitions -= task.partitionId. The change in SPARK-23433 should fix the issue originally reported in SPARK-19263.
    
    ## How was this patch tested?
    
    Unit tests. The condition depends on a race that I haven't been able to reproduce on demand; I just see it sometimes on customers' jobs in a real cluster.
    I am also working on adding spark scheduler integration tests.
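
As a rough illustration of the behaviour described above, here is a minimal sketch. It is not the DAGScheduler code: `ToyShuffleStage`, `ToyCompletion` and both handler functions are hypothetical names, and the attempt-id check in `handleLatestAttemptOnly` is only an assumption about what the reverted SPARK-19263 logic looked like. The one piece taken from the description is `pendingPartitions -= task.partitionId`.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a shuffle map stage; not Spark's ShuffleMapStage.
final class ToyShuffleStage(numPartitions: Int) {
  // Partitions for which no successful task completion has been recorded yet.
  val pendingPartitions: mutable.HashSet[Int] = mutable.HashSet.empty[Int]
  pendingPartitions ++= (0 until numPartitions)
}

// Hypothetical completion event: the partition that finished and the
// stage attempt the finishing task belonged to.
final case class ToyCompletion(partitionId: Int, stageAttemptId: Int)

object PendingPartitionsSketch {
  // Behaviour described in the PR: every successful completion unregisters
  // its partition, whichever stage attempt the task came from.
  def handleAlways(stage: ToyShuffleStage, event: ToyCompletion): Unit =
    stage.pendingPartitions -= event.partitionId

  // ASSUMPTION: the reverted behaviour only unregistered partitions for
  // tasks that belonged to the latest stage attempt.
  def handleLatestAttemptOnly(
      stage: ToyShuffleStage,
      event: ToyCompletion,
      latestAttemptId: Int): Unit =
    if (event.stageAttemptId == latestAttemptId) {
      stage.pendingPartitions -= event.partitionId
    }

  def main(args: Array[String]): Unit = {
    // A task from the older attempt 0 finishes partition 1 while attempt 1 is the latest.
    val always = new ToyShuffleStage(2)
    handleAlways(always, ToyCompletion(partitionId = 1, stageAttemptId = 0))
    println(always.pendingPartitions.toSeq.sorted.mkString(","))     // prints "0"

    val latestOnly = new ToyShuffleStage(2)
    handleLatestAttemptOnly(latestOnly, ToyCompletion(1, 0), latestAttemptId = 1)
    println(latestOnly.pendingPartitions.toSeq.sorted.mkString(",")) // prints "0,1"
  }
}
```

In the second case the partition stays pending even though a task for it has finished. In this toy model, that is the kind of bookkeeping mismatch that could leave a stage waiting.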

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tgravescs/spark SPARK-24909

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21976.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21976
    
----
commit 82243746fb8709c925bea97c25cb57c82cec8c2f
Author: Thomas Graves <tg...@...>
Date:   2018-08-02T17:37:00Z

    [SPARK-24909] Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts

commit 54646148730462e34a32d81200530cf50dbf7a51
Author: Thomas Graves <tg...@...>
Date:   2018-08-02T17:39:08Z

    add log message back

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Just an FYI, the other JIRA is https://issues.apache.org/jira/browse/SPARK-25250; it's related to a race with SPARK-23433.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    @tgravescs did you have a chance to figure out what you mention above?
    
    (Test failure above seems unrelated to the change, too.)


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    thanks @vanzin cherry-picked into 2.3


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    test this please


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Looks like we're good. Merging to master.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21976: [SPARK-24909] Spark scheduler can hang when fetch failur...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    **[Test build #94039 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94039/testReport)** for PR 21976 at commit [`5464614`](https://github.com/apache/spark/commit/54646148730462e34a32d81200530cf50dbf7a51).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95307/
    Test FAILed.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2585/
    Test PASSed.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    @tgravescs I did not merge to 2.3 even though it says it affects 2.3.1; feel free to do it if you think it should be there.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2620/
    Test PASSed.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2574/
    Test PASSed.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    **[Test build #95290 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95290/testReport)** for PR 21976 at commit [`e384245`](https://github.com/apache/spark/commit/e384245f7b0c6c43e6e0e0f7b73528b5c355e2f1).


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Please hold off on merging this; I am investigating a weirdness that I want to make sure isn't caused by this change.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    **[Test build #95307 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95307/testReport)** for PR 21976 at commit [`e384245`](https://github.com/apache/spark/commit/e384245f7b0c6c43e6e0e0f7b73528b5c355e2f1).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21976: [SPARK-24909] Spark scheduler can hang when fetch failur...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21976: [SPARK-24909] Spark scheduler can hang when fetch failur...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21976: [SPARK-24909] Spark scheduler can hang when fetch failur...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2384/
    Test PASSed.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95353/
    Test PASSed.


---



[GitHub] spark issue #21976: [SPARK-24909] Spark scheduler can hang when fetch failur...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Something like "[SPARK-24909][core] Always unregister pending partition on task completion." which is what the code is now doing. You can have the explanation of what this is fixing in the description, which can be longer than the title.


---



[GitHub] spark issue #21976: [SPARK-24909] Spark scheduler can hang when fetch failur...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94039/
    Test PASSed.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Still working on it and getting close; the case takes a while to reproduce, so it is taking a bit of time.


---



[GitHub] spark pull request #21976: [SPARK-24909][core] Always unregister pending par...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21976#discussion_r213319977
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
    @@ -2474,19 +2478,21 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
         runEvent(makeCompletionEvent(
           taskSets(3).tasks(0), Success, makeMapStatus("hostB", 2)))
     
    -    // There should be no new attempt of stage submitted,
    -    // because task(stageId=1, stageAttempt=1, partitionId=1) is still running in
    -    // the current attempt (and hasn't completed successfully in any earlier attempts).
    -    assert(taskSets.size === 4)
    +    // At this point there should be no active task set for stageId=1 and we need
    +    // to resubmit because the output from (stageId=1, stageAttemptId=0, partitionId=1)
    +    // was ignored due to executor failure
    +    assert(taskSets.size === 5)
    +    assert(taskSets(4).stageId === 1 && taskSets(4).stageAttemptId === 2
    +      && taskSets(4).tasks.size === 1)
     
    -    // Complete task(stageId=1, stageAttempt=1, partitionId=1) successfully.
    +    // Complete task(stageId=1, stageAttempt=2, partitionId=1) successfully.
         runEvent(makeCompletionEvent(
    -      taskSets(3).tasks(1), Success, makeMapStatus("hostB", 2)))
    +      taskSets(4).tasks(0), Success, makeMapStatus("hostB", 2)))
    --- End diff --
    
    https://issues.apache.org/jira/browse/SPARK-25263
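
The expectation in the updated assertions (a further stage attempt containing a single task) can be illustrated with a toy model. This is a minimal sketch, not the DAGScheduler or the test suite: `ToyStage`, `submitAttempt`, `onTaskSuccess` and the `outputUsable` flag are hypothetical names, and the sketch assumes a stage is resubmitted once its pending set empties while some map output is still missing.

```scala
import scala.collection.mutable

// Hypothetical model of a single shuffle map stage; names are illustrative only.
final class ToyStage(val numPartitions: Int) {
  // Partitions the current attempt is still waiting on.
  val pendingPartitions: mutable.HashSet[Int] = mutable.HashSet.empty[Int]
  // Partitions whose map output has been registered and is usable.
  val availableOutputs: mutable.HashSet[Int] = mutable.HashSet.empty[Int]
  var attemptsSubmitted: Int = 0

  def missingPartitions: Seq[Int] =
    (0 until numPartitions).filterNot(availableOutputs.contains)

  // Submit a new attempt that only runs tasks for partitions with no usable output.
  def submitAttempt(): Seq[Int] = {
    val toRun = missingPartitions
    pendingPartitions.clear()
    pendingPartitions ++= toRun
    attemptsSubmitted += 1
    toRun
  }

  // outputUsable = false models a successful task whose output is ignored,
  // for example because the executor that produced it has been lost.
  // Returns the partitions of a newly submitted attempt, if one was needed.
  def onTaskSuccess(partitionId: Int, outputUsable: Boolean): Option[Seq[Int]] = {
    // Always unregister the partition, as the PR title describes.
    pendingPartitions -= partitionId
    if (outputUsable) availableOutputs += partitionId
    // ASSUMPTION: nothing pending but output still missing means resubmit.
    if (pendingPartitions.isEmpty && missingPartitions.nonEmpty) Some(submitAttempt())
    else None
  }
}
```

For a two-partition stage, calling `submitAttempt()`, then `onTaskSuccess(1, outputUsable = false)`, then `onTaskSuccess(0, outputUsable = true)` makes the last call return a new attempt containing only partition 1. The attempt numbering in the real test differs, but the sketch follows the same pattern of one resubmitted task for the ignored output.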


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95034/
    Test FAILed.


---



[GitHub] spark pull request #21976: [SPARK-24909][core] Always unregister pending par...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21976#discussion_r213176636
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
    @@ -2474,19 +2478,21 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
         runEvent(makeCompletionEvent(
           taskSets(3).tasks(0), Success, makeMapStatus("hostB", 2)))
     
    -    // There should be no new attempt of stage submitted,
    -    // because task(stageId=1, stageAttempt=1, partitionId=1) is still running in
    -    // the current attempt (and hasn't completed successfully in any earlier attempts).
    -    assert(taskSets.size === 4)
    +    // At this point there should be no active task set for stageId=1 and we need
    +    // to resubmit because the output from (stageId=1, stageAttemptId=0, partitionId=1)
    +    // was ignored due to executor failure
    +    assert(taskSets.size === 5)
    +    assert(taskSets(4).stageId === 1 && taskSets(4).stageAttemptId === 2
    +      && taskSets(4).tasks.size === 1)
     
    -    // Complete task(stageId=1, stageAttempt=1, partitionId=1) successfully.
    +    // Complete task(stageId=1, stageAttempt=2, partitionId=1) successfully.
         runEvent(makeCompletionEvent(
    -      taskSets(3).tasks(1), Success, makeMapStatus("hostB", 2)))
    +      taskSets(4).tasks(0), Success, makeMapStatus("hostB", 2)))
    --- End diff --
    
    Yeah, thanks for the explanation. BTW, what's the JIRA number for the ongoing scheduler integration test work?


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    **[Test build #95034 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95034/testReport)** for PR 21976 at commit [`e384245`](https://github.com/apache/spark/commit/e384245f7b0c6c43e6e0e0f7b73528b5c355e2f1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21976: [SPARK-24909] Spark scheduler can hang when fetch failur...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Yep, the SchedulerIntegrationSuite is what I'm working on modifying. It might be next week by the time I finish anyway.  


---



[GitHub] spark issue #21976: [SPARK-24909] Spark scheduler can hang when fetch failur...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21976: [SPARK-24909] Spark scheduler can hang when fetch failur...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/1686/
    Test PASSed.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    **[Test build #95353 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95353/testReport)** for PR 21976 at commit [`e384245`](https://github.com/apache/spark/commit/e384245f7b0c6c43e6e0e0f7b73528b5c355e2f1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #21976: [SPARK-24909][core] Always unregister pending par...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21976


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    **[Test build #95307 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95307/testReport)** for PR 21976 at commit [`e384245`](https://github.com/apache/spark/commit/e384245f7b0c6c43e6e0e0f7b73528b5c355e2f1).


---



[GitHub] spark issue #21976: [SPARK-24909] Spark scheduler can hang when fetch failur...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Hey @tgravescs, after reading through the discussion on the original PR, I think I agree with you. I tried to hit the original problem described in SPARK-19263 after undoing the fix for SPARK-23433, but wasn't able to. Reading the code, though, what you're saying makes sense, and a unit test was added as part of SPARK-19263, so I guess we're good?
    
    This needs to be merged with master though; and I have a preference for PR titles that explain the fix, not the problem.


---



[GitHub] spark issue #21976: [SPARK-24909] Spark scheduler can hang when fetch failur...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Yeah, we can merge it as is (once I upmerge), and I can add more tests under a separate JIRA. I have been working on it in the background but keep getting distracted by other things.
    
    Can you be more specific about what you are looking for in the PR title? How would you phrase this?



---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    **[Test build #95290 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95290/testReport)** for PR 21976 at commit [`e384245`](https://github.com/apache/spark/commit/e384245f7b0c6c43e6e0e0f7b73528b5c355e2f1).
     * This patch **fails PySpark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #21976: [SPARK-24909] Spark scheduler can hang when fetch failur...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    @squito Note that this is just to get people looking at this. I am working on adding some scheduler integration tests, but I have to extend those to support multiple executors and to allow tasks to finish out of order.
    
    I believe the 3 JIRAs involved here are covered, but we need to think about all the other cases as well.


---



[GitHub] spark issue #21976: [SPARK-24909] Spark scheduler can hang when fetch failur...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    **[Test build #95034 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95034/testReport)** for PR 21976 at commit [`e384245`](https://github.com/apache/spark/commit/e384245f7b0c6c43e6e0e0f7b73528b5c355e2f1).


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    test this please


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95290/
    Test FAILed.


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    **[Test build #95353 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95353/testReport)** for PR 21976 at commit [`e384245`](https://github.com/apache/spark/commit/e384245f7b0c6c43e6e0e0f7b73528b5c355e2f1).


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    test this please


---



[GitHub] spark issue #21976: [SPARK-24909][core] Always unregister pending partition ...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    OK, the issue is not related to this patch; I will file a separate JIRA for it.


---



[GitHub] spark issue #21976: [SPARK-24909] Spark scheduler can hang when fetch failur...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    There is a lot of context here that I need to page back in; sorry, I won't get to this for a few days at least. On testing, though, have you looked at `SchedulerIntegrationSuite`? I was hoping we could use it to cover cases like this. Perhaps it's not exposing the right handles we need, but then maybe we could fix it.


---



[GitHub] spark issue #21976: [SPARK-24909] Spark scheduler can hang when fetch failur...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21976
  
    **[Test build #94039 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94039/testReport)** for PR 21976 at commit [`5464614`](https://github.com/apache/spark/commit/54646148730462e34a32d81200530cf50dbf7a51).


---



[GitHub] spark pull request #21976: [SPARK-24909][core] Always unregister pending par...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21976#discussion_r213056190
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
    @@ -2474,19 +2478,21 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
         runEvent(makeCompletionEvent(
           taskSets(3).tasks(0), Success, makeMapStatus("hostB", 2)))
     
    -    // There should be no new attempt of stage submitted,
    -    // because task(stageId=1, stageAttempt=1, partitionId=1) is still running in
    -    // the current attempt (and hasn't completed successfully in any earlier attempts).
    -    assert(taskSets.size === 4)
    +    // At this point there should be no active task set for stageId=1 and we need
    +    // to resubmit because the output from (stageId=1, stageAttemptId=0, partitionId=1)
    +    // was ignored due to executor failure
    +    assert(taskSets.size === 5)
    +    assert(taskSets(4).stageId === 1 && taskSets(4).stageAttemptId === 2
    +      && taskSets(4).tasks.size === 1)
     
    -    // Complete task(stageId=1, stageAttempt=1, partitionId=1) successfully.
    +    // Complete task(stageId=1, stageAttempt=2, partitionId=1) successfully.
         runEvent(makeCompletionEvent(
    -      taskSets(3).tasks(1), Success, makeMapStatus("hostB", 2)))
    +      taskSets(4).tasks(0), Success, makeMapStatus("hostB", 2)))
    --- End diff --
    
    Yes, it will. Marking either of these successful will work, but the assumption on line 2469 is that the TaskSetManager already marked it completed there. So we don't want to send a success for taskSets(3).tasks(1), since it should already have been marked successful.
    
    Unfortunately, you can't test these interactions in this unit test; that is why I'm working on another scheduler integration test, which I was going to add under a separate JIRA.
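
A minimal sketch of the bookkeeping this thread discusses, assuming a toy stage model rather than Spark's real scheduler classes. `ToyMapStage`, `submitAttempt`, `onTaskSuccess`, and `hostLost` are hypothetical names introduced only for illustration. The sketch shows the behavior the PR title describes: a successful task always unregisters its pending partition, output from a lost host is ignored, and the stage is resubmitted once nothing is pending but output is still missing.

```scala
import scala.collection.mutable

// Toy model only: ToyMapStage and its members are hypothetical names,
// not Spark's DAGScheduler or ShuffleMapStage API.
final class ToyMapStage(numPartitions: Int) {
  // Partitions the current stage attempt is still waiting on.
  val pendingPartitions: mutable.Set[Int] = mutable.Set.empty[Int]
  // partitionId -> host holding a usable map output.
  val outputs: mutable.Map[Int, String] = mutable.Map.empty[Int, String]
  // Number of stage attempts submitted so far.
  var attempts: Int = 0

  // Partitions that have no registered output yet.
  def missingPartitions: Seq[Int] =
    (0 until numPartitions).filterNot(outputs.contains)

  // Start a new stage attempt covering only partitions without output.
  def submitAttempt(): Seq[Int] = {
    val missing = missingPartitions
    pendingPartitions.clear()
    pendingPartitions ++= missing
    attempts += 1
    missing
  }

  // Handle a successful task. Returns true if the stage must be resubmitted.
  def onTaskSuccess(partitionId: Int, host: String, hostLost: Boolean): Boolean = {
    // Always unregister the partition, whichever attempt the task came from.
    pendingPartitions -= partitionId
    // Output written on a lost host is ignored, so it is never registered.
    if (!hostLost) outputs(partitionId) = host
    // Nothing left pending, yet output is still missing: resubmit.
    pendingPartitions.isEmpty && missingPartitions.nonEmpty
  }
}

object PendingPartitionsDemo {
  def main(args: Array[String]): Unit = {
    val stage = new ToyMapStage(numPartitions = 2)
    stage.submitAttempt()
    stage.onTaskSuccess(0, "hostB", hostLost = false)
    // The success arrives from a lost host, so its output is ignored.
    val resubmit = stage.onTaskSuccess(1, "hostA", hostLost = true)
    println(s"resubmit=$resubmit, missing=${stage.missingPartitions}")
  }
}
```

In this toy model, skipping the `pendingPartitions -= partitionId` step for the ignored success would leave partition 1 pending forever with no task running for it, which is one way to picture the hang named in the PR title.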


---



[GitHub] spark pull request #21976: [SPARK-24909][core] Always unregister pending par...

Posted by jiangxb1987 <gi...@git.apache.org>.
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21976#discussion_r213042176
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
    @@ -2474,19 +2478,21 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with TimeLi
         runEvent(makeCompletionEvent(
           taskSets(3).tasks(0), Success, makeMapStatus("hostB", 2)))
     
    -    // There should be no new attempt of stage submitted,
    -    // because task(stageId=1, stageAttempt=1, partitionId=1) is still running in
    -    // the current attempt (and hasn't completed successfully in any earlier attempts).
    -    assert(taskSets.size === 4)
    +    // At this point there should be no active task set for stageId=1 and we need
    +    // to resubmit because the output from (stageId=1, stageAttemptId=0, partitionId=1)
    +    // was ignored due to executor failure
    +    assert(taskSets.size === 5)
    +    assert(taskSets(4).stageId === 1 && taskSets(4).stageAttemptId === 2
    +      && taskSets(4).tasks.size === 1)
     
    -    // Complete task(stageId=1, stageAttempt=1, partitionId=1) successfully.
    +    // Complete task(stageId=1, stageAttempt=2, partitionId=1) successfully.
         runEvent(makeCompletionEvent(
    -      taskSets(3).tasks(1), Success, makeMapStatus("hostB", 2)))
    +      taskSets(4).tasks(0), Success, makeMapStatus("hostB", 2)))
    --- End diff --
    
    IIUC, the test case should still pass without changing this line, right?


---
