You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by kayousterhout <gi...@git.apache.org> on 2016/06/13 19:44:13 UTC

[GitHub] spark pull request #13646: [SPARK-15927] Eliminate redundant DAGScheduler co...

GitHub user kayousterhout opened a pull request:

    https://github.com/apache/spark/pull/13646

    [SPARK-15927] Eliminate redundant DAGScheduler code.

    ## What changes were proposed in this pull request?
    
    To try to eliminate redundant code to traverse the RDD dependency graph, this PR creates a new function getShuffleDependencies that returns shuffle dependencies that are immediate parents of a given RDD.  This new function is used by getParentStages and getAncestorShuffleDependencies.
    
    cc @squito @markhamstra  
    
    FYI @rxin 


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kayousterhout/spark-1 SPARK-15927

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13646.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13646
    
----
commit 5078bb2037201cbcd51223d2acf971a7a7fcc0cb
Author: Kay Ousterhout <ka...@gmail.com>
Date:   2016-06-10T23:12:02Z

    [SPARK-15927] Eliminate redundant DAGScheduler code.

commit 42a8d16ed0b7e8175a58d1d6fa21685cc36c85c2
Author: Kay Ousterhout <ka...@gmail.com>
Date:   2016-06-13T19:43:32Z

    Improved method comment

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    **[Test build #60432 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60432/consoleFull)** for PR 13646 at commit [`3e47166`](https://github.com/apache/spark/commit/3e471665505ba0b259fcd7b43333a69d2c4ae1f5).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60509/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13646: [SPARK-15927] Eliminate redundant DAGScheduler co...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13646#discussion_r67037654
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -378,31 +378,9 @@ class DAGScheduler(
        * the provided firstJobId.
        */
       private def getParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
    -    val parents = new HashSet[Stage]
    -    val visited = new HashSet[RDD[_]]
    -    // We are manually maintaining a stack here to prevent StackOverflowError
    -    // caused by recursively visiting
    -    val waitingForVisit = new Stack[RDD[_]]
    -    def visit(r: RDD[_]) {
    -      if (!visited(r)) {
    -        visited += r
    -        // Kind of ugly: need to register RDDs with the cache here since
    -        // we can't do it in its constructor because # of partitions is unknown
    --- End diff --
    
    Yeah I decided that it was probably an accident that it's here -- as you said, I think it only belongs in newOrUsedShuffleStage.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    Jenkins, retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    yikes, thanks for pointing this out @JoshRosen .  Sorry I was wrong about the failure being unrelated -- it didn't create a new bug, but took an old rare problem and made it far more common.  I have opened https://github.com/apache/spark/pull/13688.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13646: [SPARK-15927] Eliminate redundant DAGScheduler co...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/13646


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    **[Test build #60432 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60432/consoleFull)** for PR 13646 at commit [`3e47166`](https://github.com/apache/spark/commit/3e471665505ba0b259fcd7b43333a69d2c4ae1f5).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    **[Test build #60419 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60419/consoleFull)** for PR 13646 at commit [`42a8d16`](https://github.com/apache/spark/commit/42a8d16ed0b7e8175a58d1d6fa21685cc36c85c2).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    **[Test build #60509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60509/consoleFull)** for PR 13646 at commit [`edb2985`](https://github.com/apache/spark/commit/edb29859a611939aba4c81ff13607ee2a74b2d75).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13646: [SPARK-15927] Eliminate redundant DAGScheduler co...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13646#discussion_r67005952
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -412,25 +390,48 @@ class DAGScheduler(
         // We are manually maintaining a stack here to prevent StackOverflowError
         // caused by recursively visiting
         val waitingForVisit = new Stack[RDD[_]]
    -    def visit(r: RDD[_]) {
    -      if (!visited(r)) {
    -        visited += r
    -        for (dep <- r.dependencies) {
    -          dep match {
    -            case shufDep: ShuffleDependency[_, _, _] =>
    -              if (!shuffleToMapStage.contains(shufDep.shuffleId)) {
    -                parents.push(shufDep)
    -              }
    -            case _ =>
    -          }
    -          waitingForVisit.push(dep.rdd)
    +    waitingForVisit.push(rdd)
    +    while (waitingForVisit.nonEmpty) {
    +      val toVisit = waitingForVisit.pop()
    +      if (!visited(toVisit)) {
    +        visited += toVisit
    +        getShuffleDependencies(toVisit).foreach { shuffleDep =>
    +          if (!shuffleToMapStage.contains(shuffleDep.shuffleId)) {
    +            parents.push(shuffleDep)
    +            waitingForVisit.push(shuffleDep.rdd)
    +          } // Otherwise, the dependency and it's ancestors have already been registered.
             }
           }
         }
    +    parents
    --- End diff --
    
    while you're touching this, would be nice to rename `parents` to `ancestors` here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60518/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    **[Test build #60509 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60509/consoleFull)** for PR 13646 at commit [`edb2985`](https://github.com/apache/spark/commit/edb29859a611939aba4c81ff13607ee2a74b2d75).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    Merged into master (not 2.0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    lgtm!
    
    not sure about the test timeout.  I will take a look, but in any case I'm confident its not from this change.  I've tried running that test in a loop but haven't triggered a failure in over 10k trials so far :(


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    **[Test build #60518 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60518/consoleFull)** for PR 13646 at commit [`edb2985`](https://github.com/apache/spark/commit/edb29859a611939aba4c81ff13607ee2a74b2d75).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    **[Test build #60419 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60419/consoleFull)** for PR 13646 at commit [`42a8d16`](https://github.com/apache/spark/commit/42a8d16ed0b7e8175a58d1d6fa21685cc36c85c2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    LGTM, but I agree with Imran's renaming suggestion, and his new test looks good.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    I think that @cloud-fan proposed a similar change in #4134, so I think we might be able to resolve SPARK-5374 as a duplicate of this JIRA.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    Ah shoot I see why this would make the stage ID numbering potentially different.  Sorry about that!!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13646: [SPARK-15927] Eliminate redundant DAGScheduler co...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13646#discussion_r66861452
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -412,25 +390,48 @@ class DAGScheduler(
         // We are manually maintaining a stack here to prevent StackOverflowError
         // caused by recursively visiting
         val waitingForVisit = new Stack[RDD[_]]
    -    def visit(r: RDD[_]) {
    -      if (!visited(r)) {
    -        visited += r
    -        for (dep <- r.dependencies) {
    -          dep match {
    -            case shufDep: ShuffleDependency[_, _, _] =>
    -              if (!shuffleToMapStage.contains(shufDep.shuffleId)) {
    -                parents.push(shufDep)
    -              }
    -            case _ =>
    -          }
    -          waitingForVisit.push(dep.rdd)
    +    waitingForVisit.push(rdd)
    +    while (waitingForVisit.nonEmpty) {
    +      val toVisit = waitingForVisit.pop()
    +      if (!visited(toVisit)) {
    +        visited += toVisit
    +        getShuffleDependencies(toVisit).foreach { shuffleDep =>
    +          if (!shuffleToMapStage.contains(shuffleDep.shuffleId)) {
    +            parents.push(shuffleDep)
    +            waitingForVisit.push(shuffleDep.rdd)
    +          } // Otherwise, the dependency and it's ancestors have already been registered.
             }
           }
         }
    +    parents
    +  }
     
    +  /**
    +   * Returns shuffle dependencies that are immediate parents of the given RDD.
    +   *
    +   * This function will not return more distant ancestors.  For example, if C has a shuffle
    +   * dependency on B which has a shuffle dependency on A:
    +   *
    +   * A <-- B <-- C
    +   *
    +   * calling this function with rdd C will only return the B <-- C dependency.
    +   */
    +  private def getShuffleDependencies(rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
    +    val parents = new HashSet[ShuffleDependency[_, _, _]]
    +    val visited = new HashSet[RDD[_]]
    +    val waitingForVisit = new Stack[RDD[_]]
         waitingForVisit.push(rdd)
         while (waitingForVisit.nonEmpty) {
    -      visit(waitingForVisit.pop())
    +      val toVisit = waitingForVisit.pop()
    +      if (!visited(toVisit)) {
    +        visited += toVisit
    +        toVisit.dependencies.foreach {
    +          case shuffleDep: ShuffleDependency[_, _, _] =>
    +            parents += shuffleDep
    +          case dependency: Any =>
    --- End diff --
    
    nit: we can remove `: Any`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    It looks like a bunch of the tests in `BasicSchedulerIntegrationSuite` are now really flaky. For example: https://spark-tests.appspot.com/tests/org.apache.spark.scheduler.BasicSchedulerIntegrationSuite/multi-stage%20job
    
    Is there any way that we can hotfix / merge in the small fix for this sooner rather than later, perhaps by separating it from the blacklist test PR into its own PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #13646: [SPARK-15927] Eliminate redundant DAGScheduler co...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13646#discussion_r67006526
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -378,31 +378,9 @@ class DAGScheduler(
        * the provided firstJobId.
        */
       private def getParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
    -    val parents = new HashSet[Stage]
    -    val visited = new HashSet[RDD[_]]
    -    // We are manually maintaining a stack here to prevent StackOverflowError
    -    // caused by recursively visiting
    -    val waitingForVisit = new Stack[RDD[_]]
    -    def visit(r: RDD[_]) {
    -      if (!visited(r)) {
    -        visited += r
    -        // Kind of ugly: need to register RDDs with the cache here since
    -        // we can't do it in its constructor because # of partitions is unknown
    --- End diff --
    
    this comment is getting lost ... but I think that is OK, I have no idea what it is referring to here.  Its also in `newOrUsedShuffleStage`, which is probably the only place it belongs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60432/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    Thanks for the comments / test @squito!  Updated the naming and the test. 
    
    Once this is merged, I have another DAGScheduler PR that does some more renaming / commenting / re-organizing to try to make some of the code easier to read / understand.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    lgtm.  I realized there isn't really any test for going through a mix of narrow and shuffle dependencies -- might be nice to add this (it passes w/ this change).  You would need to change the visibility of `getShuffleDependencies` to `private[scheduler]` though:
    
    ```scala
      /**
       * Ensure we get all parent shuffle stages.  Walk back all narrow dependencies, but don't go
       * past any shuffle dependencies (parent stages only, not ancestor stages).
       */
      test("getShuffleDependencies") {
        val rddA = new MyRDD(sc, 2, Nil)
        val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(1))
        val rddB = new MyRDD(sc, 2, Nil)
        val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(1))
        val rddC = new MyRDD(sc, 1, List(shuffleDepB))
        val shuffleDepC = new ShuffleDependency(rddC, new HashPartitioner(1))
        val rddD = new MyRDD(sc, 1, List(shuffleDepC))
        val narrowDepD = new OneToOneDependency(rddD)
        val finalRdd = new MyRDD(sc, 1, List(shuffleDepA, narrowDepD), tracker = mapOutputTracker)
    
        assert(scheduler.getShuffleDependencies(rddA) === Set())
        assert(scheduler.getShuffleDependencies(rddB) === Set())
        assert(scheduler.getShuffleDependencies(rddC) === Set(shuffleDepB))
        assert(scheduler.getShuffleDependencies(rddD) === Set(shuffleDepC))
        assert(scheduler.getShuffleDependencies(finalRdd) === Set(shuffleDepA, shuffleDepC))
      }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    @kayousterhout btw I tracked down the test issue.  It wasn't safe to check the map output status inside the backend in those tests -- I addressed that here: https://github.com/apache/spark/pull/13565/commits/91ea3df33aaebb20b2df0cbbe56dbc99975200ea.  Also the real problem was somewhat hidden b/c the assertion was failing in another thread, which I fixed here: https://github.com/apache/spark/pull/13565/commits/e35a7d3d0696fd0c864ee8fa9c9ded95abffba2c
    
    I ran the tests over 15k times on my laptop and still never triggered it, though.  wish I had a better way to find these issues.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    **[Test build #60518 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/60518/consoleFull)** for PR 13646 at commit [`edb2985`](https://github.com/apache/spark/commit/edb29859a611939aba4c81ff13607ee2a74b2d75).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/60419/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #13646: [SPARK-15927] Eliminate redundant DAGScheduler code.

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/13646
  
    (@squito this looks like another timeout in the basic scheduler integration suite? I tried it locally and it passed)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org