You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by ilganeli <gi...@git.apache.org> on 2015/02/20 08:47:42 UTC

[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

GitHub user ilganeli opened a pull request:

    https://github.com/apache/spark/pull/4703

    [SPARK-4655] Split Stage into ShuffleMapStage and ResultStage subclasses

    Hi all - this patch includes two main efforts:
    
    1) I've split up Stage into ShuffleMapStage and ResultStage and updated their usage within DAGScheduler
    2) While doing this, I ended up cleaning up a lot of the long and awkward functions within the DAGScheduler since I needed to do that to make sense of the code. I believe this improved readability of the code significantly - the only problem is that it's not straightforward to diff those changes side by side (particularly the updates to handleFetchFailure(). 
    
    I wanted to confirm that it's appropriate to move the outputLocs variable and its associated methods into ShuffleMapStage. The one function of concern is within the completeStage() and handleFailedTasks() functions where I wanted to confirm that we are guaranteed for that stage to be a ShuffleMapStage
    
    I believe I've split up the functionality of ResultStage and ShuffleMapStage appropriately but I wanted to make sure I'm not missing something.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ilganeli/spark SPARK-4653

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4703.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4703
    
----
commit 424fdef9578b4450ec75c8d84db98927e4b15831
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2015-02-18T01:19:20Z

    Began refactoring Stage.scala

commit 562b688d07a9404196c6856b5f3302c31b05bb2d
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2015-02-19T00:54:36Z

    Significant refactoring within DAG Scheduler class

commit 75fe74f616adfa4e2da7a439fc86f784265aae84
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2015-02-19T00:58:34Z

    Merge remote-tracking branch 'upstream/master' into SPARK-4653

commit 5e76fb8b21ba877afe4a9255036b8120691aeda8
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2015-02-19T01:06:23Z

    Small updates

commit cb66a4fdf23883c24a7684b6f20ca532c74241a7
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2015-02-19T18:08:14Z

    Further refactoring. Moved outputLocs and associated functions to inside ShuffleMapStage.

commit e3e7ea23584d3fb442ede54231aa048815a8f436
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2015-02-19T18:11:15Z

    Minor formatting

commit 149b9a13e48371a1ea06a3fc85374fa026368cd8
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2015-02-19T22:06:21Z

    Refactored more extremely large methods to be more modular. Cleaned up some usages of vs nonEmpty.

commit be3e01fad73a3d1124ad090acbef6f38379afc4e
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2015-02-19T22:58:56Z

    Minor fixes

commit 00182c7e7f9a38d5eee86065171804acc52e6d2a
Author: Ilya Ganelin <il...@capitalone.com>
Date:   2015-02-20T07:43:59Z

    Cleaned up hanleTaskCompletion more

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75307867
  
    What do you think about performing only the Stage split in this PR and deferring the other code movement to a separate one?  That would make this much easier to review, since I'm pretty sure that the Stage split itself should involve comparatively few changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75304156
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27787/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75284266
  
      [Test build #27780 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27780/consoleFull) for   PR 4703 at commit [`09a3437`](https://github.com/apache/spark/commit/09a343746602ec8cb6630de9d3308513ff73eea0).
     * This patch **does not merge cleanly**.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75311331
  
    Hi @ilganeli,
    
    Those other refactoring might be welcome, since the some of the mega-functions in DAGScheduler can be pretty hard to read.  Let's do those separately, though.  My conservatism here stems from the fact that DAGScheduler is a giant piece of complex code that only a handful of people understand well.  Given that there might be objections to performing _any_ unnecessary refactoring due to the potential to introduce bugs, I think it's best to work very gradually and perform many small, separate sets of changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4703#discussion_r25057714
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -229,41 +227,56 @@ class DAGScheduler(
     
       /**
        * Create a Stage -- either directly for use as a result stage, or as part of the (re)-creation
    -   * of a shuffle map stage in newOrUsedStage.  The stage will be associated with the provided
    -   * jobId. Production of shuffle map stages should always use newOrUsedStage, not newStage
    -   * directly.
    +   * of a shuffle map stage in newOrUsedShuffleStage.  The stage will be associated with the provided
    +   * jobId. Production of shuffle map stages should always use newOrUsedShuffleStage, 
    +   * not getAndRegNewShuffleMap directly.
        */
    -  private def newStage(
    -      rdd: RDD[_],
    -      numTasks: Int,
    -      shuffleDep: Option[ShuffleDependency[_, _, _]],
    -      jobId: Int,
    -      callSite: CallSite)
    -    : Stage =
    -  {
    +  private def getAndRegNewShuffleMap(rdd: RDD[_],
    +                                     numTasks: Int,
    +                                     shuffleDep: ShuffleDependency[_, _, _],
    +                                     jobId: Int,
    +                                     callSite: CallSite): ShuffleMapStage = {
         val parentStages = getParentStages(rdd, jobId)
    -    val id = nextStageId.getAndIncrement()
    -    val stage = new Stage(id, rdd, numTasks, shuffleDep, parentStages, jobId, callSite)
    +    val id = nextStageId.getAndIncrement
    +    val stage: ShuffleMapStage = new ShuffleMapStage(id, rdd, numTasks, parentStages, 
    +      jobId, callSite, shuffleDep)
    +      
         stageIdToStage(id) = stage
         updateJobIdStageIdMaps(jobId, stage)
         stage
       }
     
       /**
    +   * Create a Stage -- either directly for use as a result stage, or as part of the (re)-creation
    +   * of a shuffle map stage in newOrUsedShuffleStage.  The stage will be associated with the provided
    +   * jobId. Production of shuffle map stages should always use newOrUsedShuffleStage, 
    +   * not getAndRegNewResults directly.
    +   */
    +  private def getAndRegNewResults(rdd: RDD[_],
    --- End diff --
    
    I'm not a fan of the `getAndReg` addition since `get` falsely implies some kind of look up, and `reg` is more-or-less an implementation detail that across future refactorings won't necessarily be a fixed part of the creation of new stages.  The new, longer naming just doesn't add anything positive, imo.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75288542
  
      [Test build #27783 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27783/consoleFull) for   PR 4703 at commit [`0a4109e`](https://github.com/apache/spark/commit/0a4109ecb1982130e7f561e8f72d47c6fd0cd9e0).
     * This patch **does not merge cleanly**.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75201642
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27771/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75286815
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27782/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4703#discussion_r25056762
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -229,41 +227,56 @@ class DAGScheduler(
     
       /**
        * Create a Stage -- either directly for use as a result stage, or as part of the (re)-creation
    -   * of a shuffle map stage in newOrUsedStage.  The stage will be associated with the provided
    -   * jobId. Production of shuffle map stages should always use newOrUsedStage, not newStage
    -   * directly.
    +   * of a shuffle map stage in newOrUsedShuffleStage.  The stage will be associated with the provided
    +   * jobId. Production of shuffle map stages should always use newOrUsedShuffleStage, 
    +   * not getAndRegNewShuffleMap directly.
        */
    -  private def newStage(
    -      rdd: RDD[_],
    -      numTasks: Int,
    -      shuffleDep: Option[ShuffleDependency[_, _, _]],
    -      jobId: Int,
    -      callSite: CallSite)
    -    : Stage =
    -  {
    +  private def getAndRegNewShuffleMap(rdd: RDD[_],
    +                                     numTasks: Int,
    +                                     shuffleDep: ShuffleDependency[_, _, _],
    +                                     jobId: Int,
    +                                     callSite: CallSite): ShuffleMapStage = {
         val parentStages = getParentStages(rdd, jobId)
    -    val id = nextStageId.getAndIncrement()
    -    val stage = new Stage(id, rdd, numTasks, shuffleDep, parentStages, jobId, callSite)
    +    val id = nextStageId.getAndIncrement
    +    val stage: ShuffleMapStage = new ShuffleMapStage(id, rdd, numTasks, parentStages, 
    +      jobId, callSite, shuffleDep)
    +      
         stageIdToStage(id) = stage
         updateJobIdStageIdMaps(jobId, stage)
         stage
       }
     
       /**
    +   * Create a Stage -- either directly for use as a result stage, or as part of the (re)-creation
    +   * of a shuffle map stage in newOrUsedShuffleStage.  The stage will be associated with the provided
    +   * jobId. Production of shuffle map stages should always use newOrUsedShuffleStage, 
    +   * not getAndRegNewResults directly.
    +   */
    +  private def getAndRegNewResults(rdd: RDD[_],
    +                                     numTasks: Int,
    +                                     jobId: Int,
    +                                     callSite: CallSite): ResultStage = {
    --- End diff --
    
    Indentation of this one's even worse than the others.  Please fix all your `def`s to adhere to the Style Guide.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4703#discussion_r25058302
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -912,6 +959,196 @@ class DAGScheduler(
       }
     
       /**
    +   * Helper function to handle stage completion
    +   * @param stage
    +   * @param errorMessage
    +   * @return
    +   */
    +  private def markStageAsFinished(stage: Stage, errorMessage: Option[String] = None) = {
    +    val serviceTime = stage.latestInfo.submissionTime match {
    +      case Some(t) => "%.03f".format((clock.getTime() - t) / 1000.0)
    +      case _ => "Unknown"
    +    }
    +    if (errorMessage.isEmpty) {
    +      logInfo("%s (%s) finished in %s s".format(stage, stage.name, serviceTime))
    +      stage.latestInfo.completionTime = Some(clock.getTime())
    +    } else {
    +      stage.latestInfo.stageFailed(errorMessage.get)
    +      logInfo("%s (%s) failed in %s s".format(stage, stage.name, serviceTime))
    +    }
    +    listenerBus.post(SparkListenerStageCompleted(stage.latestInfo))
    +    runningStages -= stage
    +  }
    +
    +  /**
    +   * Helper functino to process a ResultTask completion
    --- End diff --
    
    typo


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4703#discussion_r25056104
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/Stage.scala ---
    @@ -77,53 +70,16 @@ private[spark] class Stage(
       /** Pointer to the latest [StageInfo] object, set by DAGScheduler. */
       var latestInfo: StageInfo = StageInfo.fromStage(this)
     
    +  var numAvailableOutputs = 0
    +
       def isAvailable: Boolean = {
    -    if (!isShuffleMap) {
    +    if (!this.isInstanceOf[ShuffleMapStage]) {
           true
         } else {
           numAvailableOutputs == numPartitions
         }
       }
    --- End diff --
    
    Why not leave this abstract with just one branch of the if in each override?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75318888
  
      [Test build #27788 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27788/consoleFull) for   PR 4703 at commit [`8ed51d8`](https://github.com/apache/spark/commit/8ed51d81513b36289bbd76164b6a705ef1889ba3).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4703#discussion_r25058069
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -484,7 +505,7 @@ class DAGScheduler(
               "Total number of partitions: " + maxPartitions)
         }
     
    -    val jobId = nextJobId.getAndIncrement()
    +    val jobId = nextJobId.getAndIncrement
    --- End diff --
    
    mutator should include the `()`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75304153
  
      [Test build #27787 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27787/consoleFull) for   PR 4703 at commit [`fe4e137`](https://github.com/apache/spark/commit/fe4e1378f3a56b2d633a5beabe324eede24c43dd).
     * This patch **fails RAT tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75288560
  
      [Test build #27783 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27783/consoleFull) for   PR 4703 at commit [`0a4109e`](https://github.com/apache/spark/commit/0a4109ecb1982130e7f561e8f72d47c6fd0cd9e0).
     * This patch **fails RAT tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75304143
  
      [Test build #27787 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27787/consoleFull) for   PR 4703 at commit [`fe4e137`](https://github.com/apache/spark/commit/fe4e1378f3a56b2d633a5beabe324eede24c43dd).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75318899
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27788/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75320629
  
    Josh, what I'll do is create a new branch and port the changes individually and leave this branch intact as a reference. I'll submit a separate PR for that work. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-85319636
  
    Hey @ilganeli would you mind closing this one since reviews are going on in the newer one?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4703#discussion_r25055713
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/Stage.scala ---
    @@ -47,26 +47,19 @@ import org.apache.spark.util.CallSite
      * be updated for each attempt.
      *
      */
    -private[spark] class Stage(
    -    val id: Int,
    -    val rdd: RDD[_],
    -    val numTasks: Int,
    -    val shuffleDep: Option[ShuffleDependency[_, _, _]],  // Output shuffle if stage is a map stage
    -    val parents: List[Stage],
    -    val jobId: Int,
    -    val callSite: CallSite)
    +private[spark] abstract class Stage(val id: Int,
    +                                    val rdd: RDD[_],
    +                                    val numTasks: Int,
    +                                    val parents: List[Stage],
    +                                    val jobId: Int,
    +                                    val callSite: CallSite)
    --- End diff --
    
    Follow the prior indentation style, please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75201638
  
      [Test build #27771 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27771/consoleFull) for   PR 4703 at commit [`00182c7`](https://github.com/apache/spark/commit/00182c7e7f9a38d5eee86065171804acc52e6d2a).
     * This patch **fails RAT tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4703#discussion_r25056356
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/Stage.scala ---
    @@ -132,9 +88,7 @@ private[spark] class Stage(
       }
     
       def attemptId: Int = nextAttemptId
    -
    -  override def toString = "Stage " + id
    -
    +  
       override def hashCode(): Int = id
     
       override def equals(other: Any): Boolean = other match {
    --- End diff --
    
    `hashCode` and `equals` should probably be `final`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75284298
  
      [Test build #27780 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27780/consoleFull) for   PR 4703 at commit [`09a3437`](https://github.com/apache/spark/commit/09a343746602ec8cb6630de9d3308513ff73eea0).
     * This patch **fails RAT tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75327639
  
    I've submitted a new PR here:
    https://github.com/apache/spark/pull/4708


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75286811
  
      [Test build #27782 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27782/consoleFull) for   PR 4703 at commit [`6266165`](https://github.com/apache/spark/commit/6266165ee44587621cde059e21648b0c4a9f2dfd).
     * This patch **fails RAT tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75288566
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27783/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75284304
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27780/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75201629
  
      [Test build #27771 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27771/consoleFull) for   PR 4703 at commit [`00182c7`](https://github.com/apache/spark/commit/00182c7e7f9a38d5eee86065171804acc52e6d2a).
     * This patch **does not merge cleanly**.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75306668
  
      [Test build #27788 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27788/consoleFull) for   PR 4703 at commit [`8ed51d8`](https://github.com/apache/spark/commit/8ed51d81513b36289bbd76164b6a705ef1889ba3).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli closed the pull request at:

    https://github.com/apache/spark/pull/4703


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4703#discussion_r25057992
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -306,26 +319,31 @@ class DAGScheduler(
           }
         }
         waitingForVisit.push(rdd)
    -    while (!waitingForVisit.isEmpty) {
    +    while (waitingForVisit.nonEmpty) {
           visit(waitingForVisit.pop())
         }
         parents.toList
       }
    -
    -  // Find ancestor missing shuffle dependencies and register into shuffleToMapStage
    +  
    +  /**
    +   * Find ancestor missing shuffle dependencies and register into shuffleToMapStage 
    +   * @param shuffleDep
    +   * @param jobId
    --- End diff --
    
    Placeholder doc annotations aren't helpful for `private` methods.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75286796
  
      [Test build #27782 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27782/consoleFull) for   PR 4703 at commit [`6266165`](https://github.com/apache/spark/commit/6266165ee44587621cde059e21648b0c4a9f2dfd).
     * This patch **does not merge cleanly**.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4655] Split Stage into ShuffleMapStage ...

Posted by ilganeli <gi...@git.apache.org>.
Github user ilganeli commented on the pull request:

    https://github.com/apache/spark/pull/4703#issuecomment-75308162
  
    Hi @JoshRosen, I'm down for making your life easier. Would you still be interested in the other refactoring that I did as part of this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org