Posted to reviews@spark.apache.org by viirya <gi...@git.apache.org> on 2014/07/15 14:07:42 UTC

[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/1418

    [SPARK-2490] Change recursive visiting on RDD dependencies to iterative approach

    
    When performing transformations on an RDD over many iterations, the RDD's chain of dependencies can grow very long. Recursively visiting these dependencies in Spark core then easily causes a StackOverflowError. For example:
    
        var rdd = sc.makeRDD(Array(1))
        for (i <- 1 to 1000) { 
          rdd = rdd.coalesce(1).cache()
          rdd.collect()
        }
    
    This PR changes the recursive visiting of an RDD's dependencies to an iterative approach to avoid the StackOverflowError.
    
    Beyond the recursive visiting, the Java serializer has a known [bug](http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4152790) that also causes a StackOverflowError when serializing/deserializing a large graph of objects, so this PR only solves part of the problem. Replacing the Java serializer with KryoSerializer might help. However, since KryoSerializer is not currently supported for `spark.closure.serializer`, I cannot test whether it solves the Java serializer's problem completely.
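    
    For reference, here is a minimal sketch of the iterative traversal pattern this PR adopts. It is an illustration only, not the actual `DAGScheduler` code (the real change touches `getParentStages` and related methods); `visitDependenciesIteratively` and `handle` are hypothetical names.
    
        import scala.collection.mutable.{HashSet, Stack}
        import org.apache.spark.rdd.RDD
    
        // Walk an RDD's dependency graph with an explicit stack instead of recursion,
        // so traversal depth is bounded by heap memory rather than the JVM call stack.
        def visitDependenciesIteratively(root: RDD[_])(handle: RDD[_] => Unit): Unit = {
          val visited = new HashSet[RDD[_]]
          val waitingForVisit = new Stack[RDD[_]]
          waitingForVisit.push(root)
          while (!waitingForVisit.isEmpty) {
            val rdd = waitingForVisit.pop()
            if (!visited(rdd)) {
              visited += rdd
              handle(rdd)
              // Push parents instead of recursing into them.
              for (dep <- rdd.dependencies) {
                waitingForVisit.push(dep.rdd)
              }
            }
          }
        }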
    
    
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 remove_recursive_visit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1418.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1418
    
----
commit 900538bbcb61683bf1418534c2466463a630569f
Author: Liang-Chi Hsieh <vi...@gmail.com>
Date:   2014-07-15T10:58:45Z

    change recursive visiting on rdd's dependencies to iterative approach to avoid stackoverflowerror.

----



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50250213
  
    QA results for PR 1418:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17228/consoleFull



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-49794355
  
    Jenkins, ok to test.



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50922415
  
    QA results for PR 1418:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17687/consoleFull



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1418#discussion_r15591064
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -265,6 +275,7 @@ class DAGScheduler(
       private def getParentStages(rdd: RDD[_], jobId: Int): List[Stage] = {
         val parents = new HashSet[Stage]
         val visited = new HashSet[RDD[_]]
    +    val waitingForVisit = new Stack[RDD[_]]
    --- End diff --
    
    OK. I add a commit for that.



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50250223
  
    Hey @viirya, instead of modifying the PageRank example, what do you think of leaving it as-is until we have automatic checkpointing of long lineage chains? I think that will be better because the example is meant to be easy to understand and not that many people will run PageRank with hundreds of iterations (this particular algorithm usually converges much faster).



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1418#discussion_r15609962
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -275,18 +287,53 @@ class DAGScheduler(
                 case shufDep: ShuffleDependency[_, _, _] =>
                   parents += getShuffleMapStage(shufDep, jobId)
                 case _ =>
    -              visit(dep.rdd)
    +              waitingForVisit.push(dep.rdd)
               }
             }
           }
         }
    -    visit(rdd)
    +    waitingForVisit.push(rdd)
    +    while (!waitingForVisit.isEmpty) {
    +      visit(waitingForVisit.pop())
    +    }
         parents.toList
       }
     
    +  private def getParentShuffleDependencies(rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]] = {
    --- End diff --
    
    This should actually be called getAncestorShuffleDependencies, because it finds not just direct parents but grandparents and beyond as well.
    
    Also, please add a comment at the top of this saying what it finds. In particular, it only finds missing ones.
    
    Finally, would it be possible to merge this with the code that calls it in getShuffleMapStage, e.g. have a method called registerShuffleDependencies? That might be easier to follow.



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50428730
  
    QA tests have started for PR 1418. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17331/consoleFull



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50250276
  
    BTW, you can run `sbt scalastyle` to check these style issues locally.



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1418#discussion_r15563247
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -265,6 +275,7 @@ class DAGScheduler(
       private def getParentStages(rdd: RDD[_], jobId: Int): List[Stage] = {
         val parents = new HashSet[Stage]
         val visited = new HashSet[RDD[_]]
    +    val waitingForVisit = new Stack[RDD[_]]
    --- End diff --
    
    Maybe add a comment that we are manually maintaining a stack to prevent StackOverflowError



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50923486
  
    Thanks Liang-Chi; I've merged this in.



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1418#discussion_r15558011
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -275,18 +286,51 @@ class DAGScheduler(
                 case shufDep: ShuffleDependency[_, _, _] =>
                   parents += getShuffleMapStage(shufDep, jobId)
                 case _ =>
    -              visit(dep.rdd)
    +              waitingForVisit.push(dep.rdd)
               }
             }
           }
         }
    -    visit(rdd)
    +    waitingForVisit.push(rdd)
    +    while (!waitingForVisit.isEmpty) {
    +      visit(waitingForVisit.pop())
    +    }
         parents.toList
       }
     
    +  private def getParentShuffleDependencies(rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]] = {
    +    val parents = new Stack[ShuffleDependency[_, _, _]]
    +    val visited = new HashSet[RDD[_]]
    +    val waitingForVisit = new Stack[RDD[_]]
    +    def visit(r: RDD[_]) {
    --- End diff --
    
    Why define a function here? It seems to be used only once; why not just inline it in the while loop?



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1418#discussion_r15648333
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -275,18 +287,53 @@ class DAGScheduler(
                 case shufDep: ShuffleDependency[_, _, _] =>
                   parents += getShuffleMapStage(shufDep, jobId)
                 case _ =>
    -              visit(dep.rdd)
    +              waitingForVisit.push(dep.rdd)
               }
             }
           }
         }
    -    visit(rdd)
    +    waitingForVisit.push(rdd)
    +    while (!waitingForVisit.isEmpty) {
    +      visit(waitingForVisit.pop())
    +    }
         parents.toList
       }
     
    +  private def getParentShuffleDependencies(rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]] = {
    --- End diff --
    
    Added a new commit with several changes, including the function name and comments. Please review whether it looks fine.



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50855896
  
    Jenkins, test this please



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50916408
  
    QA tests have started for PR 1418. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17687/consoleFull



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50250201
  
    QA tests have started for PR 1418. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17228/consoleFull



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1418#discussion_r15436181
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -275,18 +286,48 @@ class DAGScheduler(
                 case shufDep: ShuffleDependency[_, _, _] =>
                   parents += getShuffleMapStage(shufDep, jobId)
                 case _ =>
    -              visit(dep.rdd)
    +              waitingForVisit.push(dep.rdd)
               }
             }
           }
         }
    -    visit(rdd)
    +    waitingForVisit.push(rdd)
    +    while (!waitingForVisit.isEmpty)
    +      visit(waitingForVisit.pop())
         parents.toList
       }
     
    +  private def getParentShuffleDependencies(rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]] = {
    +    val parents = new Stack[ShuffleDependency[_, _, _]]
    +    val visited = new HashSet[RDD[_]]
    +    val waitingForVisit = new Stack[RDD[_]]
    +    def visit(r: RDD[_]) {
    +      if (!visited(r)) {
    +        visited += r
    +        for (dep <- r.dependencies) {
    +          dep match {
    +            case shufDep: ShuffleDependency[_, _, _] =>
    +              if (!shuffleToMapStage.contains(shufDep.shuffleId))
    +                parents.push(shufDep)
    +
    +              waitingForVisit.push(shufDep.rdd)
    +            case _ =>
    +              waitingForVisit.push(dep.rdd)
    +          }
    +        }
    +      }
    +    }
    +
    +    waitingForVisit.push(rdd)
    +    while (!waitingForVisit.isEmpty)
    --- End diff --
    
    Same here



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50257662
  
    @mateiz Thanks for the suggestion. I've left the PageRank example as-is. The braces were added to comply with the code style.



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-49184593
  
    Another example of this problem is the PageRank example bundled with Spark. For now, since the Java serializer problem still exists, `checkpoint()` needs to be called on the RDD to avoid a StackOverflowError after too many iterations.
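    
    A minimal sketch of that workaround for an iterative job (the checkpoint directory and the every-50-iterations interval here are illustrative assumptions, not taken from the bundled example):
    
        import org.apache.spark.{SparkConf, SparkContext}
    
        val sc = new SparkContext(new SparkConf().setAppName("lineage-demo").setMaster("local"))
        sc.setCheckpointDir("/tmp/spark-checkpoints")  // must be set before checkpoint() is used
    
        var rdd = sc.makeRDD(Array(1))
        for (i <- 1 to 1000) {
          rdd = rdd.coalesce(1).cache()
          if (i % 50 == 0) {
            rdd.checkpoint()      // truncates the dependency chain once materialized
            rdd.foreach(_ => ())  // forces a job so the checkpoint actually happens
          }
        }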



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50856200
  
    QA tests have started for PR 1418. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17661/consoleFull



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50289064
  
    Jenkins, test this please



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-49022775
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1418#discussion_r15563292
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -195,11 +195,21 @@ class DAGScheduler(
         shuffleToMapStage.get(shuffleDep.shuffleId) match {
           case Some(stage) => stage
           case None =>
    +        val parentsWithNoMapStage = getParentShuffleDependencies(shuffleDep.rdd)
    +        while (!parentsWithNoMapStage.isEmpty) {
    +          val currentShufDep = parentsWithNoMapStage.pop()
    +          val stage =
    +            newOrUsedStage(
    +              currentShufDep.rdd, currentShufDep.rdd.partitions.size, currentShufDep, jobId,
    +              currentShufDep.rdd.creationSite)
    +          shuffleToMapStage(currentShufDep.shuffleId) = stage
    +        }
    --- End diff --
    
    Maybe I'm missing it, but this seems like an unrelated change. Is this needed by the non-recursive visit code?



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50915770
  
    Jenkins, test this please



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-49773532
  
    ok to test



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1418#discussion_r15436180
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -275,18 +286,48 @@ class DAGScheduler(
                 case shufDep: ShuffleDependency[_, _, _] =>
                   parents += getShuffleMapStage(shufDep, jobId)
                 case _ =>
    -              visit(dep.rdd)
    +              waitingForVisit.push(dep.rdd)
               }
             }
           }
         }
    -    visit(rdd)
    +    waitingForVisit.push(rdd)
    +    while (!waitingForVisit.isEmpty)
    +      visit(waitingForVisit.pop())
         parents.toList
       }
     
    +  private def getParentShuffleDependencies(rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]] = {
    +    val parents = new Stack[ShuffleDependency[_, _, _]]
    +    val visited = new HashSet[RDD[_]]
    +    val waitingForVisit = new Stack[RDD[_]]
    +    def visit(r: RDD[_]) {
    +      if (!visited(r)) {
    +        visited += r
    +        for (dep <- r.dependencies) {
    +          dep match {
    +            case shufDep: ShuffleDependency[_, _, _] =>
    +              if (!shuffleToMapStage.contains(shufDep.shuffleId))
    --- End diff --
    
    This should have braces around the if statement's body



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1418#discussion_r15592346
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -195,11 +195,21 @@ class DAGScheduler(
         shuffleToMapStage.get(shuffleDep.shuffleId) match {
           case Some(stage) => stage
           case None =>
    +        val parentsWithNoMapStage = getParentShuffleDependencies(shuffleDep.rdd)
    +        while (!parentsWithNoMapStage.isEmpty) {
    +          val currentShufDep = parentsWithNoMapStage.pop()
    +          val stage =
    +            newOrUsedStage(
    +              currentShufDep.rdd, currentShufDep.rdd.partitions.size, currentShufDep, jobId,
    +              currentShufDep.rdd.creationSite)
    +          shuffleToMapStage(currentShufDep.shuffleId) = stage
    +        }
    --- End diff --
    
    Yes. There is a recursive call chain: `getShuffleMapStage` calls `newOrUsedStage`, which calls `newStage`, which calls `getParentStages`, which calls `getShuffleMapStage` again. That chain is what fills `shuffleToMapStage` for the parent map stages. Here I use an iterative approach instead.



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50225517
  
    Thanks for commenting. How is the review coming along?



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by viirya <gi...@git.apache.org>.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1418#discussion_r15590707
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -275,18 +286,51 @@ class DAGScheduler(
                 case shufDep: ShuffleDependency[_, _, _] =>
                   parents += getShuffleMapStage(shufDep, jobId)
                 case _ =>
    -              visit(dep.rdd)
    +              waitingForVisit.push(dep.rdd)
               }
             }
           }
         }
    -    visit(rdd)
    +    waitingForVisit.push(rdd)
    +    while (!waitingForVisit.isEmpty) {
    +      visit(waitingForVisit.pop())
    +    }
         parents.toList
       }
     
    +  private def getParentShuffleDependencies(rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]] = {
    +    val parents = new Stack[ShuffleDependency[_, _, _]]
    +    val visited = new HashSet[RDD[_]]
    +    val waitingForVisit = new Stack[RDD[_]]
    +    def visit(r: RDD[_]) {
    --- End diff --
    
    I kept the code in the function `getParentShuffleDependencies` because it has multiple levels of indentation, so putting it under the `case` statement would not be readable. I can inline it if that is an issue.



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50428405
  
    Jenkins, retest this please.



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50859287
  
    QA results for PR 1418:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17661/consoleFull



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1418



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50431125
  
    QA results for PR 1418:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes
    
    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17331/consoleFull



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1418#discussion_r15436168
  
    --- Diff: examples/src/main/scala/org/apache/spark/examples/SparkPageRank.scala ---
    @@ -51,6 +55,11 @@ object SparkPageRank {
             urls.map(url => (url, rank / size))
           }
           ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    +      if (i % 50 == 0) {
    +        ranks.cache()
    +        ranks.checkpoint()
    +        ranks.collect()
    --- End diff --
    
    Don't do a collect; if you want to force it to be computed, just do a `foreach`. Otherwise `collect` will try to bring all the data back to the driver.
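    
    For example (a hedged sketch; `ranks` here is a stand-in RDD, not the one in SparkPageRank):
    
        import org.apache.spark.{SparkConf, SparkContext}
    
        val sc = new SparkContext(new SparkConf().setAppName("force-eval").setMaster("local"))
        val ranks = sc.parallelize(1 to 1000000).map(i => (i.toString, 1.0 / i))
    
        ranks.foreach(_ => ())  // forces evaluation; results stay on the executors
        // ranks.collect()      // would copy the entire RDD into the driver's memory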



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50250242
  
    BTW the Jenkins failure is due to a code style issue: an if block without braces
    
    Jenkins, this is ok to test



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-49540416
  
    Actually it's late. I will review this tomorrow.



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-50250175
  
    Jenkins, test this please



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1418#discussion_r15609799
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -195,11 +195,21 @@ class DAGScheduler(
         shuffleToMapStage.get(shuffleDep.shuffleId) match {
           case Some(stage) => stage
           case None =>
    +        val parentsWithNoMapStage = getParentShuffleDependencies(shuffleDep.rdd)
    +        while (!parentsWithNoMapStage.isEmpty) {
    +          val currentShufDep = parentsWithNoMapStage.pop()
    +          val stage =
    +            newOrUsedStage(
    +              currentShufDep.rdd, currentShufDep.rdd.partitions.size, currentShufDep, jobId,
    +              currentShufDep.rdd.creationSite)
    +          shuffleToMapStage(currentShufDep.shuffleId) = stage
    +        }
    --- End diff --
    
    Ah, I see it. Please add a comment on top that explains this.



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1418#discussion_r15563334
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -265,6 +275,7 @@ class DAGScheduler(
       private def getParentStages(rdd: RDD[_], jobId: Int): List[Stage] = {
         val parents = new HashSet[Stage]
         val visited = new HashSet[RDD[_]]
    +    val waitingForVisit = new Stack[RDD[_]]
    --- End diff --
    
    (Same on the other methods that use a stack)



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/1418#issuecomment-49540404
  
    Thanks for submitting this. I think we can still hit a stack overflow in serialization, but I agree it's better to do this non-recursively.



[GitHub] spark pull request: [SPARK-2490] Change recursive visiting on RDD ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1418#discussion_r15436179
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -275,18 +286,48 @@ class DAGScheduler(
                 case shufDep: ShuffleDependency[_, _, _] =>
                   parents += getShuffleMapStage(shufDep, jobId)
                 case _ =>
    -              visit(dep.rdd)
    +              waitingForVisit.push(dep.rdd)
               }
             }
           }
         }
    -    visit(rdd)
    +    waitingForVisit.push(rdd)
    +    while (!waitingForVisit.isEmpty)
    --- End diff --
    
    This should have braces around the loop body

