You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2014/09/10 06:07:16 UTC

[GitHub] spark pull request: [SPARK-3160] [mllib] DecisionTree: eliminate p...

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/2341

    [SPARK-3160] [mllib]  DecisionTree: eliminate pre-allocated nodes, parentImpurities arrays

    This PR includes some code simplifications and re-organization which will be helpful for implementing random forests.  The main changes are that the nodes and parentImpurities arrays are no longer pre-allocated in the main train() method.
    
    Relation to RFs:
    * Since RFs will be deeper and will therefore be more likely sparse (not full trees), it could be a cost savings to avoid pre-allocating a full tree.
    * The associated re-organization also reduces bookkeeping, which will make RFs easier to implement.
    * The return code doneTraining may be generalized to include cases such as nodes ready for local training.
    
    Details:
    
    No longer pre-allocate parentImpurities array in main train() method.
    * parentImpurities values are now stored in individual nodes (in Node.stats.impurity).
    * These were not really needed.  They were used in calculateGainForSplit(), but they can be calculated anyways using parentNodeAgg.
    
    No longer using Node.build since tree structure is constructed on-the-fly.
    * Did not eliminate since it is public (Developer) API.  Marked as deprecated.
    
    Eliminated pre-allocated nodes array in main train() method.
    * Nodes are constructed and added to the tree structure as needed during training.
    * Moved tree construction from main train() method into findBestSplitsPerGroup() since there is no need to keep the (split, gain) array for an entire level of nodes.  Only one element of that array is needed at a time, so we do not the array.
    
    findBestSplits() now returns 2 items:
    * rootNode (newly created root node on first iteration, same root node on later iterations)
    * doneTraining (indicating if all nodes at that level were leafs)
    
    Updated DecisionTreeSuite.  Notes:
    * Improved test "Second level node building with vs. without groups"
    ** generateOrderedLabeledPoints() modified so that it really does require 2 levels of internal nodes.
    * Related update: Added Node.deepCopy (private[tree]), used for test suite
    
    CC: @mengxr

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark dt-spark-3160

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2341.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2341
    
----
commit 2ab763b2ca1bbc8977777ab898b28965dce5a8a3
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-09-09T17:42:46Z

    Simplifications to DecisionTree code:
    
    No longer pre-allocate parentImpurities array in main train() method.
    * parentImpurities values are now stored in individual nodes (in Node.stats.impurity).
    
    No longer using Node.build since tree structure is constructed on-the-fly.
    * Did not eliminate since it is public (Developer) API.
    
    Also: Updated DecisionTreeSuite test "Second level node building with vs. without groups"
    * generateOrderedLabeledPoints() modified so that it really does require 2 levels of internal nodes.

commit 1a8f0add470e4ed53100ce6cf344e24448a0ba42
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-09-10T02:34:55Z

    Eliminated pre-allocated nodes array in main train() method.
    * Nodes are constructed and added to the tree structure as needed during training.
    
    Moved tree construction from main train() method into findBestSplitsPerGroup() since there is no need to keep the (split, gain) array for an entire level of nodes.  Only one element of that array is needed at a time, so we do not the array.
    
    findBestSplits() now returns 2 items:
    * rootNode (newly created root node on first iteration, same root node on later iterations)
    * doneTraining (indicating if all nodes at that level were leafs)
    
    Also:
    * Added Node.deepCopy (private[tree]), used for test suite
    * Updated test suite (same functionality)

commit d4dbb99a50418e0168d85db457458d8d96edc242
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-09-10T02:35:06Z

    Merge remote-tracking branch 'upstream/master' into dt-spark-3160

commit d4d786407a9bb5fce14dd7999097b21d6fa1cf5e
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-09-10T02:45:30Z

    Marked Node.build as deprecated

commit eaa1dcf6a46501779ae58c746e672583d10ff6c8
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-09-10T02:58:27Z

    Added topNode doc in DecisionTree and scalastyle fix

commit 306120fc93021f3d2d86333c77296fe3d36b76e1
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-09-10T03:09:02Z

    Fixed typo in DecisionTreeModel.scala doc

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [mllib] DecisionTree: eliminate p...

Posted by chouqin <gi...@git.apache.org>.

Github user chouqin commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55216875
  
    @jkbradley Thanks for your nice work, I have read your code and just have one question:
    
    Can we allocate a root node before the loop in `train()`, and allocate left and child node for the next level after choosing split, then `DecisionTree.findBestSplits` can just return `doneTraining`. During the choose splits step which iterate all the nodes, it just set fields of current node, and don't allocate a new node. This seems to be more easier to understand for me, because it handle all levels in a same way.
    
    What's more, I think after choosing a best split and allocate left and right child nodes, we can set impurity of left and right child, which can avoid recomputation of impurity in `calculateGainForSplit`, this saving may be useless when (number of splits * number of features * number of classes) is small though.
    
    This is just a suggestion, ignore this if you don't think it helps :).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [mllib] DecisionTree: eliminate p...

Posted by chouqin <gi...@git.apache.org>.

Github user chouqin commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55346692
  
    Sounds reasonable to me, go ahead with random forests first please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55351994
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/76/consoleFull) for   PR 2341 at commit [`07dd1ee`](https://github.com/apache/spark/commit/07dd1ee166c7054f9277662fa01defd8786798a2).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2341


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [mllib] DecisionTree: eliminate p...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55068847
  
    I will wait until [https://github.com/apache/spark/pull/2332] is merged, and then will update this with the merge.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2341#discussion_r17509255
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -120,81 +114,35 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo
          * beforehand and is not used in later levels.
          */
     
    +    var topNode: Node = null // set on first iteration
         var level = 0
         var break = false
         while (level <= maxDepth && !break) {
    -
           logDebug("#####################################")
           logDebug("level = " + level)
           logDebug("#####################################")
     
           // Find best split for all nodes at a level.
           timer.start("findBestSplits")
    -      val splitsStatsForLevel: Array[(Split, InformationGainStats, Predict)] =
    -        DecisionTree.findBestSplits(treeInput, parentImpurities,
    -          metadata, level, nodes, splits, bins, maxLevelForSingleGroup, timer)
    +      val (tmpTopNode: Node, doneTraining: Boolean) = DecisionTree.findBestSplits(treeInput,
    +        metadata, level, topNode, splits, bins, maxLevelForSingleGroup, timer)
           timer.stop("findBestSplits")
     
    -      val levelNodeIndexOffset = Node.startIndexInLevel(level)
    -      for ((nodeSplitStats, index) <- splitsStatsForLevel.view.zipWithIndex) {
    -        val nodeIndex = levelNodeIndexOffset + index
    -
    -        // Extract info for this node (index) at the current level.
    -        timer.start("extractNodeInfo")
    -        val split = nodeSplitStats._1
    -        val stats = nodeSplitStats._2
    -        val predict = nodeSplitStats._3.predict
    -        val isLeaf = (stats.gain <= 0) || (level == strategy.maxDepth)
    -        val node = new Node(nodeIndex, predict, isLeaf, Some(split), None, None, Some(stats))
    -        logDebug("Node = " + node)
    -        nodes(nodeIndex) = node
    -        timer.stop("extractNodeInfo")
    -
    -        if (level != 0) {
    -          // Set parent.
    -          val parentNodeIndex = Node.parentIndex(nodeIndex)
    -          if (Node.isLeftChild(nodeIndex)) {
    -            nodes(parentNodeIndex).leftNode = Some(nodes(nodeIndex))
    -          } else {
    -            nodes(parentNodeIndex).rightNode = Some(nodes(nodeIndex))
    -          }
    -        }
    -        // Extract info for nodes at the next lower level.
    -        timer.start("extractInfoForLowerLevels")
    -        if (level < maxDepth) {
    -          val leftChildIndex = Node.leftChildIndex(nodeIndex)
    -          val leftImpurity = stats.leftImpurity
    -          logDebug("leftChildIndex = " + leftChildIndex + ", impurity = " + leftImpurity)
    -          parentImpurities(leftChildIndex) = leftImpurity
    -
    -          val rightChildIndex = Node.rightChildIndex(nodeIndex)
    -          val rightImpurity = stats.rightImpurity
    -          logDebug("rightChildIndex = " + rightChildIndex + ", impurity = " + rightImpurity)
    -          parentImpurities(rightChildIndex) = rightImpurity
    -        }
    -        timer.stop("extractInfoForLowerLevels")
    -        logDebug("final best split = " + split)
    +      if (level == 0) {
    +        topNode = tmpTopNode
           }
    -      require(Node.maxNodesInLevel(level) == splitsStatsForLevel.length)
    -      // Check whether all the nodes at the current level at leaves.
    -      val allLeaf = splitsStatsForLevel.forall(_._2.gain <= 0)
    -      logDebug("all leaf = " + allLeaf)
    -      if (allLeaf) {
    -        break = true // no more tree construction
    -      } else {
    -        level += 1
    +      if (doneTraining) {
    +        break = true
    --- End diff --
    
    There still needs to be a temp value since I can't write:
    var topNode
    var doneTraining
    (topNode, doneTraining) = findBestSplits(...)
    I believe the LHS of the above line needs to be newly declared vals.  Is there a way around that?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55375636
  
    LGTM except minor inline comments. I'm merging this in and could you make the changes with your next update? Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [mllib] DecisionTree: eliminate p...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55222206
  
    @chouqin  Thanks for looking at the PR!  I wanted to allocate a root node beforehand, but the problem is that the member data in Node is not all mutable.  Let me know, though, if you see a way around it.
    
    Caching the impurity sounds good; I'll try to incorporate that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [mllib] DecisionTree: eliminate p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55213199
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20130/consoleFull) for   PR 2341 at commit [`5c4ac33`](https://github.com/apache/spark/commit/5c4ac3303fcf94bb5cbbc272013a88ff8c4e7749).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [mllib] DecisionTree: eliminate p...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55131403
  
    Failure is unrelated to this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2341#discussion_r17466105
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -120,81 +114,35 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo
          * beforehand and is not used in later levels.
          */
     
    +    var topNode: Node = null // set on first iteration
         var level = 0
         var break = false
         while (level <= maxDepth && !break) {
    -
           logDebug("#####################################")
           logDebug("level = " + level)
           logDebug("#####################################")
     
           // Find best split for all nodes at a level.
           timer.start("findBestSplits")
    -      val splitsStatsForLevel: Array[(Split, InformationGainStats, Predict)] =
    -        DecisionTree.findBestSplits(treeInput, parentImpurities,
    -          metadata, level, nodes, splits, bins, maxLevelForSingleGroup, timer)
    +      val (tmpTopNode: Node, doneTraining: Boolean) = DecisionTree.findBestSplits(treeInput,
    +        metadata, level, topNode, splits, bins, maxLevelForSingleGroup, timer)
           timer.stop("findBestSplits")
     
    -      val levelNodeIndexOffset = Node.startIndexInLevel(level)
    -      for ((nodeSplitStats, index) <- splitsStatsForLevel.view.zipWithIndex) {
    -        val nodeIndex = levelNodeIndexOffset + index
    -
    -        // Extract info for this node (index) at the current level.
    -        timer.start("extractNodeInfo")
    -        val split = nodeSplitStats._1
    -        val stats = nodeSplitStats._2
    -        val predict = nodeSplitStats._3.predict
    -        val isLeaf = (stats.gain <= 0) || (level == strategy.maxDepth)
    -        val node = new Node(nodeIndex, predict, isLeaf, Some(split), None, None, Some(stats))
    -        logDebug("Node = " + node)
    -        nodes(nodeIndex) = node
    -        timer.stop("extractNodeInfo")
    -
    -        if (level != 0) {
    -          // Set parent.
    -          val parentNodeIndex = Node.parentIndex(nodeIndex)
    -          if (Node.isLeftChild(nodeIndex)) {
    -            nodes(parentNodeIndex).leftNode = Some(nodes(nodeIndex))
    -          } else {
    -            nodes(parentNodeIndex).rightNode = Some(nodes(nodeIndex))
    -          }
    -        }
    -        // Extract info for nodes at the next lower level.
    -        timer.start("extractInfoForLowerLevels")
    -        if (level < maxDepth) {
    -          val leftChildIndex = Node.leftChildIndex(nodeIndex)
    -          val leftImpurity = stats.leftImpurity
    -          logDebug("leftChildIndex = " + leftChildIndex + ", impurity = " + leftImpurity)
    -          parentImpurities(leftChildIndex) = leftImpurity
    -
    -          val rightChildIndex = Node.rightChildIndex(nodeIndex)
    -          val rightImpurity = stats.rightImpurity
    -          logDebug("rightChildIndex = " + rightChildIndex + ", impurity = " + rightImpurity)
    -          parentImpurities(rightChildIndex) = rightImpurity
    -        }
    -        timer.stop("extractInfoForLowerLevels")
    -        logDebug("final best split = " + split)
    +      if (level == 0) {
    +        topNode = tmpTopNode
           }
    -      require(Node.maxNodesInLevel(level) == splitsStatsForLevel.length)
    -      // Check whether all the nodes at the current level at leaves.
    -      val allLeaf = splitsStatsForLevel.forall(_._2.gain <= 0)
    -      logDebug("all leaf = " + allLeaf)
    -      if (allLeaf) {
    -        break = true // no more tree construction
    -      } else {
    -        level += 1
    +      if (doneTraining) {
    +        break = true
    --- End diff --
    
    Shall we remove `break` and only use `doneTraining`? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55355197
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/76/consoleFull) for   PR 2341 at commit [`07dd1ee`](https://github.com/apache/spark/commit/07dd1ee166c7054f9277662fa01defd8786798a2).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `case class CreateTableAsSelect(`
      * `case class CreateTableAsSelect(`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [mllib] DecisionTree: eliminate p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55073529
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20078/consoleFull) for   PR 2341 at commit [`306120f`](https://github.com/apache/spark/commit/306120fc93021f3d2d86333c77296fe3d36b76e1).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [mllib] DecisionTree: eliminate p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55214353
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/52/consoleFull) for   PR 2341 at commit [`306120f`](https://github.com/apache/spark/commit/306120fc93021f3d2d86333c77296fe3d36b76e1).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [mllib] DecisionTree: eliminate p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55210724
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/52/consoleFull) for   PR 2341 at commit [`306120f`](https://github.com/apache/spark/commit/306120fc93021f3d2d86333c77296fe3d36b76e1).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [mllib] DecisionTree: eliminate p...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55289359
  
    Actually, trying to treat all levels equally sounds like it might fit well with this JIRA [https://issues.apache.org/jira/browse/SPARK-3158], so I might delay until then.  It also might make sense to cache the impurity in the nodes allocated for the next level.  I will update that JIRA with these to-do items and postpone these updates.  Currently, I would like to prioritize random forests [https://issues.apache.org/jira/browse/SPARK-1545], and later on follow up with these optimizations.  Does that sound reasonable?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55351898
  
    I just pushed 2 small (but important) bug fixes onto this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [mllib] DecisionTree: eliminate p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55070870
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20078/consoleFull) for   PR 2341 at commit [`306120f`](https://github.com/apache/spark/commit/306120fc93021f3d2d86333c77296fe3d36b76e1).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [mllib] DecisionTree: eliminate p...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55216852
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20130/consoleFull) for   PR 2341 at commit [`5c4ac33`](https://github.com/apache/spark/commit/5c4ac3303fcf94bb5cbbc272013a88ff8c4e7749).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [mllib] DecisionTree: eliminate p...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55287795
  
    I hesitate to change a public API, but I agree it makes more sense.  I'll make that change since it's just a Developer API.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [mllib] DecisionTree: eliminate p...

Posted by chouqin <gi...@git.apache.org>.

Github user chouqin commented on the pull request:

    https://github.com/apache/spark/pull/2341#issuecomment-55223947
  
    Can we change the fields from `val` to `var`? `leftNode` and `rightNode` are `var`s, I wonder if we can change other fields too?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2341#discussion_r17465271
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -87,17 +87,11 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo
         val maxDepth = strategy.maxDepth
         require(maxDepth <= 30,
           s"DecisionTree currently only supports maxDepth <= 30, but was given maxDepth = $maxDepth.")
    -    // Number of nodes to allocate: max number of nodes possible given the depth of the tree, plus 1
    -    val maxNumNodesPlus1 = Node.startIndexInLevel(maxDepth + 1)
    -    // Initialize an array to hold parent impurity calculations for each node.
    -    val parentImpurities = new Array[Double](maxNumNodesPlus1)
    -    // dummy value for top node (updated during first split calculation)
    -    val nodes = new Array[Node](maxNumNodesPlus1)
     
         // Calculate level for single group construction
     
         // Max memory usage for aggregates
    -    val maxMemoryUsage = strategy.maxMemoryInMB * 1024 * 1024
    +    val maxMemoryUsage = strategy.maxMemoryInMB * 1024L * 1024L
    --- End diff --
    
    It is also useful to set an upper bound here (e.g., 1GB) to avoid memory/GC problems on the driver.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2341#discussion_r17466108
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -435,18 +385,18 @@ object DecisionTree extends Serializable with Logging {
           // numGroups is equal to 2 at level 11 and 4 at level 12, respectively.
           val numGroups = 1 << level - maxLevelForSingleGroup
           logDebug("numGroups = " + numGroups)
    -      var bestSplits = new Array[(Split, InformationGainStats, Predict)](0)
           // Iterate over each group of nodes at a level.
           var groupIndex = 0
    +      var doneTraining = true
           while (groupIndex < numGroups) {
    -        val bestSplitsForGroup = findBestSplitsPerGroup(input, parentImpurities, metadata, level,
    -          nodes, splits, bins, timer, numGroups, groupIndex)
    -        bestSplits = Array.concat(bestSplits, bestSplitsForGroup)
    +        val (tmpRoot, doneTrainingGroup) = findBestSplitsPerGroup(input, metadata, level,
    --- End diff --
    
    `tmpRoot` -> `_` (since it is not used after)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-3160] [SPARK-3494] [mllib] DecisionTree...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2341#discussion_r17509172
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -87,17 +87,11 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo
         val maxDepth = strategy.maxDepth
         require(maxDepth <= 30,
           s"DecisionTree currently only supports maxDepth <= 30, but was given maxDepth = $maxDepth.")
    -    // Number of nodes to allocate: max number of nodes possible given the depth of the tree, plus 1
    -    val maxNumNodesPlus1 = Node.startIndexInLevel(maxDepth + 1)
    -    // Initialize an array to hold parent impurity calculations for each node.
    -    val parentImpurities = new Array[Double](maxNumNodesPlus1)
    -    // dummy value for top node (updated during first split calculation)
    -    val nodes = new Array[Node](maxNumNodesPlus1)
     
         // Calculate level for single group construction
     
         // Max memory usage for aggregates
    -    val maxMemoryUsage = strategy.maxMemoryInMB * 1024 * 1024
    +    val maxMemoryUsage = strategy.maxMemoryInMB * 1024L * 1024L
    --- End diff --
    
    I'll go for 10GB since 1GB memory is not that large.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org