Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2014/08/26 05:41:57 UTC

[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/2125

    [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]  DecisionTree aggregation improvements

    Summary:
    (1) Variable numBins for each feature [SPARK-3043]
    (2) Reduced data reshaping in aggregation [SPARK-3043]
    (3) Choose ordering for ordered categorical features adaptively [SPARK-3156]
    (4) Changed nodes to use 1-indexing [SPARK-3086]
    (5) Small clean-ups
    
    Note: This PR looks bigger than it is since I moved several functions from inside findBestSplitsPerGroup to outside of it (to make it clear what was being serialized in the aggregation).
    
    Speedups: This update helps most when many features use few bins but a few features use many bins.  Some example results on speedups with 2M examples, 3.5K features (15-worker EC2 cluster):
    * Example where old code was reasonably efficient (1/2 continuous, 1/4 binary, 1/4 20-category): 164.813 --> 116.491 sec
    * Example where old code wasted many bins (1/10 continuous, 81/100 binary, 9/100 20-category): 128.701 --> 39.334 sec
    
    Details:
    
    (1) Variable numBins for each feature [SPARK-3043]
    
    DecisionTreeMetadata now computes a variable numBins for each feature.  It also tracks numSplits.
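    
    As an illustrative sketch (not the exact DecisionTreeMetadata code; the name numBinsForFeature is hypothetical), the per-feature bin counts follow the scheme quoted in a commit message below, where arity is the number of categories of a categorical feature:
    
    ```scala
    // Illustrative only: how numBins can vary per feature.
    def numBinsForFeature(maxBins: Int, arityOpt: Option[Int], unordered: Boolean): Int =
      arityOpt match {
        case Some(arity) if unordered => 2 * ((1 << (arity - 1)) - 1) // 2 bins per candidate split
        case Some(arity)              => arity   // ordered categorical: one bin per category
        case None                     => maxBins // continuous: use the configured maximum
      }
    ```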
    
    (2) Reduced data reshaping in aggregation [SPARK-3043]
    
    Added DTStatsAggregator, a wrapper around the aggregate statistics array for easy but efficient indexing.
    * Added ImpurityAggregator and ImpurityCalculator classes, to make DecisionTree code more oblivious to the type of impurity.
    
    The aggregate statistics are never reshaped, and cumulative sums are computed in-place.
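    
    For intuition, the flat indexing used by DTStatsAggregator (its full source is quoted in a review comment further down this thread) maps a (node, feature, bin) triple straight to an offset, so no reshaping is needed; a minimal sketch, with nodeStride, featureOffsets, and statsSize as defined in that class:
    
    ```scala
    // Sketch of DTStatsAggregator's flat indexing (see the class doc below):
    //   index = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize
    def statsOffset(nodeIndex: Int, featureIndex: Int, binIndex: Int,
        nodeStride: Int, featureOffsets: Array[Int], statsSize: Int): Int =
      nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize
    ```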
    
    Updated the layout of aggregation functions.  The update simplifies things by (1) dividing features into ordered/unordered (instead of ordered/unordered/continuous) and (2) making use of the DTStatsAggregator for indexing.
    For this update, the following functions were refactored:
    * updateBinForOrderedFeature
    * updateBinForUnorderedFeature
    * binaryOrNotCategoricalBinSeqOp
    * multiclassWithCategoricalBinSeqOp
    * regressionBinSeqOp
    The above 5 functions were replaced with:
    * orderedBinSeqOp
    * someUnorderedBinSeqOp
    
    Other changes:
    * calculateGainForSplit now treats all feature types the same way.
    * Eliminated extractLeftRightNodeAggregates.
    
    (3) Choose ordering for ordered categorical features adaptively [SPARK-3156]
    
    Updated binsToBestSplit():
    * This now computes cumulative sums of stats for ordered features (see the sketch at the end of this section).
    * For ordered categorical features, it chooses an ordering for categories. (This used to be done by findSplitsBins.)
    * Uses iterators to shorten code and avoid building an Array[Array[InformationGainStats]].
    
    Side effects:
    * In findSplitsBins: A sample of the data is only taken for data with continuous features.  It is not needed for data with only categorical features.
    * In findSplitsBins: splits and bins are no longer pre-computed for ordered categorical features since they are not needed.
    * TreePoint binning is simpler for categorical features.
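    
    A minimal sketch of the cumulative-sum idea (not the actual binsToBestSplit code; each bin's statistics are collapsed to a single Double for brevity):
    
    ```scala
    // After in-place prefix sums, bin i holds the stats of bins 0..i, so for split i:
    //   left stats  = binStats(i)
    //   right stats = binStats(numBins - 1) - binStats(i)
    def leftRightStats(binStats: Array[Double]): Seq[(Double, Double)] = {
      var i = 1
      while (i < binStats.length) { binStats(i) += binStats(i - 1); i += 1 } // prefix sums
      val total = binStats.last
      Range(0, binStats.length - 1).map(s => (binStats(s), total - binStats(s)))
    }
    ```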
    
    (4) Changed nodes to use 1-indexing [SPARK-3086]
    
    Nodes used to be indexed from 0.  Now they are indexed from 1.
    Node indexing functions are now collected in object Node (Node.scala).
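    
    For reference, these are the index relations from the new object Node (quoted in full in a review comment below); with the root at index 1, navigation is simple bit arithmetic:
    
    ```scala
    def leftChildIndex(nodeIndex: Int): Int = nodeIndex * 2
    def rightChildIndex(nodeIndex: Int): Int = nodeIndex * 2 + 1
    def parentIndex(nodeIndex: Int): Int = nodeIndex >> 1
    // Level of a node = position of its highest set bit (the root is level 0).
    def indexToLevel(nodeIndex: Int): Int =
      java.lang.Integer.numberOfTrailingZeros(java.lang.Integer.highestOneBit(nodeIndex))
    ```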
    
    (5) Small clean-ups
    
    Eliminated functions extractNodeInfo() and extractInfoForLowerLevels() to reduce duplicate code.
    Eliminated InvalidBinIndex since it is no longer used.
    
    CC: @mengxr  @manishamde  Please let me know if you have thoughts on this—thanks!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark dt-opt3alt

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2125.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2125
    
----
commit a95bc22e648d01158d3a4fd597059135e1302266
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-05T18:17:28Z

    timing for DecisionTree internals

commit 511ec85fbe4c4463d8e600fabc5d54c5b2bd8417
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-06T01:16:19Z

    Merge remote-tracking branch 'upstream/master' into dt-timing

commit bcf874a7444303ac7dc14cc5a36890cec45a8359
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-07T21:53:22Z

    Merge remote-tracking branch 'upstream/master' into dt-timing
    
    Conflicts:
    	mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

commit f61e9d227233679ab826e38210376e7050da9b6b
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-08T07:35:06Z

    Merge remote-tracking branch 'upstream/master' into dt-timing

commit 3211f027c1a41f8eaa4eea4e90073216a8474c4e
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-08T16:46:12Z

    Optimizing DecisionTree
    * Added TreePoint representation to avoid calling findBin multiple times.
    * (not working yet, but debugging)

commit 0f676e2e0ae02e54387a255ac9f64d3c7265d152
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-08T21:12:52Z

    Optimizations + Bug fix for DecisionTree
    
    Optimization: Added TreePoint representation so we only call findBin once per (example, feature) pair.
    
    Also, calculateGainsForAllNodeSplits now only searches over actual splits, not empty/unused ones.
    
    BUG FIX: isSampleValid
    * isSampleValid used to treat unordered categorical features incorrectly: It treated the bins as if indexed by feature values, rather than by subsets of values/categories.
    * exhibited for unordered features (multi-class classification with categorical features of low arity)
    * Fix: Index bins correctly for unordered categorical features.
    
    Also: some commented-out debugging println calls in DecisionTree, to be removed later

commit b2ed1f39ecc967a663a88241b46e5786eb66be22
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-08T21:15:44Z

    Merge remote-tracking branch 'upstream/master' into dt-opt

commit b914f3b7ed94e897b55f28c772f48a7d6fba7f06
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-09T19:01:45Z

    DecisionTree optimization: eliminated filters + small changes
    
    DecisionTree.scala
    * Eliminated filters, replaced by building tree on the fly and filtering top-down.
    ** Aggregation over examples now skips examples which do not reach the current level.
    * Only calculate unorderedFeatures once (in findSplitsBins)
    
    Node: Renamed predictIfLeaf to predict
    
    Bin, Split: Updated doc

commit c1565a5248e5d0ccc2293315799281030a74c217
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-11T18:09:32Z

    Small DecisionTree updates:
    * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair.
    * Internal doc: findAggForOrderedFeatureClassification

commit fd653725dff2ad1de2aaf7eac0b06bbeee8d1129
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-13T04:13:09Z

    Major changes:
    * Created ImpurityAggregator classes, rather than old aggregates.
    * Feature split/bin semantics are based on ordered vs. unordered
    ** E.g.: numSplits = numBins for all unordered features, and numSplits = numBins - 1 for all ordered features.
    * numBins can differ for each feature
    
    DecisionTree
    * Major changes based on new aggregator setup
    ** For ordered features, aggregate is indexed by: (nodeIndex)(featureIndex)(binIndex).
    ** For unordered features, aggregate is indexed by: (nodeIndex)(featureIndex)(2 * binIndex),
    * Added LearningMetadata class
    * Eliminated now-unused functions:
    ** extractNodeInfo
    ** getNumSplitsForFeature
    ** getBinDataForNode (Eliminated since it merely slices/reshapes data.)
    
    ImpurityAggregator classes
    * Changed main aggregate operation to create binAggregates (binSeqOp, binCompOp) to use the aggregator.
    * Before, for unordered features, the left/right bins were treated as a separate dimension for aggregates.  They are now part of the bins: binAggregates is of size: (numNodes, numBins_f, numFeatures) where numBins_f is:
    ** 2 * [pow(2, maxFeatureValue - 1) - 1] for unordered categorical features
    ** maxFeatureValue for ordered categorical features
    ** maxBins for continuous features
    
    DecisionTreeSuite
    * For tests using unordered (low-arity) features, removed checks of Bin.category, which only has meaning for ordered features.

commit 51ef7813d9fb1c98457f015e1aa7dca92816750a
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-13T08:26:21Z

    Fixed bug introduced by last commit: Variance impurity calculation was incorrect since counts were swapped accidentally

commit e3c84ccf06f58fce235fb387c7fd0b432103e5a1
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-14T00:49:25Z

    Added stuff for mnist8m to DT Runner

commit 86e217fb454e2834f92ecfbebd33419c886fe944
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-14T00:58:52Z

    added cache to DT input

commit 438a66018775dc928644d32e833aecd6c2265109
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-14T01:17:16Z

    removed subsampling for mnist8m from DT

commit dd4d3aa65e796d5b6ac36e36c0172cc90ad4ae15
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-14T18:27:34Z

    Mid-process in bug fix: bug for binary classification with categorical features
    * Bug: Categorical features were all treated as ordered for binary classification.  This is possible but would require the bin ordering to be determined on-the-fly after the aggregation.  Currently, the ordering is determined a priori and fixed for all splits.
    * (Temp) Fix: Treat low-arity categorical features as unordered for binary classification.
    * Related change: I removed most tests for isMulticlass in the code.  I instead test metadata for whether there are unordered features.
    * Status: The bug may be fixed, but more testing needs to be done.
    
    Aggregates: The same binMultiplier (for ordered vs. unordered) was applied to all features.  It is now applied on a per-feature basis.

commit a87e08f1e5999c31b956a34617f88ff9a50775ae
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-14T18:34:12Z

    Merge remote-tracking branch 'upstream/master' into dt-opt1

commit 8464a6efd644daf9954ba43c9790ec304f94e029
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-14T19:26:57Z

    Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serializable

commit e66f1b1cb2252dab1f847f2c24623baab40627fc
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-14T19:58:22Z

    TreePoint
    * Updated doc
    * Made some methods private
    
    Changed timer to report time in seconds.

commit d03608949e19c53596b4f6cc09d9f68011184d68
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-14T20:07:14Z

    Print timing info to logDebug.

commit 430d782294a08f63535e2ecce167703021e1fe44
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-14T23:09:14Z

    Added more debug info on binning error.  Added some docs.

commit 356dabac6bad8b2e2a9f7b90aaae80d987c113dc
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-14T23:57:13Z

    Merge branch 'dt-opt1' into dt-opt2
    
    Conflicts:
    	mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
    	mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala

commit 26d10dd58ee218102bd205c1e6d68fda5a45cf4b
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-15T00:44:08Z

    Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.

commit 5fce6353a818087198307a3932ece044a09e45ab
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-15T16:45:28Z

    Merge branch 'dt-opt2' into dt-opt3
    
    Conflicts:
    	mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
    	mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala
    	mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala

commit 45f7ea7afdbe100b6a489da71805049a9d63995b
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-17T17:17:18Z

    partial merge, not yet done

commit 9c833639ec935a1f372ea0655012259957d8778b
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-18T03:23:17Z

    partial merge but not done yet

commit b3146594390cb4b6edd2b47e56611b7e42dc0f77
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-18T17:25:32Z

    Merge remote-tracking branch 'upstream/master' into dt-opt3

commit 3ba716667460310ef3bd897af9cb624a80e7fa94
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-18T20:25:47Z

    Merge remote-tracking branch 'upstream/master' into dt-opt3

commit 61c45093a9ae73b03e7a6737424101e45c5aa123
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-18T21:02:33Z

    Fixed bugs from merge: missing DT timer call, and numBins setting.  Cleaned up DT Suite some.

commit 5f94342bda903aac294ab76ab5ba89eb3751d3ab
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-18T22:32:04Z

    Added treeAggregate since not yet merged from master.  Moved node indexing functions to Node.

commit 95cad7cb16e659a58985230295f62a438d4a84ab
Author: Joseph K. Bradley <jo...@gmail.com>
Date:   2014-08-18T22:32:45Z

    Merge remote-tracking branch 'upstream/master' into dt-opt3
    
    Conflicts:
    	mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54351159
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19679/consoleFull) for   PR 2125 at commit [`aa4e4df`](https://github.com/apache/spark/commit/aa4e4df1989564f411e8e9e975618f6e715e2683).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54849996
  
    LGTM. Merged into master. Thanks!




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54358281
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19679/consoleFull) for   PR 2125 at commit [`aa4e4df`](https://github.com/apache/spark/commit/aa4e4df1989564f411e8e9e975618f6e715e2683).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `    logDebug("isMulticlass = " + metadata.isMulticlass)`
      * `    logDebug("isMulticlass = " + metadata.isMulticlass)`





[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16865813
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala ---
    @@ -70,32 +83,48 @@ private[tree] object DecisionTreeMetadata {
           case Regression => 0
         }
     
    -    val maxBins = math.min(strategy.maxBins, numExamples).toInt
    -    val log2MaxBinsp1 = math.log(maxBins + 1) / math.log(2.0)
    +    val maxPossibleBins = math.min(strategy.maxBins, numExamples).toInt
    +    val log2MaxPossibleBinsp1 = math.log(maxPossibleBins + 1) / math.log(2.0)
     
    +    // We check the number of bins here against maxPossibleBins.
    +    // This needs to be checked here instead of in Strategy since maxPossibleBins can be modified
    +    // based on the number of training examples.
         val unorderedFeatures = new mutable.HashSet[Int]()
    +    val numBins = Array.fill[Int](numFeatures)(maxPossibleBins)
         if (numClasses > 2) {
           strategy.categoricalFeaturesInfo.foreach { case (f, k) =>
    -        if (k - 1 < log2MaxBinsp1) {
    +        if (k - 1 < log2MaxPossibleBinsp1) {
               // Note: The above check is equivalent to checking:
               //       numUnorderedBins = (1 << k - 1) - 1 < maxBins
               unorderedFeatures.add(f)
    +          numBins(f) = numUnorderedBins(k)
             } else {
    -          // TODO: Allow this case, where we simply will know nothing about some categories?
    -          require(k < maxBins, s"maxBins (= $maxBins) should be greater than max categories " +
    +          require(k <= maxPossibleBins,
    +            s"maxBins (= $maxPossibleBins) should be greater than max categories " +
                 s"in categorical features (>= $k)")
    +          numBins(f) = k
             }
           }
         } else {
           strategy.categoricalFeaturesInfo.foreach { case (f, k) =>
    -        require(k < maxBins, s"maxBins (= $maxBins) should be greater than max categories " +
    -          s"in categorical features (>= $k)")
    +        require(k <= maxPossibleBins,
    +          s"DecisionTree requires maxBins (= $maxPossibleBins) >= max categories " +
    +          s"in categorical features (= ${strategy.categoricalFeaturesInfo.values.max})")
    +        numBins(f) = k
           }
         }
     
    -    new DecisionTreeMetadata(numFeatures, numExamples, numClasses, maxBins,
    -      strategy.categoricalFeaturesInfo, unorderedFeatures.toSet,
    +    new DecisionTreeMetadata(numFeatures, numExamples, numClasses, numBins.max,
    +      strategy.categoricalFeaturesInfo, unorderedFeatures.toSet, numBins,
           strategy.impurity, strategy.quantileCalculationStrategy)
       }
     
    +  /**
    +   * Given the arity of a categorical feature (arity = number of categories),
    +   * return the number of bins for the feature if it is to be treated as an unordered feature.
    +   */
    +  def numUnorderedBins(arity: Int): Int = {
    +    (1 << arity - 1) - 1
    --- End diff --
    
    It might be obvious but a comment explaining the bit shift operations will be helpful.
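    
    For reference, one hedged reading of the expression (Scala's `<<` binds looser than `-`, so it parses as `(1 << (arity - 1)) - 1`):
    
    ```scala
    // (1 << arity - 1) - 1  ==  2^(arity - 1) - 1:
    // the number of ways to partition `arity` categories into two nonempty subsets,
    // i.e., the number of candidate splits for an unordered categorical feature.
    // Example: arity = 4  =>  2^3 - 1 = 7 splits.
    def numUnorderedBins(arity: Int): Int = (1 << (arity - 1)) - 1
    ```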




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16696021
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -122,28 +126,32 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo
           logDebug("level = " + level)
           logDebug("#####################################")
     
    +
    --- End diff --
    
    remove extra empty line




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16867878
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -619,661 +662,258 @@ object DecisionTree extends Serializable with Logging {
           if (level == 0) {
             0
           } else {
    -        val globalNodeIndex = predictNodeIndex(nodes(0), treePoint.binnedFeatures)
    -        // Get index for this (level, group).
    -        globalNodeIndex - levelOffset - groupShift
    -      }
    -    }
    -
    -    /**
    -     * Increment aggregate in location for (node, feature, bin, label).
    -     *
    -     * @param treePoint  Data point being aggregated.
    -     * @param agg  Array storing aggregate calculation, of size:
    -     *             numClasses * numBins * numFeatures * numNodes.
    -     *             Indexed by (node, feature, bin, label) where label is the least significant bit.
    -     * @param nodeIndex  Node corresponding to treePoint. Indexed from 0 at start of (level, group).
    -     */
    -    def updateBinForOrderedFeature(
    -        treePoint: TreePoint,
    -        agg: Array[Double],
    -        nodeIndex: Int,
    -        featureIndex: Int): Unit = {
    -      // Update the left or right count for one bin.
    -      val aggIndex =
    -        numClasses * numBins * numFeatures * nodeIndex +
    -        numClasses * numBins * featureIndex +
    -        numClasses * treePoint.binnedFeatures(featureIndex) +
    -        treePoint.label.toInt
    -      agg(aggIndex) += 1
    -    }
    -
    -    /**
    -     * Increment aggregate in location for (nodeIndex, featureIndex, [bins], label),
    -     * where [bins] ranges over all bins.
    -     * Updates left or right side of aggregate depending on split.
    -     *
    -     * @param nodeIndex  Node corresponding to treePoint. Indexed from 0 at start of (level, group).
    -     * @param treePoint  Data point being aggregated.
    -     * @param agg  Indexed by (left/right, node, feature, bin, label)
    -     *             where label is the least significant bit.
    -     *             The left/right specifier is a 0/1 index indicating left/right child info.
    -     * @param rightChildShift Offset for right side of agg.
    -     */
    -    def updateBinForUnorderedFeature(
    -        nodeIndex: Int,
    -        featureIndex: Int,
    -        treePoint: TreePoint,
    -        agg: Array[Double],
    -        rightChildShift: Int): Unit = {
    -      val featureValue = treePoint.binnedFeatures(featureIndex)
    -      // Update the left or right count for one bin.
    -      val aggShift =
    -        numClasses * numBins * numFeatures * nodeIndex +
    -        numClasses * numBins * featureIndex +
    -        treePoint.label.toInt
    -      // Find all matching bins and increment their values
    -      val featureCategories = metadata.featureArity(featureIndex)
    -      val numCategoricalBins = (1 << featureCategories - 1) - 1
    -      var binIndex = 0
    -      while (binIndex < numCategoricalBins) {
    -        val aggIndex = aggShift + binIndex * numClasses
    -        if (bins(featureIndex)(binIndex).highSplit.categories.contains(featureValue)) {
    -          agg(aggIndex) += 1
    -        } else {
    -          agg(rightChildShift + aggIndex) += 1
    -        }
    -        binIndex += 1
    -      }
    -    }
    -
    -    /**
    -     * Helper for binSeqOp.
    -     *
    -     * @param agg  Array storing aggregate calculation, of size:
    -     *             numClasses * numBins * numFeatures * numNodes.
    -     *             Indexed by (node, feature, bin, label) where label is the least significant bit.
    -     * @param treePoint  Data point being aggregated.
    -     * @param nodeIndex  Node corresponding to treePoint. Indexed from 0 at start of (level, group).
    -     */
    -    def binaryOrNotCategoricalBinSeqOp(
    -        agg: Array[Double],
    -        treePoint: TreePoint,
    -        nodeIndex: Int): Unit = {
    -      // Iterate over all features.
    -      var featureIndex = 0
    -      while (featureIndex < numFeatures) {
    -        updateBinForOrderedFeature(treePoint, agg, nodeIndex, featureIndex)
    -        featureIndex += 1
    -      }
    -    }
    -
    -    val rightChildShift = numClasses * numBins * numFeatures * numNodes
    -
    -    /**
    -     * Helper for binSeqOp.
    -     *
    -     * @param agg  Array storing aggregate calculation.
    -     *             For ordered features, this is of size:
    -     *               numClasses * numBins * numFeatures * numNodes.
    -     *             For unordered features, this is of size:
    -     *               2 * numClasses * numBins * numFeatures * numNodes.
    -     * @param treePoint   Data point being aggregated.
    -     * @param nodeIndex  Node corresponding to treePoint. Indexed from 0 at start of (level, group).
    -     */
    -    def multiclassWithCategoricalBinSeqOp(
    -        agg: Array[Double],
    -        treePoint: TreePoint,
    -        nodeIndex: Int): Unit = {
    -      val label = treePoint.label
    -      // Iterate over all features.
    -      var featureIndex = 0
    -      while (featureIndex < numFeatures) {
    -        if (metadata.isUnordered(featureIndex)) {
    -          updateBinForUnorderedFeature(nodeIndex, featureIndex, treePoint, agg, rightChildShift)
    -        } else {
    -          updateBinForOrderedFeature(treePoint, agg, nodeIndex, featureIndex)
    -        }
    -        featureIndex += 1
    -      }
    -    }
    -
    -    /**
    -     * Performs a sequential aggregation over a partition for regression.
    -     * For l nodes, k features,
    -     * the count, sum, sum of squares of one of the p bins is incremented.
    -     *
    -     * @param agg Array storing aggregate calculation, updated by this function.
    -     *            Size: 3 * numBins * numFeatures * numNodes
    -     * @param treePoint   Data point being aggregated.
    -     * @param nodeIndex  Node corresponding to treePoint. Indexed from 0 at start of (level, group).
    -     * @return agg
    -     */
    -    def regressionBinSeqOp(agg: Array[Double], treePoint: TreePoint, nodeIndex: Int): Unit = {
    -      val label = treePoint.label
    -      // Iterate over all features.
    -      var featureIndex = 0
    -      while (featureIndex < numFeatures) {
    -        // Update count, sum, and sum^2 for one bin.
    -        val binIndex = treePoint.binnedFeatures(featureIndex)
    -        val aggIndex =
    -          3 * numBins * numFeatures * nodeIndex +
    -          3 * numBins * featureIndex +
    -          3 * binIndex
    -        agg(aggIndex) += 1
    -        agg(aggIndex + 1) += label
    -        agg(aggIndex + 2) += label * label
    -        featureIndex += 1
    +        val globalNodeIndex =
    +          predictNodeIndex(nodes(1), treePoint.binnedFeatures, bins, metadata.unorderedFeatures)
    +        globalNodeIndex - globalNodeIndexOffset
           }
         }
     
         /**
          * Performs a sequential aggregation over a partition.
    -     * For l nodes, k features,
    -     *   For classification:
    -     *     Either the left count or the right count of one of the bins is
    -     *     incremented based upon whether the feature is classified as 0 or 1.
    -     *   For regression:
    -     *     The count, sum, sum of squares of one of the bins is incremented.
          *
    -     * @param agg Array storing aggregate calculation, updated by this function.
    -     *            Size for classification:
    -     *              numClasses * numBins * numFeatures * numNodes for ordered features, or
    -     *              2 * numClasses * numBins * numFeatures * numNodes for unordered features.
    -     *            Size for regression:
    -     *              3 * numBins * numFeatures * numNodes.
    +     * Each data point contributes to one node. For each feature,
    +     * the aggregate sufficient statistics are updated for the relevant bins.
    +     *
    +     * @param agg  Array storing aggregate calculation, with a set of sufficient statistics for
    +     *             each (node, feature, bin).
          * @param treePoint   Data point being aggregated.
          * @return  agg
          */
    -    def binSeqOp(agg: Array[Double], treePoint: TreePoint): Array[Double] = {
    +    def binSeqOp(
    +        agg: DTStatsAggregator,
    +        treePoint: TreePoint): DTStatsAggregator = {
           val nodeIndex = treePointToNodeIndex(treePoint)
           // If the example does not reach this level, then nodeIndex < 0.
           // If the example reaches this level but is handled in a different group,
           //  then either nodeIndex < 0 (previous group) or nodeIndex >= numNodes (later group).
           if (nodeIndex >= 0 && nodeIndex < numNodes) {
    -        if (metadata.isClassification) {
    -          if (isMulticlassWithCategoricalFeatures) {
    -            multiclassWithCategoricalBinSeqOp(agg, treePoint, nodeIndex)
    -          } else {
    -            binaryOrNotCategoricalBinSeqOp(agg, treePoint, nodeIndex)
    -          }
    +        if (metadata.unorderedFeatures.isEmpty) {
    +          orderedBinSeqOp(agg, treePoint, nodeIndex)
             } else {
    -          regressionBinSeqOp(agg, treePoint, nodeIndex)
    +          someUnorderedBinSeqOp(agg, treePoint, nodeIndex, bins, metadata.unorderedFeatures)
    --- End diff --
    
    ```mixed``` or ```mixedOrderedUnordered``` instead of ```someUnordered```




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54221304
  
    @manishamde   Thank you very much for the review and comments!  About to send a PR with updates which should address everything.  One clarification about the adaptive ordering:
    
    It is actually more accurate to choose a new ordering at every node (and doing so is required for the method to carry guarantees, rather than being a heuristic, for regression and binary classification).  It does mean a different set of splits may be considered at each node, but each split should be tailored specifically for its node and should give better results.
    As far as computation, it does require a sort, but that should be cheap as long as the number of categories for any feature is not too large.  In my tests, much more (10x - 100x) time is spent on the aggregation than on the master, so it is not an issue for categorical features with a smallish number of categories.
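    
    A minimal sketch of the per-node ordering idea (not the PR's exact code; names are hypothetical): order the categories by a per-category centroid, e.g. the mean label, computed from the node's bin aggregates, then search splits along that ordering as if the feature were ordered:
    
    ```scala
    // Returns category indices sorted by centroid (mean label), assuming per-category
    // label sums and counts have already been aggregated for this node.
    def categoryOrdering(labelSums: Array[Double], counts: Array[Double]): Array[Int] = {
      val centroids = labelSums.zip(counts).map { case (s, n) => if (n > 0) s / n else 0.0 }
      centroids.zipWithIndex.sortBy(_._1).map(_._2)
    }
    ```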





[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-53815867
  
    Note (from @mengxr): Need to check maxDepth <= 30 to make sure we don't overflow.
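    
    (Context, as a rough sketch: with 1-based node indices, a tree of depth d uses indices up to 2^(d+1) - 1, and Node.maxNodesInLevel(level) is computed as 1 << level, so at maxDepth = 30 the largest index is 2^31 - 1 = Int.MaxValue and one level deeper the shift overflows Int. A hypothetical form of the check:)
    
    ```scala
    require(maxDepth <= 30,
      s"maxDepth = $maxDepth is too deep: node indices, up to 2^(maxDepth + 1) - 1, would overflow Int.")
    ```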




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54593766
  
    Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-53810214
  
    I have made one pass and it's a fantastic PR. Apart from the performance benefits, the code improvements are awesome.
    
    I still need to make another pass and look closely at some methods, especially ```binsToBestSplit```.




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54699127
  
    Looks like other tests are failing.  I might as well push one more tweak (speedup tweak based on profiling).




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54700205
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19886/consoleFull) for   PR 2125 at commit [`42c192a`](https://github.com/apache/spark/commit/42c192ac23e6ca9ebcc5d7c3dcd6665de3c96447).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54383053
  
    Jenkins, please test this again.




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16696028
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/Node.scala ---
    @@ -148,3 +145,50 @@ class Node (
       }
     
     }
    +
    +private[tree] object Node {
    +
    +  /**
    +   * Return the index of the left child of this node.
    +   */
    +  def leftChildIndex(nodeIndex: Int): Int = nodeIndex * 2
    +
    +  /**
    +   * Return the index of the right child of this node.
    +   */
    +  def rightChildIndex(nodeIndex: Int): Int = nodeIndex * 2 + 1
    --- End diff --
    
    `(nodeIndex << 1) + 1`




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54252929
  
    @jkbradley I'm a little worried about the memory usage, but this is not a regression caused by this PR (ignoring the bug I commented on inline). I'm testing it on MNIST digits, which has `10` classes and `28 * 28 = 784` continuous features. The default `maxBins` is `100`. So the total array size is `10 * 100 * 784 = 784000` per node, i.e., 6MB. Under the default setting (128MB), we can only train 16 nodes per group. Two questions:
    
    1. Is the default `maxBins = 100` too high and the default `maxMemoryInMB = 128` too low? If we change `maxBins` to `32` and `maxMemoryInMB` to 256, we can train ~128 nodes in one pass.
    
    2. Can we compute the best splits distributively? We can join on the node index and send all stats associated with a single node to a single executor and find the best split there. It will help reduce the memory requirement as well as driver's load.
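    
    Spelling out that arithmetic (a back-of-the-envelope sketch; it assumes the classification stats hold numClasses Doubles per (feature, bin), 8 bytes each):
    
    ```scala
    val numClasses  = 10
    val maxBins     = 100
    val numFeatures = 784                                   // 28 * 28 pixels
    val doublesPerNode = numClasses * maxBins * numFeatures // = 784,000
    val mbPerNode = doublesPerNode * 8.0 / (1 << 20)        // ~6.0 MB per node
    val nodesPerGroup = (128 / mbPerNode).toInt             // ~21 by this crude estimate
    ```
    
    (The 16 quoted above is a bit lower than this crude estimate, presumably due to additional overhead.)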





[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16866042
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala ---
    @@ -70,32 +83,48 @@ private[tree] object DecisionTreeMetadata {
           case Regression => 0
         }
     
    -    val maxBins = math.min(strategy.maxBins, numExamples).toInt
    -    val log2MaxBinsp1 = math.log(maxBins + 1) / math.log(2.0)
    +    val maxPossibleBins = math.min(strategy.maxBins, numExamples).toInt
    +    val log2MaxPossibleBinsp1 = math.log(maxPossibleBins + 1) / math.log(2.0)
     
    +    // We check the number of bins here against maxPossibleBins.
    +    // This needs to be checked here instead of in Strategy since maxPossibleBins can be modified
    +    // based on the number of training examples.
         val unorderedFeatures = new mutable.HashSet[Int]()
    +    val numBins = Array.fill[Int](numFeatures)(maxPossibleBins)
         if (numClasses > 2) {
           strategy.categoricalFeaturesInfo.foreach { case (f, k) =>
    -        if (k - 1 < log2MaxBinsp1) {
    +        if (k - 1 < log2MaxPossibleBinsp1) {
               // Note: The above check is equivalent to checking:
               //       numUnorderedBins = (1 << k - 1) - 1 < maxBins
               unorderedFeatures.add(f)
    +          numBins(f) = numUnorderedBins(k)
             } else {
    -          // TODO: Allow this case, where we simply will know nothing about some categories?
    -          require(k < maxBins, s"maxBins (= $maxBins) should be greater than max categories " +
    +          require(k <= maxPossibleBins,
    +            s"maxBins (= $maxPossibleBins) should be greater than max categories " +
                 s"in categorical features (>= $k)")
    +          numBins(f) = k
             }
           }
         } else {
           strategy.categoricalFeaturesInfo.foreach { case (f, k) =>
    -        require(k < maxBins, s"maxBins (= $maxBins) should be greater than max categories " +
    -          s"in categorical features (>= $k)")
    +        require(k <= maxPossibleBins,
    --- End diff --
    
    Should this `require` be outside the foreach, since we are checking `values.max`?




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-53520207
  
    I just pushed a small update, but I'll make sure to check the outdated diff comments after we discuss the PR.




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-53971034
  
    The ordered categorical features are not binned, and the centroids are re-calculated from the entire bin aggregate at every level. I can see the improvement in accuracy here, since we are not using a subsample for the centroid calculation. However, is there a loss in performance from repeatedly performing this calculation at every level/group?




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16696033
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/Node.scala ---
    @@ -148,3 +145,50 @@ class Node (
       }
     
     }
    +
    +private[tree] object Node {
    +
    +  /**
    +   * Return the index of the left child of this node.
    +   */
    +  def leftChildIndex(nodeIndex: Int): Int = nodeIndex * 2
    +
    +  /**
    +   * Return the index of the right child of this node.
    +   */
    +  def rightChildIndex(nodeIndex: Int): Int = nodeIndex * 2 + 1
    +
    +  /**
    +   * Get the parent index of the given node, or 0 if it is the root.
    +   */
    +  def parentIndex(nodeIndex: Int): Int = nodeIndex >> 1
    +
    +  /**
    +   * Return the level of a tree which the given node is in.
    +   */
    +  def indexToLevel(nodeIndex: Int): Int = if (nodeIndex == 0) {
    +    throw new IllegalArgumentException(s"0 is not a valid node index.")
    +  } else {
    +    java.lang.Integer.numberOfTrailingZeros(java.lang.Integer.highestOneBit(nodeIndex))
    +  }
    +
    +  /**
    +   * Returns true if this is a left child.
    +   * Note: Returns false for the root.
    +   */
    +  def isLeftChild(nodeIndex: Int): Boolean = nodeIndex > 1 && nodeIndex % 2 == 0
    +
    +  /**
    +   * Return the maximum number of nodes which can be in the given level of the tree.
    +   * @param level  Level of tree (0 = root).
    +   */
    +  private[tree] def maxNodesInLevel(level: Int): Int = 1 << level
    --- End diff --
    
    remove `private[tree]` (because the object is already `private[tree]`)




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16696018
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -65,36 +66,39 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo
         val retaggedInput = input.retag(classOf[LabeledPoint])
         val metadata = DecisionTreeMetadata.buildMetadata(retaggedInput, strategy)
         logDebug("algo = " + strategy.algo)
    +    logDebug("maxBins = " + metadata.maxBins)
     
         // Find the splits and the corresponding bins (interval between the splits) using a sample
         // of the input data.
         timer.start("findSplitsBins")
         val (splits, bins) = DecisionTree.findSplitsBins(retaggedInput, metadata)
    -    val numBins = bins(0).length
         timer.stop("findSplitsBins")
    -    logDebug("numBins = " + numBins)
    +    logDebug("numBins: feature: number of bins")
    +    Range(0, metadata.numFeatures).foreach { featureIndex =>
    +      logDebug(s"\t$featureIndex\t${metadata.numBins(featureIndex)}")
    +    }
     
         // Bin feature values (TreePoint representation).
         // Cache input RDD for speedup during multiple passes.
         val treeInput = TreePoint.convertToTreeRDD(retaggedInput, bins, metadata)
           .persist(StorageLevel.MEMORY_AND_DISK)
     
    -    val numFeatures = metadata.numFeatures
         // depth of the decision tree
         val maxDepth = strategy.maxDepth
    -    // the max number of nodes possible given the depth of the tree
    -    val maxNumNodes = (2 << maxDepth) - 1
    +    // the max number of nodes possible given the depth of the tree, plus 1
    +    val maxNumNodes_p1 = Node.maxNodesInLevel(maxDepth + 1)
    --- End diff --
    
    `maxNumNodes_p1` -> `maxNumNodes1` or `maxNumNodesP1`. We use camelCase for variable names.




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r17017235
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala ---
    @@ -0,0 +1,208 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.impl
    +
    +import org.apache.spark.mllib.tree.impurity._
    +
    +/**
    + * DecisionTree statistics aggregator.
    + * This holds a flat array of statistics for a set of (nodes, features, bins)
    + * and helps with indexing.
    + */
    +private[tree] class DTStatsAggregator(
    +    metadata: DecisionTreeMetadata,
    +    val numNodes: Int) extends Serializable {
    +
    +  /**
    +   * [[ImpurityAggregator]] instance specifying the impurity type.
    +   */
    +  val impurityAggregator: ImpurityAggregator = metadata.impurity match {
    +    case Gini => new GiniAggregator(metadata.numClasses)
    +    case Entropy => new EntropyAggregator(metadata.numClasses)
    +    case Variance => new VarianceAggregator()
    +    case _ => throw new IllegalArgumentException(s"Bad impurity parameter: ${metadata.impurity}")
    +  }
    +
    +  /**
    +   * Number of elements (Double values) used for the sufficient statistics of each bin.
    +   */
    +  val statsSize: Int = impurityAggregator.statsSize
    +
    +  val numFeatures: Int = metadata.numFeatures
    +
    +  /**
    +   * Number of bins for each feature.  This is indexed by the feature index.
    +   */
    +  val numBins: Array[Int] = metadata.numBins
    +
    +  /**
    +   * Indicator for each feature of whether that feature is an unordered feature.
    +   * TODO: Is Array[Boolean] any faster?
    +   */
    +  def isUnordered(featureIndex: Int): Boolean = metadata.isUnordered(featureIndex)
    +
    +  /**
    +   * Offset for each feature for calculating indices into the [[allStats]] array.
    +   */
    +  private val featureOffsets: Array[Int] = {
    +    def featureOffsetsCalc(total: Int, featureIndex: Int): Int = {
    +      if (isUnordered(featureIndex)) {
    +        total + 2 * numBins(featureIndex)
    +      } else {
    +        total + numBins(featureIndex)
    +      }
    +    }
    +    Range(0, numFeatures).scanLeft(0)(featureOffsetsCalc).map(statsSize * _).toArray
    +  }
    +
    +  /**
    +   * Number of elements for each node, corresponding to stride between nodes in [[allStats]].
    +   */
     +  private val nodeStride: Int = featureOffsets.last
    +
    +  /**
    +   * Total number of elements stored in this aggregator.
    +   */
    +  val allStatsSize: Int = numNodes * nodeStride
    +
    +  /**
    +   * Flat array of elements.
    +   * Index for start of stats for a (node, feature, bin) is:
    +   *   index = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize
    +   * Note: For unordered features, the left child stats have binIndex in [0, numBins(featureIndex))
    +   *       and the right child stats in [numBins(featureIndex), 2 * numBins(featureIndex))
    +   */
    +  val allStats: Array[Double] = new Array[Double](allStatsSize)
    +
    +  /**
    +   * Get an [[ImpurityCalculator]] for a given (node, feature, bin).
    +   * @param nodeFeatureOffset  For ordered features, this is a pre-computed (node, feature) offset
    +   *                           from [[getNodeFeatureOffset]].
    +   *                           For unordered features, this is a pre-computed
    +   *                           (node, feature, left/right child) offset from
    +   *                           [[getLeftRightNodeFeatureOffsets]].
    +   */
    +  def getImpurityCalculator(nodeFeatureOffset: Int, binIndex: Int): ImpurityCalculator = {
    +    impurityAggregator.getCalculator(allStats, nodeFeatureOffset + binIndex * statsSize)
    +  }
    +
    +  /**
    +   * Update the stats for a given (node, feature, bin) for ordered features, using the given label.
    +   */
    +  def update(nodeIndex: Int, featureIndex: Int, binIndex: Int, label: Double): Unit = {
    +    val i = nodeIndex * nodeStride + featureOffsets(featureIndex) + binIndex * statsSize
    +    impurityAggregator.update(allStats, i, label)
    +  }
    +
    +  /**
    +   * Pre-compute node offset for use with [[nodeUpdate]].
    +   */
    +  def getNodeOffset(nodeIndex: Int): Int = nodeIndex * nodeStride
    +
    +  /**
    +   * Faster version of [[update]].
    +   * Update the stats for a given (node, feature, bin) for ordered features, using the given label.
    +   * @param nodeOffset  Pre-computed node offset from [[getNodeOffset]].
    +   */
    +  def nodeUpdate(nodeOffset: Int, featureIndex: Int, binIndex: Int, label: Double): Unit = {
    +    val i = nodeOffset + featureOffsets(featureIndex) + binIndex * statsSize
    +    impurityAggregator.update(allStats, i, label)
    +  }
    +
    +  /**
    +   * Pre-compute (node, feature) offset for use with [[nodeFeatureUpdate]].
    +   * For ordered features only.
    +   */
    +  def getNodeFeatureOffset(nodeIndex: Int, featureIndex: Int): Int = {
    +    require(!isUnordered(featureIndex),
    +      s"DTStatsAggregator.getNodeFeatureOffset is for ordered features only, but was called" +
    +      s" for unordered feature $featureIndex.")
    +    nodeIndex * nodeStride + featureOffsets(featureIndex)
    +  }
    +
    +  /**
    +   * Pre-compute (node, feature) offset for use with [[nodeFeatureUpdate]].
    +   * For unordered features only.
    +   */
    +  def getLeftRightNodeFeatureOffsets(nodeIndex: Int, featureIndex: Int): (Int, Int) = {
    +    require(isUnordered(featureIndex),
    +      s"DTStatsAggregator.getLeftRightNodeFeatureOffsets is for unordered features only," +
    +        s" but was called for ordered feature $featureIndex.")
    +    val baseOffset = nodeIndex * nodeStride + featureOffsets(featureIndex)
    +    (baseOffset, baseOffset + numBins(featureIndex) * statsSize)
    +  }
    +
    +  /**
    +   * Faster version of [[update]].
    +   * Update the stats for a given (node, feature, bin), using the given label.
    +   * @param nodeFeatureOffset  For ordered features, this is a pre-computed (node, feature) offset
    +   *                           from [[getNodeFeatureOffset]].
    +   *                           For unordered features, this is a pre-computed
    +   *                           (node, feature, left/right child) offset from
    +   *                           [[getLeftRightNodeFeatureOffsets]].
    +   */
    +  def nodeFeatureUpdate(nodeFeatureOffset: Int, binIndex: Int, label: Double): Unit = {
    +    impurityAggregator.update(allStats, nodeFeatureOffset + binIndex * statsSize, label)
    +  }
    +
    +  /**
    +   * For a given (node, feature), merge the stats for two bins.
    +   * @param nodeFeatureOffset  For ordered features, this is a pre-computed (node, feature) offset
    +   *                           from [[getNodeFeatureOffset]].
    +   *                           For unordered features, this is a pre-computed
    +   *                           (node, feature, left/right child) offset from
    +   *                           [[getLeftRightNodeFeatureOffsets]].
    +   * @param binIndex  The other bin is merged into this bin.
    +   * @param otherBinIndex  This bin is not modified.
    +   */
    +  def mergeForNodeFeature(nodeFeatureOffset: Int, binIndex: Int, otherBinIndex: Int): Unit = {
    +    impurityAggregator.merge(allStats, nodeFeatureOffset + binIndex * statsSize,
    +      nodeFeatureOffset + otherBinIndex * statsSize)
    +  }
    +
    +  /**
    +   * Merge this aggregator with another, returning this aggregator.
    +   * This method modifies this aggregator in-place.
    +   */
    +  def merge(other: DTStatsAggregator): DTStatsAggregator = {
    --- End diff --
    
    Good idea; I'll make a JIRA for that.
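    
    As an aside on the quoted indexing scheme: here is a minimal standalone sketch of how the flat (node, feature, bin) offsets compose. The sizes and helper names (`binsPerFeature`, `statsIndex`) are made up for illustration; only the arithmetic mirrors the code above.
    
    ```scala
    // 2 nodes, 2 features; feature 0 is ordered with 4 bins, feature 1 is
    // unordered with 3 bins, so it stores 2 * 3 = 6 bins (left then right child).
    // statsSize = 3, e.g. one class count per bin for 3-class Gini.
    val statsSize = 3
    val binsPerFeature = Array(4, 6)  // the unordered feature is already doubled
    val featureOffsets = binsPerFeature.scanLeft(0)(_ + _).map(_ * statsSize)
    // featureOffsets = Array(0, 12, 30)
    val nodeStride = featureOffsets.last  // 30 doubles per node
    def statsIndex(node: Int, feature: Int, bin: Int): Int =
      node * nodeStride + featureOffsets(feature) + bin * statsSize
    assert(statsIndex(0, 0, 0) == 0)   // node 0, feature 0, bin 0
    assert(statsIndex(0, 1, 2) == 18)  // node 0, left-child bin 2 of feature 1
    assert(statsIndex(1, 0, 3) == 39)  // node 1 starts at element 30
    ```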


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54701490
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19886/consoleFull) for   PR 2125 at commit [`42c192a`](https://github.com/apache/spark/commit/42c192ac23e6ca9ebcc5d7c3dcd6665de3c96447).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `    logDebug("isMulticlass = " + metadata.isMulticlass)`
      * `    logDebug("isMulticlass = " + metadata.isMulticlass)`





[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54695059
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19859/consoleFull) for   PR 2125 at commit [`a2acea5`](https://github.com/apache/spark/commit/a2acea566449f07ce122f0904effe2eefe6c0f8a).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16866883
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -1395,96 +1027,24 @@ object DecisionTree extends Serializable with Logging {
                           Double.MinValue)
                       } else {
                         new Bin(
    -                      splits(featureIndex)(index - 1),
    -                      splits(featureIndex)(index),
    +                      splits(featureIndex)(splitIndex - 1),
    +                      splits(featureIndex)(splitIndex),
                           Categorical,
                           Double.MinValue)
                       }
                     }
    -                index += 1
    -              }
    -            } else { // ordered feature
    -              /* For a given categorical feature, use a subsample of the data
    -               * to choose how to arrange possible splits.
    -               * This examines each category and computes a centroid.
    -               * These centroids are later used to sort the possible splits.
    -               * centroidForCategories is a mapping: category (for the given feature) --> centroid
    -               */
    -              val centroidForCategories = {
    -                if (isMulticlass) {
    -                  // For categorical variables in multiclass classification,
    -                  // each bin is a category. The bins are sorted and they
    -                  // are ordered by calculating the impurity of their corresponding labels.
    -                  sampledInput.map(lp => (lp.features(featureIndex), lp.label))
    -                   .groupBy(_._1)
    -                   .mapValues(x => x.groupBy(_._2).mapValues(x => x.size.toDouble))
    -                   .map(x => (x._1, x._2.values.toArray))
    -                   .map(x => (x._1, metadata.impurity.calculate(x._2, x._2.sum)))
    -                } else { // regression or binary classification
    -                  // For categorical variables in regression and binary classification,
    -                  // each bin is a category. The bins are sorted and they
    -                  // are ordered by calculating the centroid of their corresponding labels.
    -                  sampledInput.map(lp => (lp.features(featureIndex), lp.label))
    -                    .groupBy(_._1)
    -                    .mapValues(x => x.map(_._2).sum / x.map(_._1).length)
    -                }
    -              }
    -
    -              logDebug("centroid for categories = " + centroidForCategories.mkString(","))
    -
    -              // Check for missing categorical variables and putting them last in the sorted list.
    -              val fullCentroidForCategories = scala.collection.mutable.Map[Double,Double]()
    -              for (i <- 0 until featureCategories) {
    -                if (centroidForCategories.contains(i)) {
    -                  fullCentroidForCategories(i) = centroidForCategories(i)
    -                } else {
    -                  fullCentroidForCategories(i) = Double.MaxValue
    -                }
    -              }
    -
    -              // bins sorted by centroids
    -              val categoriesSortedByCentroid = fullCentroidForCategories.toList.sortBy(_._2)
    -
    -              logDebug("centroid for categorical variable = " + categoriesSortedByCentroid)
    -
    -              var categoriesForSplit = List[Double]()
    -              categoriesSortedByCentroid.iterator.zipWithIndex.foreach {
    -                case ((key, value), index) =>
    -                  categoriesForSplit = key :: categoriesForSplit
    -                  splits(featureIndex)(index) = new Split(featureIndex, Double.MinValue,
    -                    Categorical, categoriesForSplit)
    -                  bins(featureIndex)(index) = {
    -                    if (index == 0) {
    -                      new Bin(new DummyCategoricalSplit(featureIndex, Categorical),
    -                        splits(featureIndex)(0), Categorical, key)
    -                    } else {
    -                      new Bin(splits(featureIndex)(index-1), splits(featureIndex)(index),
    -                        Categorical, key)
    -                    }
    -                  }
    +                splitIndex += 1
                   }
    +            } else {
    +              // Ordered features: high-arity features, or not multiclass classification
    --- End diff --
    
    Unclear what "or not multiclass classification" means here.




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54225506
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19606/consoleFull) for   PR 2125 at commit [`4651154`](https://github.com/apache/spark/commit/46511547be6d193d19356ac31a978a83fe27e0b5).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54231541
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19606/consoleFull) for   PR 2125 at commit [`4651154`](https://github.com/apache/spark/commit/46511547be6d193d19356ac31a978a83fe27e0b5).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `    logDebug("isMulticlass = " + metadata.isMulticlass)`
      * `    logDebug("isMulticlass = " + metadata.isMulticlass)`





[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54698166
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19859/consoleFull) for   PR 2125 at commit [`a2acea5`](https://github.com/apache/spark/commit/a2acea566449f07ce122f0904effe2eefe6c0f8a).
     * This patch **fails** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-53372120
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19197/consoleFull) for   PR 2125 at commit [`92f934f`](https://github.com/apache/spark/commit/92f934f789decc0b4d8eaf8614438e90a226a934).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16865586
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala ---
    @@ -70,32 +83,48 @@ private[tree] object DecisionTreeMetadata {
           case Regression => 0
         }
     
    -    val maxBins = math.min(strategy.maxBins, numExamples).toInt
    -    val log2MaxBinsp1 = math.log(maxBins + 1) / math.log(2.0)
    +    val maxPossibleBins = math.min(strategy.maxBins, numExamples).toInt
    +    val log2MaxPossibleBinsp1 = math.log(maxPossibleBins + 1) / math.log(2.0)
     
    +    // We check the number of bins here against maxPossibleBins.
    +    // This needs to be checked here instead of in Strategy since maxPossibleBins can be modified
    +    // based on the number of training examples.
         val unorderedFeatures = new mutable.HashSet[Int]()
    +    val numBins = Array.fill[Int](numFeatures)(maxPossibleBins)
         if (numClasses > 2) {
           strategy.categoricalFeaturesInfo.foreach { case (f, k) =>
    -        if (k - 1 < log2MaxBinsp1) {
    +        if (k - 1 < log2MaxPossibleBinsp1) {
               // Note: The above check is equivalent to checking:
               //       numUnorderedBins = (1 << k - 1) - 1 < maxBins
               unorderedFeatures.add(f)
    +          numBins(f) = numUnorderedBins(k)
             } else {
    -          // TODO: Allow this case, where we simply will know nothing about some categories?
    -          require(k < maxBins, s"maxBins (= $maxBins) should be greater than max categories " +
    +          require(k <= maxPossibleBins,
    +            s"maxBins (= $maxPossibleBins) should be greater than max categories " +
                 s"in categorical features (>= $k)")
    +          numBins(f) = k
             }
           }
         } else {
           strategy.categoricalFeaturesInfo.foreach { case (f, k) =>
    -        require(k < maxBins, s"maxBins (= $maxBins) should be greater than max categories " +
    -          s"in categorical features (>= $k)")
    +        require(k <= maxPossibleBins,
    +          s"DecisionTree requires maxBins (= $maxPossibleBins) >= max categories " +
    +          s"in categorical features (= ${strategy.categoricalFeaturesInfo.values.max})")
    --- End diff --
    
    could extract strategy.categoricalFeaturesInfo into a val for reuse.
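    
    For illustration, the extraction might look like this (a sketch against the quoted hunk, not the exact diff):
    
    ```scala
    val categoricalFeaturesInfo = strategy.categoricalFeaturesInfo
    categoricalFeaturesInfo.foreach { case (f, k) =>
      require(k <= maxPossibleBins,
        s"DecisionTree requires maxBins (= $maxPossibleBins) >= max categories " +
        s"in categorical features (= ${categoricalFeaturesInfo.values.max})")
      numBins(f) = k
    }
    ```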




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-53974143
  
    Apart from the discussion around the correct place for the centroid calculations and some minor code style comments, it looks good to me. If it's too much work to change in this PR, we can create a JIRA ticket for 1.3 to address it later.
    
    :+1: 




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16865085
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala ---
    @@ -70,32 +83,48 @@ private[tree] object DecisionTreeMetadata {
           case Regression => 0
         }
     
    -    val maxBins = math.min(strategy.maxBins, numExamples).toInt
    -    val log2MaxBinsp1 = math.log(maxBins + 1) / math.log(2.0)
    +    val maxPossibleBins = math.min(strategy.maxBins, numExamples).toInt
    +    val log2MaxPossibleBinsp1 = math.log(maxPossibleBins + 1) / math.log(2.0)
     
    +    // We check the number of bins here against maxPossibleBins.
    +    // This needs to be checked here instead of in Strategy since maxPossibleBins can be modified
    +    // based on the number of training examples.
         val unorderedFeatures = new mutable.HashSet[Int]()
    +    val numBins = Array.fill[Int](numFeatures)(maxPossibleBins)
         if (numClasses > 2) {
           strategy.categoricalFeaturesInfo.foreach { case (f, k) =>
    --- End diff --
    
    ```f``` and ```k``` could be renamed to ```feature``` and ```numCategories```




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16932280
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala ---
    @@ -0,0 +1,208 @@
    +  /**
    +   * Offset for each feature for calculating indices into the [[allStats]] array.
    +   */
    +  private val featureOffsets: Array[Int] = {
    +    def featureOffsetsCalc(total: Int, featureIndex: Int): Int = {
    +      if (isUnordered(featureIndex)) {
    +        total + 2 * numBins(featureIndex)
    --- End diff --
    
    Would the factor of 2 for unordered categorical features be more suitable in the numBins calculation in the DecisionTreeMetadata class?
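    
    As a sketch of that alternative (hypothetical helper; the PR keeps the branch in the aggregator): if the doubling were folded into DecisionTreeMetadata.numBins, the per-feature offsets would reduce to a uniform scan:
    
    ```scala
    // Hypothetical: numBins(f) already includes the factor of 2 for unordered
    // features, so computing offsets no longer needs an isUnordered branch.
    def featureOffsets(numBins: Array[Int], statsSize: Int): Array[Int] =
      numBins.scanLeft(0)(_ + _).map(_ * statsSize)
    ```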




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54590611
  
    retest this please




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16931670
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala ---
    @@ -0,0 +1,208 @@
    +  /**
    +   * Pre-compute (node, feature) offset for use with [[nodeFeatureUpdate]].
    +   * For ordered features only.
    +   */
    +  def getNodeFeatureOffset(nodeIndex: Int, featureIndex: Int): Int = {
    +    require(!isUnordered(featureIndex),
    --- End diff --
    
    Should we enable these requires only in debug mode?
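    
    One option (a sketch, not part of this PR): unlike `require`, Scala's `assert` is marked `@elidable(ASSERTION)`, so moving the checks into an elidable helper lets them be compiled away:
    
    ```scala
    import scala.annotation.elidable
    import scala.annotation.elidable.ASSERTION
    
    // Hypothetical helper inside DTStatsAggregator; calls to it are elided
    // when compiling with -Xdisable-assertions (or a high enough -Xelide-below).
    @elidable(ASSERTION)
    private def assertOrdered(featureIndex: Int): Unit = {
      assert(!isUnordered(featureIndex),
        s"getNodeFeatureOffset called for unordered feature $featureIndex.")
    }
    ```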




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16870450
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala ---
    @@ -0,0 +1,208 @@
    +  /**
    +   * Merge this aggregator with another, returning this aggregator.
    +   * This method modifies this aggregator in-place.
    +   */
    +  def merge(other: DTStatsAggregator): DTStatsAggregator = {
    --- End diff --
    
    We should also possibly create a JIRA for handling precision loss when using Double for large aggregates during variance calculation. This was an observation from @mengxr during the first DT PR. I tried to incorporate it then, but it was hard with that code structure. This abstraction might be a good place to incorporate @mengxr's suggestion. Again, not urgent but good for future improvement.
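    
    For reference, the usual remedy (a sketch of the idea only, not this PR's code) is to keep (count, mean, M2) per bin instead of raw sums and sums of squares, merging with the pairwise update of Chan et al.; this avoids the cancellation in computing E[X^2] - E[X]^2 from large Double sums:
    
    ```scala
    // Numerically stabler sufficient statistics for variance.
    final case class VarStats(count: Double, mean: Double, m2: Double) {
      def merge(other: VarStats): VarStats = {
        val n = count + other.count
        if (n == 0) this
        else {
          val delta = other.mean - mean
          VarStats(
            n,
            mean + delta * other.count / n,
            m2 + other.m2 + delta * delta * count * other.count / n)
        }
      }
      def variance: Double = if (count == 0) 0.0 else m2 / count
    }
    ```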




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16864832
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala ---
    @@ -70,32 +83,48 @@ private[tree] object DecisionTreeMetadata {
           case Regression => 0
         }
     
    -    val maxBins = math.min(strategy.maxBins, numExamples).toInt
    -    val log2MaxBinsp1 = math.log(maxBins + 1) / math.log(2.0)
    +    val maxPossibleBins = math.min(strategy.maxBins, numExamples).toInt
    +    val log2MaxPossibleBinsp1 = math.log(maxPossibleBins + 1) / math.log(2.0)
    --- End diff --
    
    Plus1 instead of p1 might be better.




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16867125
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -65,36 +66,39 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo
         val retaggedInput = input.retag(classOf[LabeledPoint])
         val metadata = DecisionTreeMetadata.buildMetadata(retaggedInput, strategy)
         logDebug("algo = " + strategy.algo)
    +    logDebug("maxBins = " + metadata.maxBins)
     
         // Find the splits and the corresponding bins (interval between the splits) using a sample
         // of the input data.
         timer.start("findSplitsBins")
         val (splits, bins) = DecisionTree.findSplitsBins(retaggedInput, metadata)
    -    val numBins = bins(0).length
         timer.stop("findSplitsBins")
    -    logDebug("numBins = " + numBins)
    +    logDebug("numBins: feature: number of bins")
    +    Range(0, metadata.numFeatures).foreach { featureIndex =>
    +      logDebug(s"\t$featureIndex\t${metadata.numBins(featureIndex)}")
    +    }
     
         // Bin feature values (TreePoint representation).
         // Cache input RDD for speedup during multiple passes.
         val treeInput = TreePoint.convertToTreeRDD(retaggedInput, bins, metadata)
           .persist(StorageLevel.MEMORY_AND_DISK)
     
    -    val numFeatures = metadata.numFeatures
         // depth of the decision tree
         val maxDepth = strategy.maxDepth
    -    // the max number of nodes possible given the depth of the tree
    -    val maxNumNodes = (2 << maxDepth) - 1
    +    // the max number of nodes possible given the depth of the tree, plus 1
    +    val maxNumNodes_p1 = Node.maxNodesInLevel(maxDepth + 1)
    --- End diff --
    
    Another suggestion: ```maxNumNodesPlus1```




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2125




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16865423
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala ---
    @@ -70,32 +83,48 @@ private[tree] object DecisionTreeMetadata {
           case Regression => 0
         }
     
    -    val maxBins = math.min(strategy.maxBins, numExamples).toInt
    -    val log2MaxBinsp1 = math.log(maxBins + 1) / math.log(2.0)
    +    val maxPossibleBins = math.min(strategy.maxBins, numExamples).toInt
    +    val log2MaxPossibleBinsp1 = math.log(maxPossibleBins + 1) / math.log(2.0)
     
    +    // We check the number of bins here against maxPossibleBins.
    +    // This needs to be checked here instead of in Strategy since maxPossibleBins can be modified
    +    // based on the number of training examples.
         val unorderedFeatures = new mutable.HashSet[Int]()
    +    val numBins = Array.fill[Int](numFeatures)(maxPossibleBins)
         if (numClasses > 2) {
           strategy.categoricalFeaturesInfo.foreach { case (f, k) =>
    -        if (k - 1 < log2MaxBinsp1) {
    +        if (k - 1 < log2MaxPossibleBinsp1) {
    --- End diff --
    
    Might be worthwhile highlighting why we are doing this check, basically mentioning whether a categorical feature can be ordered or not. It will help with code maintenance.
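    
    A small sketch of what that note could say (`numUnorderedBins` is from the PR; `canBeUnordered` is a hypothetical name):
    
    ```scala
    // An unordered categorical feature with k categories has 2^(k-1) - 1
    // candidate subset splits, so unordered treatment is only feasible when that
    // count fits within maxBins.  The log2 form is the same test:
    //   k - 1 < log2(maxBins + 1)  <=>  (1 << (k - 1)) - 1 < maxBins
    def numUnorderedBins(k: Int): Int = (1 << (k - 1)) - 1
    def canBeUnordered(k: Int, maxBins: Int): Boolean =
      numUnorderedBins(k) < maxBins
    // e.g. k = 5, maxBins = 32: 2^4 - 1 = 15 < 32, so unordered is feasible.
    ```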




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16936329
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala ---
    @@ -26,14 +26,15 @@ import org.apache.spark.mllib.tree.configuration.Strategy
     import org.apache.spark.mllib.tree.impurity.Impurity
     import org.apache.spark.rdd.RDD
     
    -
     /**
      * Learning and dataset metadata for DecisionTree.
      *
      * @param numClasses    For classification: labels can take values {0, ..., numClasses - 1}.
      *                      For regression: fixed at 0 (no meaning).
    + * @param maxBins  Maximum number of bins, for all features.
      * @param featureArity  Map: categorical feature index --> arity.
      *                      I.e., the feature takes values in {0, ..., arity - 1}.
    --- End diff --
    
    Javadoc does not match the method arguments.




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16931674
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala ---
    @@ -0,0 +1,208 @@
    +  /**
    +   * Pre-compute (node, feature) offset for use with [[nodeFeatureUpdate]].
    +   * For unordered features only.
    +   */
    +  def getLeftRightNodeFeatureOffsets(nodeIndex: Int, featureIndex: Int): (Int, Int) = {
    +    require(isUnordered(featureIndex),
    --- End diff --
    
    Should we enable these `require` checks only in debug mode?
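
    (An editorial sketch of one option, not from the original thread: Scala's
    `assert`, unlike `require`, is marked @elidable, so a build compiled with
    -Xdisable-assertions skips the check entirely.)

        // Sketch: assert can be elided at compile time with
        // -Xdisable-assertions, whereas require always runs.
        def getNodeFeatureOffset(nodeIndex: Int, featureIndex: Int): Int = {
          assert(!isUnordered(featureIndex),
            s"DTStatsAggregator.getNodeFeatureOffset is for ordered features only," +
            s" but was called for unordered feature $featureIndex.")
          nodeIndex * nodeStride + featureOffsets(featureIndex)
        }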




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16864683
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala ---
    @@ -65,36 +66,39 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo
         val retaggedInput = input.retag(classOf[LabeledPoint])
         val metadata = DecisionTreeMetadata.buildMetadata(retaggedInput, strategy)
         logDebug("algo = " + strategy.algo)
    +    logDebug("maxBins = " + metadata.maxBins)
     
         // Find the splits and the corresponding bins (interval between the splits) using a sample
         // of the input data.
         timer.start("findSplitsBins")
         val (splits, bins) = DecisionTree.findSplitsBins(retaggedInput, metadata)
    -    val numBins = bins(0).length
         timer.stop("findSplitsBins")
    -    logDebug("numBins = " + numBins)
    +    logDebug("numBins: feature: number of bins")
    +    Range(0, metadata.numFeatures).foreach { featureIndex =>
    +      logDebug(s"\t$featureIndex\t${metadata.numBins(featureIndex)}")
    --- End diff --
    
    The `foreach` loop will be executed even when debug logging is off.
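
    (A hedged sketch of one fix, assuming the enclosing class mixes in Spark's
    Logging trait, whose slf4j `log` exposes isDebugEnabled:)

        // Guard the loop itself so no per-feature work happens
        // unless debug logging is enabled.
        if (log.isDebugEnabled) {
          logDebug("numBins: feature: number of bins")
          Range(0, metadata.numFeatures).foreach { featureIndex =>
            logDebug(s"\t$featureIndex\t${metadata.numBins(featureIndex)}")
          }
        }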




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r17032306
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala ---
    @@ -0,0 +1,213 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.impl
    +
    +import org.apache.spark.mllib.tree.impurity._
    +
    +/**
    + * DecisionTree statistics aggregator.
    + * This holds a flat array of statistics for a set of (nodes, features, bins)
    + * and helps with indexing.
    + */
    +private[tree] class DTStatsAggregator(
    +    val metadata: DecisionTreeMetadata,
    +    val numNodes: Int) extends Serializable {
    +
    +  /**
    +   * [[ImpurityAggregator]] instance specifying the impurity type.
    +   */
    +  val impurityAggregator: ImpurityAggregator = metadata.impurity match {
    +    case Gini => new GiniAggregator(metadata.numClasses)
    +    case Entropy => new EntropyAggregator(metadata.numClasses)
    +    case Variance => new VarianceAggregator()
    +    case _ => throw new IllegalArgumentException(s"Bad impurity parameter: ${metadata.impurity}")
    +  }
    +
    +  /**
    +   * Number of elements (Double values) used for the sufficient statistics of each bin.
    +   */
    +  val statsSize: Int = impurityAggregator.statsSize
    +
    +  val numFeatures: Int = metadata.numFeatures
    +
    +  /**
    +   * Number of bins for each feature.  This is indexed by the feature index.
    +   */
    +  val numBins: Array[Int] = metadata.numBins
    +
    +  /**
    +   * Number of splits for the given feature.
    +   */
    +  def numSplits(featureIndex: Int): Int = metadata.numSplits(featureIndex)
    +
    +  /**
    +   * Indicator for each feature of whether that feature is an unordered feature.
    +   * TODO: Is Array[Boolean] any faster?
    +   */
    +  def isUnordered(featureIndex: Int): Boolean = metadata.isUnordered(featureIndex)
    +
    +  /**
    +   * Offset for each feature for calculating indices into the [[allStats]] array.
    +   */
    +  private val featureOffsets: Array[Int] = {
    +    def featureOffsetsCalc(total: Int, featureIndex: Int): Int = {
    +      if (isUnordered(featureIndex)) {
    +        total + 2 * numBins(featureIndex)
    +      } else {
    +        total + numBins(featureIndex)
    +      }
    +    }
    +    Range(0, numFeatures).scanLeft(0)(featureOffsetsCalc).map(statsSize * _).toArray
    +  }
    +
    +  /**
    +   * Number of elements for each node, corresponding to stride between nodes in [[allStats]].
    +   */
    +  private val nodeStride: Int = featureOffsets.last * statsSize
    --- End diff --
    
    Yikes; good catch!




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16696040
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/Node.scala ---
    @@ -148,3 +145,50 @@ class Node (
       }
     
     }
    +
    +private[tree] object Node {
    +
    +  /**
    +   * Return the index of the left child of this node.
    +   */
    +  def leftChildIndex(nodeIndex: Int): Int = nodeIndex * 2
    +
    +  /**
    +   * Return the index of the right child of this node.
    +   */
    +  def rightChildIndex(nodeIndex: Int): Int = nodeIndex * 2 + 1
    +
    +  /**
    +   * Get the parent index of the given node, or 0 if it is the root.
    +   */
    +  def parentIndex(nodeIndex: Int): Int = nodeIndex >> 1
    +
    +  /**
    +   * Return the level of the tree that the given node is in.
    +   */
    +  def indexToLevel(nodeIndex: Int): Int = if (nodeIndex == 0) {
    +    throw new IllegalArgumentException(s"0 is not a valid node index.")
    +  } else {
    +    java.lang.Integer.numberOfTrailingZeros(java.lang.Integer.highestOneBit(nodeIndex))
    +  }
    +
    +  /**
    +   * Returns true if this is a left child.
    +   * Note: Returns false for the root.
    +   */
    +  def isLeftChild(nodeIndex: Int): Boolean = nodeIndex > 1 && nodeIndex % 2 == 0
    +
    +  /**
    +   * Return the maximum number of nodes which can be in the given level of the tree.
    +   * @param level  Level of tree (0 = root).
    +   */
    +  private[tree] def maxNodesInLevel(level: Int): Int = 1 << level
    +
    +  /**
    +   * Return the maximum number of nodes which can be in or above the given level of the tree
    +   * (i.e., for the entire subtree from the root to this level).
    +   * @param level  Level of tree (0 = root).
    +   */
    +  private[tree] def maxNodesInSubtree(level: Int): Int = (1 << level + 1) - 1
    --- End diff --
    
    This is mostly used to determine the starting node index of a level. Shall we add a new method `startIndexInLevel(level: Int): Int = 1 << level`? Also, the name `subtree` usually refers to the entire set of descendants of a node.
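
    (For illustration — a sketch of the proposed method: with 1-indexed nodes,
    level L spans indices [2^L, 2^(L+1) - 1], so it both starts at 2^L and
    holds 2^L nodes.)

        // Level 0 is the root (node 1); level 1 holds nodes 2-3;
        // level 2 holds nodes 4-7; and so on.
        def startIndexInLevel(level: Int): Int = 1 << level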




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16865673
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala ---
    @@ -70,32 +83,48 @@ private[tree] object DecisionTreeMetadata {
           case Regression => 0
         }
     
    -    val maxBins = math.min(strategy.maxBins, numExamples).toInt
    -    val log2MaxBinsp1 = math.log(maxBins + 1) / math.log(2.0)
    +    val maxPossibleBins = math.min(strategy.maxBins, numExamples).toInt
    +    val log2MaxPossibleBinsp1 = math.log(maxPossibleBins + 1) / math.log(2.0)
     
    +    // We check the number of bins here against maxPossibleBins.
    +    // This needs to be checked here instead of in Strategy since maxPossibleBins can be modified
    +    // based on the number of training examples.
         val unorderedFeatures = new mutable.HashSet[Int]()
    +    val numBins = Array.fill[Int](numFeatures)(maxPossibleBins)
         if (numClasses > 2) {
    --- End diff --
    
    `val isMulticlassClassification = numClasses > 2` might be more readable




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r17031187
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala ---
    @@ -0,0 +1,213 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.tree.impl
    +
    +import org.apache.spark.mllib.tree.impurity._
    +
    +/**
    + * DecisionTree statistics aggregator.
    + * This holds a flat array of statistics for a set of (nodes, features, bins)
    + * and helps with indexing.
    + */
    +private[tree] class DTStatsAggregator(
    +    val metadata: DecisionTreeMetadata,
    +    val numNodes: Int) extends Serializable {
    +
    +  /**
    +   * [[ImpurityAggregator]] instance specifying the impurity type.
    +   */
    +  val impurityAggregator: ImpurityAggregator = metadata.impurity match {
    +    case Gini => new GiniAggregator(metadata.numClasses)
    +    case Entropy => new EntropyAggregator(metadata.numClasses)
    +    case Variance => new VarianceAggregator()
    +    case _ => throw new IllegalArgumentException(s"Bad impurity parameter: ${metadata.impurity}")
    +  }
    +
    +  /**
    +   * Number of elements (Double values) used for the sufficient statistics of each bin.
    +   */
    +  val statsSize: Int = impurityAggregator.statsSize
    +
    +  val numFeatures: Int = metadata.numFeatures
    +
    +  /**
    +   * Number of bins for each feature.  This is indexed by the feature index.
    +   */
    +  val numBins: Array[Int] = metadata.numBins
    +
    +  /**
    +   * Number of splits for the given feature.
    +   */
    +  def numSplits(featureIndex: Int): Int = metadata.numSplits(featureIndex)
    +
    +  /**
    +   * Indicator for each feature of whether that feature is an unordered feature.
    +   * TODO: Is Array[Boolean] any faster?
    +   */
    +  def isUnordered(featureIndex: Int): Boolean = metadata.isUnordered(featureIndex)
    +
    +  /**
    +   * Offset for each feature for calculating indices into the [[allStats]] array.
    +   */
    +  private val featureOffsets: Array[Int] = {
    +    def featureOffsetsCalc(total: Int, featureIndex: Int): Int = {
    +      if (isUnordered(featureIndex)) {
    +        total + 2 * numBins(featureIndex)
    +      } else {
    +        total + numBins(featureIndex)
    +      }
    +    }
    +    Range(0, numFeatures).scanLeft(0)(featureOffsetsCalc).map(statsSize * _).toArray
    +  }
    +
    +  /**
    +   * Number of elements for each node, corresponding to stride between nodes in [[allStats]].
    +   */
    +  private val nodeStride: Int = featureOffsets.last * statsSize
    --- End diff --
    
    Remove `* statsSize`, which is already included in `featureOffsets`. I ran into memory problems otherwise.
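
    (A worked example of the double-counting, added for illustration: with
    statsSize = 3 and two ordered features of 4 bins each, featureOffsets is
    [0, 12, 24] — the `map(statsSize * _)` step already folds in statsSize —
    so the stride should be featureOffsets.last = 24 doubles per node.
    Multiplying by statsSize again allocates 72, i.e. 3x the needed memory.)

        // Corrected sketch: featureOffsets already includes statsSize.
        private val nodeStride: Int = featureOffsets.last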




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by manishamde <gi...@git.apache.org>.
Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54222981
  
    Thanks for the clarification. Improvement in accuracy without loss in performance is always good. :-)




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54677299
  
    Jenkins, retest this please.




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-54255291
  
    @mengxr  I agree memory limits are a problem.
    
    1.  I am OK with changing maxBins to 32.  For maxMemoryInMB, does 256 seem reasonable when the default executor memory is 512?
    
    2. For choosing the best splits, I agree we could distribute it, but I'm not sure about the gains.  The memory requirements might be hard to reduce given the current setup, since each executor needs to update the bins for all nodes during its one pass over its data points.  If we maintained a mapping from nodes to relevant examples, then I could imagine spawning one job per node; that would reduce the memory requirement but might mean many more passes over the data.  As for the driver's load, fairly little time is spent outside of aggregation, so I don't think it's a big issue.  Am I misunderstanding, though?
    
    I could imagine 2 main ways to reduce memory usage without doing more passes over the data:
    (a) Simple way: We could use different types to compress data, as others have done (see the sketch after this list).
    (b) Fancy way: We could use many fewer maxBins by default but re-bin features at each node to ameliorate the effects of a small number of bins.  E.g., the top node might split a continuous feature into bins [0, 20], [20, 40], ... and choose to split at 20; in the next level, the left node might use bins [0, 5], [5, 10], ... [15, 20], and the right node might use bins [20, 30], [30, 40], ....
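
    (A hedged sketch of option (a), purely illustrative and not part of the
    patch: if every feature had at most 256 bins, bin indices would fit in a
    Byte, roughly quartering the binned-data footprint versus Int. The class
    name below is hypothetical.)

        // Sketch, assuming numBins(f) <= 256 for every feature f.
        class ByteBinnedPoint(val label: Double, val bins: Array[Byte])
          extends Serializable {
          // Mask to recover the unsigned bin index, since Byte is signed.
          def bin(featureIndex: Int): Int = bins(featureIndex) & 0xFF
        }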




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r16696027
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/Node.scala ---
    @@ -148,3 +145,50 @@ class Node (
       }
     
     }
    +
    +private[tree] object Node {
    +
    +  /**
    +   * Return the index of the left child of this node.
    +   */
    +  def leftChildIndex(nodeIndex: Int): Int = nodeIndex * 2
    --- End diff --
    
    `nodeIndex << 1`
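
    (Equivalent in behavior; a sketch of the suggested form. For example, with
    1-indexing, node 5's children are 10 and 11.)

        def leftChildIndex(nodeIndex: Int): Int = nodeIndex << 1
        def rightChildIndex(nodeIndex: Int): Int = (nodeIndex << 1) + 1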




[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2125#issuecomment-53374813
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19197/consoleFull) for   PR 2125 at commit [`92f934f`](https://github.com/apache/spark/commit/92f934f789decc0b4d8eaf8614438e90a226a934).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `    $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$`
      * `    $FWDIR/bin/spark-submit --class org.apache.spark.repl.Main "$`
      * `class KMeansModel (val clusterCenters: Array[Vector]) extends Serializable `
      * `    logDebug("isMulticlass = " + metadata.isMulticlass)`
      * `    logDebug("isMulticlass = " + metadata.isMulticlass)`
      * `class BoundedFloat(float):`
      * `class JoinedRow2 extends Row `
      * `class JoinedRow3 extends Row `
      * `class JoinedRow4 extends Row `
      * `class JoinedRow5 extends Row `
      * `class GenericRow(protected[sql] val values: Array[Any]) extends Row `
      * `abstract class MutableValue extends Serializable `
      * `final class MutableInt extends MutableValue `
      * `final class MutableFloat extends MutableValue `
      * `final class MutableBoolean extends MutableValue `
      * `final class MutableDouble extends MutableValue `
      * `final class MutableShort extends MutableValue `
      * `final class MutableLong extends MutableValue `
      * `final class MutableByte extends MutableValue `
      * `final class MutableAny extends MutableValue `
      * `final class SpecificMutableRow(val values: Array[MutableValue]) extends MutableRow `
      * `case class CountDistinct(expressions: Seq[Expression]) extends PartialAggregate `
      * `case class CollectHashSet(expressions: Seq[Expression]) extends AggregateExpression `
      * `case class CollectHashSetFunction(`
      * `case class CombineSetsAndCount(inputSet: Expression) extends AggregateExpression `
      * `case class CombineSetsAndCountFunction(`
      * `case class CountDistinctFunction(`
      * `case class MaxOf(left: Expression, right: Expression) extends Expression `
      * `case class NewSet(elementType: DataType) extends LeafExpression `
      * `case class AddItemToSet(item: Expression, set: Expression) extends Expression `
      * `case class CombineSets(left: Expression, right: Expression) extends BinaryExpression `
      * `case class CountSet(child: Expression) extends UnaryExpression `
      * `case class ExplainCommand(plan: LogicalPlan, extended: Boolean = false) extends Command `





[GitHub] spark pull request: [SPARK-3086] [SPARK-3043] [SPARK-3156] [mllib]...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2125#discussion_r17017226
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala ---
    @@ -26,14 +26,15 @@ import org.apache.spark.mllib.tree.configuration.Strategy
     import org.apache.spark.mllib.tree.impurity.Impurity
     import org.apache.spark.rdd.RDD
     
    -
     /**
      * Learning and dataset metadata for DecisionTree.
      *
      * @param numClasses    For classification: labels can take values {0, ..., numClasses - 1}.
      *                      For regression: fixed at 0 (no meaning).
    + * @param maxBins  Maximum number of bins, for all features.
      * @param featureArity  Map: categorical feature index --> arity.
      *                      I.e., the feature takes values in {0, ..., arity - 1}.
    --- End diff --
    
    This actually does match the parameter's meaning.  The parameter is set when DecisionTreeMetadata is constructed by buildMetadata, and it holds the actual maximum number of bins across features, not the maximum possible/allowed number of bins.

