You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by jkbradley <gi...@git.apache.org> on 2016/03/21 03:18:15 UTC

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/11855

    [SPARK-12183][ML][MLLIB] Remove mllib tree implementation, and wrap spark.ml one

    Primary change:
    * Removed spark.mllib.tree.DecisionTree implementation of tree and forest learning.
    * spark.mllib now calls the spark.ml implementation.
    * Moved unit tests (of tree learning internals) from spark.mllib to spark.ml as needed.
    
    ml.tree.DecisionTreeModel
    * Added toOld and made ```private[spark]```, implemented for Classifier and Regressor in subclasses.  These methods now use OldInformationGainStats.invalidInformationGainStats for LeafNodes in order to mimic the spark.mllib implementation.
    
    ml.tree.Node
    * Added ```private[tree] def deepCopy```, used by unit tests
    
    Copied developer comments from spark.mllib implementation to spark.ml one.
    
    Moving unit tests
    * Tree learning internals were tested by spark.mllib.tree.DecisionTreeSuite, or spark.mllib.tree.RandomForestSuite.
    * Those tests were all moved to spark.ml.tree.impl.RandomForestSuite.
    * I made minimal changes to each test to allow it to run.  Each test makes the same checks as before, except for a few removed assertions which were checking irrelevant values.
    * No new unit tests were added.
    * mllib.tree.DecisionTreeSuite: I removed some checks of splits and bins which were not relevant to the unit tests they were in.  Those same split calculations were already being tested in other unit tests, for each dataset type.
    
    **Changes of behavior** (to be noted in SPARK-13448 once this PR is merged)
    * spark.ml.tree.impl.RandomForest: Rather than throwing an error when maxMemoryInMB is set to too small a value (to split any node), we now allow 1 node to be split, even if its memory requirements exceed maxMemoryInMB.  This involved removing the maxMemoryPerNode check in RandomForest.run, as well as modifying selectNodesToSplit().  Once this PR is merged, I will note the change of behavior on SPARK-13448.
    * spark.mllib.tree.DecisionTree: When a tree only has one node (root = leaf node), the "stats" field will now be empty, rather than being set to InformationGainStats.invalidInformationGainStats.  This does not remove information from the tree, and it will save a bit of storage.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark remove-mllib-tree-impl

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11855.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11855
    
----
commit 109ef199efe212055a7c9858b080c8e70c6fb7a5
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2016-03-20T19:30:43Z

    Removed spark.mllib DecisionTree implementation and used spark.ml one instead.  Moved unit tests where necessary from mllib to ml

commit 204588748aa6eb66a84902f6133bbbbe77f9ea65
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2016-03-20T22:31:58Z

    Fixed unit tests

commit d6433dabfe6835d7546e7f32d477ccd48bb833a7
Author: Joseph K. Bradley <jo...@databricks.com>
Date:   2016-03-21T01:58:59Z

    changed ml.tree.Node.toOld so it does not use invalidInformationGainStats for LeafNodes

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-199421771
  
    @jkbradley I made a pass. Mostly minor things on test behavior and the new memory changes. 
    
    It feels good to delete this much code! :+1: 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-200607002
  
    I just rebased to fix the merge conflicts.  The only change other than rebasing is in the last commit.  @sethah Thank you for reviewing!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r56855438
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
    @@ -33,6 +39,406 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
     
       import RandomForestSuite.mapToVec
     
    +  test("Binary classification with continuous features: split and bin calculation") {
    +    val arr = OldDTSuite.generateOrderedLabeledPointsWithLabel1()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, 3, 2, 100)
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    assert(splits(0).length === 99)
    +  }
    +
    +  test("Binary classification with binary (ordered) categorical features:" +
    +    " split and bin calculation") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 2,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 2, 1 -> 2))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("Binary classification with 3-ary (ordered) categorical features," +
    +    " with no samples for one category") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 2,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 3, 1 -> 3))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("extract categories from a number for multiclass classification") {
    +    val l = RandomForest.extractMultiClassCategories(13, 10)
    +    assert(l.length === 3)
    +    assert(List(3.0, 2.0, 0.0).toSeq === l.toSeq)
    +  }
    +
    +  test("find splits for a continuous feature") {
    +    // find splits for normal case
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(6), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array.fill(200000)(math.random)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 5)
    +      assert(fakeMetadata.numSplits(0) === 5)
    +      assert(fakeMetadata.numBins(0) === 6)
    +      // check returned splits are distinct
    +      assert(splits.distinct.length === splits.length)
    +    }
    +
    +    // find splits should not return identical splits
    +    // when there are not enough split candidates, reduce the number of splits in metadata
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(5), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 3)
    +      // check returned splits are distinct
    +      assert(splits.distinct.length === splits.length)
    +    }
    +
    +    // find splits when most samples close to the minimum
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(3), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 5).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 2)
    +      assert(splits(0) === 2.0)
    +      assert(splits(1) === 3.0)
    +    }
    +
    +    // find splits when most samples close to the maximum
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(3), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 1)
    +      assert(splits(0) === 1.0)
    +    }
    +  }
    +
    +  test("Multiclass classification with unordered categorical features:" +
    +    " split and bin calculations") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(
    +      OldAlgo.Classification,
    +      Gini,
    +      maxDepth = 2,
    +      numClasses = 100,
    +      maxBins = 100,
    +      categoricalFeaturesInfo = Map(0 -> 3, 1 -> 3))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(metadata.isUnordered(featureIndex = 0))
    +    assert(metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    assert(splits(0).length === 3)
    +    assert(metadata.numSplits(0) === 3)
    +    assert(metadata.numBins(0) === 3)
    +    assert(metadata.numSplits(1) === 3)
    +    assert(metadata.numBins(1) === 3)
    +
    +    // Expecting 2^2 - 1 = 3 splits per feature
    +    def checkCategoricalSplit(s: Split, featureIndex: Int, leftCategories: Array[Double]): Unit = {
    +      assert(s.featureIndex === featureIndex)
    +      assert(s.isInstanceOf[CategoricalSplit])
    +      val s0 = s.asInstanceOf[CategoricalSplit]
    +      assert(s0.leftCategories === leftCategories)
    +      assert(s0.numCategories === 3)  // for this unit test
    +    }
    +    // Feature 0
    +    checkCategoricalSplit(splits(0)(0), 0, Array(0.0))
    +    checkCategoricalSplit(splits(0)(1), 0, Array(1.0))
    +    checkCategoricalSplit(splits(0)(2), 0, Array(0.0, 1.0))
    +    // Feature 1
    +    checkCategoricalSplit(splits(1)(0), 1, Array(0.0))
    +    checkCategoricalSplit(splits(1)(1), 1, Array(1.0))
    +    checkCategoricalSplit(splits(1)(2), 1, Array(0.0, 1.0))
    +  }
    +
    +  test("Multiclass classification with ordered categorical features: split and bin calculations") {
    +    val arr = OldDTSuite.generateCategoricalDataPointsForMulticlassForOrderedFeatures()
    +    assert(arr.length === 3000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 100,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 10, 1 -> 10))
    +    // 2^(10-1) - 1 > 100, so categorical features will be ordered
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("Avoid aggregation on the last level") {
    +    val arr = Array(
    +      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.0)),
    +      LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0)))
    +    val input = sc.parallelize(arr)
    +
    +    val strategy = new OldStrategy(algo = OldAlgo.Classification, impurity = Gini, maxDepth = 1,
    +      numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3))
    +    val metadata = DecisionTreeMetadata.buildMetadata(input, strategy)
    +    val splits = RandomForest.findSplits(input, metadata, seed = 42)
    +
    +    val treeInput = TreePoint.convertToTreeRDD(input, splits, metadata)
    +    val baggedInput = BaggedPoint.convertToBaggedRDD(treeInput, 1.0, 1, withReplacement = false)
    +
    +    val topNode = LearningNode.emptyNode(nodeIndex = 1)
    +    assert(topNode.isLeaf === false)
    +
    +    val nodesForGroup = Map((0, Array(topNode)))
    +    val treeToNodeToIndexInfo = Map((0, Map(
    +      (topNode.id, new RandomForest.NodeIndexInfo(0, None))
    +    )))
    +    val nodeQueue = new mutable.Queue[(Int, LearningNode)]()
    +    RandomForest.findBestSplits(baggedInput, metadata, Array(topNode),
    +      nodesForGroup, treeToNodeToIndexInfo, splits, nodeQueue)
    +
    +    // don't enqueue leaf nodes into node queue
    +    assert(nodeQueue.isEmpty)
    +
    +    // set impurity and predict for topNode
    +    assert(topNode.toNode.prediction !== Double.MinValue)
    +    assert(topNode.stats.impurity !== -1.0)
    +
    +    // set impurity and predict for child nodes
    +    assert(topNode.leftChild.get.toNode.prediction === 0.0)
    +    assert(topNode.rightChild.get.toNode.prediction === 1.0)
    +    assert(topNode.leftChild.get.stats.impurity === 0.0)
    +    assert(topNode.rightChild.get.stats.impurity === 0.0)
    +  }
    +
    +  test("Avoid aggregation if impurity is 0.0") {
    +    val arr = Array(
    +      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.0)),
    +      LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0)))
    +    val input = sc.parallelize(arr)
    +
    +    val strategy = new OldStrategy(algo = OldAlgo.Classification, impurity = Gini, maxDepth = 5,
    +      numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3))
    +    val metadata = DecisionTreeMetadata.buildMetadata(input, strategy)
    +    val splits = RandomForest.findSplits(input, metadata, seed = 42)
    +
    +    val treeInput = TreePoint.convertToTreeRDD(input, splits, metadata)
    +    val baggedInput = BaggedPoint.convertToBaggedRDD(treeInput, 1.0, 1, withReplacement = false)
    +
    +    val topNode = LearningNode.emptyNode(nodeIndex = 1)
    +    assert(topNode.isLeaf === false)
    +
    +    val nodesForGroup = Map((0, Array(topNode)))
    +    val treeToNodeToIndexInfo = Map((0, Map(
    +      (topNode.id, new RandomForest.NodeIndexInfo(0, None))
    +    )))
    +    val nodeQueue = new mutable.Queue[(Int, LearningNode)]()
    +    RandomForest.findBestSplits(baggedInput, metadata, Array(topNode),
    +      nodesForGroup, treeToNodeToIndexInfo, splits, nodeQueue)
    +
    +    // don't enqueue a node into node queue if its impurity is 0.0
    +    assert(nodeQueue.isEmpty)
    +
    +    // set impurity and predict for topNode
    +    assert(topNode.toNode.prediction !== Double.MinValue)
    --- End diff --
    
    Comment above applies here as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r57263476
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1025,7 +1048,9 @@ private[ml] object RandomForest extends Logging {
           new mutable.HashMap[Int, mutable.HashMap[Int, NodeIndexInfo]]()
         var memUsage: Long = 0L
         var numNodesInGroup = 0
    -    while (nodeQueue.nonEmpty && memUsage < maxMemoryUsage) {
    +    // If maxMemoryInMB is set very small, we want to still try to split 1 node,
    +    // so we allow one iteration if memUsage == 0.
    +    while (nodeQueue.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
    --- End diff --
    
    I'm sure adding a node will always increase memory usage.  But if adding a node really doesn't increase memory usage, then there's presumably no harm in adding it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r57075513
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
    @@ -33,6 +39,406 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
     
       import RandomForestSuite.mapToVec
     
    +  test("Binary classification with continuous features: split and bin calculation") {
    +    val arr = OldDTSuite.generateOrderedLabeledPointsWithLabel1()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, 3, 2, 100)
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    assert(splits(0).length === 99)
    +  }
    +
    +  test("Binary classification with binary (ordered) categorical features:" +
    +    " split and bin calculation") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 2,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 2, 1 -> 2))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("Binary classification with 3-ary (ordered) categorical features," +
    +    " with no samples for one category") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 2,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 3, 1 -> 3))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("extract categories from a number for multiclass classification") {
    +    val l = RandomForest.extractMultiClassCategories(13, 10)
    +    assert(l.length === 3)
    +    assert(List(3.0, 2.0, 0.0).toSeq === l.toSeq)
    +  }
    +
    +  test("find splits for a continuous feature") {
    +    // find splits for normal case
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(6), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array.fill(200000)(math.random)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 5)
    +      assert(fakeMetadata.numSplits(0) === 5)
    +      assert(fakeMetadata.numBins(0) === 6)
    +      // check returned splits are distinct
    +      assert(splits.distinct.length === splits.length)
    +    }
    +
    +    // find splits should not return identical splits
    +    // when there are not enough split candidates, reduce the number of splits in metadata
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(5), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 3)
    +      // check returned splits are distinct
    +      assert(splits.distinct.length === splits.length)
    +    }
    +
    +    // find splits when most samples close to the minimum
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(3), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 5).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 2)
    +      assert(splits(0) === 2.0)
    +      assert(splits(1) === 3.0)
    +    }
    +
    +    // find splits when most samples close to the maximum
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(3), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 1)
    +      assert(splits(0) === 1.0)
    +    }
    +  }
    +
    +  test("Multiclass classification with unordered categorical features:" +
    +    " split and bin calculations") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(
    +      OldAlgo.Classification,
    +      Gini,
    +      maxDepth = 2,
    +      numClasses = 100,
    +      maxBins = 100,
    +      categoricalFeaturesInfo = Map(0 -> 3, 1 -> 3))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(metadata.isUnordered(featureIndex = 0))
    +    assert(metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    assert(splits(0).length === 3)
    +    assert(metadata.numSplits(0) === 3)
    +    assert(metadata.numBins(0) === 3)
    +    assert(metadata.numSplits(1) === 3)
    +    assert(metadata.numBins(1) === 3)
    +
    +    // Expecting 2^2 - 1 = 3 splits per feature
    +    def checkCategoricalSplit(s: Split, featureIndex: Int, leftCategories: Array[Double]): Unit = {
    +      assert(s.featureIndex === featureIndex)
    +      assert(s.isInstanceOf[CategoricalSplit])
    +      val s0 = s.asInstanceOf[CategoricalSplit]
    +      assert(s0.leftCategories === leftCategories)
    +      assert(s0.numCategories === 3)  // for this unit test
    +    }
    +    // Feature 0
    +    checkCategoricalSplit(splits(0)(0), 0, Array(0.0))
    +    checkCategoricalSplit(splits(0)(1), 0, Array(1.0))
    +    checkCategoricalSplit(splits(0)(2), 0, Array(0.0, 1.0))
    +    // Feature 1
    +    checkCategoricalSplit(splits(1)(0), 1, Array(0.0))
    +    checkCategoricalSplit(splits(1)(1), 1, Array(1.0))
    +    checkCategoricalSplit(splits(1)(2), 1, Array(0.0, 1.0))
    +  }
    +
    +  test("Multiclass classification with ordered categorical features: split and bin calculations") {
    +    val arr = OldDTSuite.generateCategoricalDataPointsForMulticlassForOrderedFeatures()
    +    assert(arr.length === 3000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 100,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 10, 1 -> 10))
    +    // 2^(10-1) - 1 > 100, so categorical features will be ordered
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("Avoid aggregation on the last level") {
    +    val arr = Array(
    +      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.0)),
    +      LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0)))
    +    val input = sc.parallelize(arr)
    +
    +    val strategy = new OldStrategy(algo = OldAlgo.Classification, impurity = Gini, maxDepth = 1,
    +      numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3))
    +    val metadata = DecisionTreeMetadata.buildMetadata(input, strategy)
    +    val splits = RandomForest.findSplits(input, metadata, seed = 42)
    +
    +    val treeInput = TreePoint.convertToTreeRDD(input, splits, metadata)
    +    val baggedInput = BaggedPoint.convertToBaggedRDD(treeInput, 1.0, 1, withReplacement = false)
    +
    +    val topNode = LearningNode.emptyNode(nodeIndex = 1)
    +    assert(topNode.isLeaf === false)
    +
    +    val nodesForGroup = Map((0, Array(topNode)))
    +    val treeToNodeToIndexInfo = Map((0, Map(
    +      (topNode.id, new RandomForest.NodeIndexInfo(0, None))
    +    )))
    +    val nodeQueue = new mutable.Queue[(Int, LearningNode)]()
    +    RandomForest.findBestSplits(baggedInput, metadata, Array(topNode),
    +      nodesForGroup, treeToNodeToIndexInfo, splits, nodeQueue)
    +
    +    // don't enqueue leaf nodes into node queue
    +    assert(nodeQueue.isEmpty)
    +
    +    // set impurity and predict for topNode
    +    assert(topNode.toNode.prediction !== Double.MinValue)
    +    assert(topNode.stats.impurity !== -1.0)
    +
    +    // set impurity and predict for child nodes
    +    assert(topNode.leftChild.get.toNode.prediction === 0.0)
    +    assert(topNode.rightChild.get.toNode.prediction === 1.0)
    +    assert(topNode.leftChild.get.stats.impurity === 0.0)
    +    assert(topNode.rightChild.get.stats.impurity === 0.0)
    +  }
    +
    +  test("Avoid aggregation if impurity is 0.0") {
    +    val arr = Array(
    +      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.0)),
    +      LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0)))
    +    val input = sc.parallelize(arr)
    +
    +    val strategy = new OldStrategy(algo = OldAlgo.Classification, impurity = Gini, maxDepth = 5,
    +      numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3))
    +    val metadata = DecisionTreeMetadata.buildMetadata(input, strategy)
    +    val splits = RandomForest.findSplits(input, metadata, seed = 42)
    +
    +    val treeInput = TreePoint.convertToTreeRDD(input, splits, metadata)
    +    val baggedInput = BaggedPoint.convertToBaggedRDD(treeInput, 1.0, 1, withReplacement = false)
    +
    +    val topNode = LearningNode.emptyNode(nodeIndex = 1)
    +    assert(topNode.isLeaf === false)
    +
    +    val nodesForGroup = Map((0, Array(topNode)))
    +    val treeToNodeToIndexInfo = Map((0, Map(
    +      (topNode.id, new RandomForest.NodeIndexInfo(0, None))
    +    )))
    +    val nodeQueue = new mutable.Queue[(Int, LearningNode)]()
    +    RandomForest.findBestSplits(baggedInput, metadata, Array(topNode),
    +      nodesForGroup, treeToNodeToIndexInfo, splits, nodeQueue)
    +
    +    // don't enqueue a node into node queue if its impurity is 0.0
    +    assert(nodeQueue.isEmpty)
    +
    +    // set impurity and predict for topNode
    +    assert(topNode.toNode.prediction !== Double.MinValue)
    +    assert(topNode.stats.impurity !== -1.0)
    +
    +    // set impurity and predict for child nodes
    +    assert(topNode.leftChild.get.toNode.prediction === 0.0)
    +    assert(topNode.rightChild.get.toNode.prediction === 1.0)
    +    assert(topNode.leftChild.get.stats.impurity === 0.0)
    +    assert(topNode.rightChild.get.stats.impurity === 0.0)
    +  }
    +
    +  test("Use soft prediction for binary classification with ordered categorical features") {
    +    // The following dataset is set up such that the best split is {1} vs. {0, 2}.
    +    // If the hard prediction is used to order the categories, then {0} vs. {1, 2} is chosen.
    +    val arr = Array(
    +      LabeledPoint(0.0, Vectors.dense(0.0)),
    +      LabeledPoint(0.0, Vectors.dense(0.0)),
    +      LabeledPoint(0.0, Vectors.dense(0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0)),
    +      LabeledPoint(0.0, Vectors.dense(1.0)),
    +      LabeledPoint(0.0, Vectors.dense(1.0)),
    +      LabeledPoint(0.0, Vectors.dense(1.0)),
    +      LabeledPoint(0.0, Vectors.dense(1.0)),
    +      LabeledPoint(0.0, Vectors.dense(2.0)),
    +      LabeledPoint(0.0, Vectors.dense(2.0)),
    +      LabeledPoint(0.0, Vectors.dense(2.0)),
    +      LabeledPoint(1.0, Vectors.dense(2.0)))
    +    val input = sc.parallelize(arr)
    +
    +    // Must set maxBins s.t. the feature will be treated as an ordered categorical feature.
    +    val strategy = new OldStrategy(algo = OldAlgo.Classification, impurity = Gini, maxDepth = 1,
    +      numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3), maxBins = 3)
    +
    +    val model = RandomForest.run(input, strategy, numTrees = 1, featureSubsetStrategy = "all",
    +      seed = 42).head
    +    model.rootNode match {
    +      case n: InternalNode => n.split match {
    +        case s: CategoricalSplit =>
    +          assert(s.leftCategories === Array(1.0))
    +      }
    +    }
    +  }
    +
    +  test("Second level node building with vs. without groups") {
    +    val arr = OldDTSuite.generateOrderedLabeledPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    // For tree with 1 group
    +    val strategy1 =
    +      new OldStrategy(OldAlgo.Classification, Entropy, 3, 2, 100, maxMemoryInMB = 1000)
    +    // For tree with multiple groups
    +    val strategy2 =
    +      new OldStrategy(OldAlgo.Classification, Entropy, 3, 2, 100, maxMemoryInMB = 0)
    +
    +    val tree1 = RandomForest.run(rdd, strategy1, numTrees = 1, featureSubsetStrategy = "all",
    +      seed = 42).head
    +    val tree2 = RandomForest.run(rdd, strategy2, numTrees = 1, featureSubsetStrategy = "all",
    +      seed = 42).head
    +
    +    def getChildren(rootNode: Node): Array[InternalNode] = rootNode match {
    +      case n: InternalNode =>
    +        assert(n.leftChild.isInstanceOf[InternalNode])
    +        assert(n.rightChild.isInstanceOf[InternalNode])
    +        Array(n.leftChild.asInstanceOf[InternalNode], n.rightChild.asInstanceOf[InternalNode])
    +    }
    +
    +    // Single group second level tree construction.
    +    val children1 = getChildren(tree1.rootNode)
    +    val children2 = getChildren(tree2.rootNode)
    +
    +    // Verify whether the splits obtained using single group and multiple group level
    +    // construction strategies are the same.
    +    for (i <- 0 until 2) {
    +      assert(children1(i).gain > 0)
    +      assert(children2(i).gain > 0)
    +      assert(children1(i).split === children2(i).split)
    +      assert(children1(i).impurity === children2(i).impurity)
    +      assert(children1(i).impurityStats.stats === children2(i).impurityStats.stats)
    +      assert(children1(i).leftChild.impurity === children2(i).leftChild.impurity)
    +      assert(children1(i).rightChild.impurity === children2(i).rightChild.impurity)
    +      assert(children1(i).prediction === children2(i).prediction)
    +    }
    +  }
    +
    +  def binaryClassificationTestWithContinuousFeaturesAndSubsampledFeatures(strategy: OldStrategy) {
    +    val numFeatures = 50
    +    val arr = EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures, 1000)
    +    val rdd = sc.parallelize(arr)
    +
    +    // Select feature subset for top nodes.  Return true if OK.
    +    def checkFeatureSubsetStrategy(
    +                                    numTrees: Int,
    --- End diff --
    
    done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r57075802
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
    @@ -852,10 +399,7 @@ class DecisionTreeSuite extends SparkFunSuite with MLlibTestSparkContext {
         }
     
         // test when no valid split can be found
    -    val rootNode = model.topNode
    -
    -    val gain = rootNode.stats.get
    -    assert(gain == InformationGainStats.invalidInformationGainStats)
    --- End diff --
    
    I'll modify it to check for reasonable values.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r57075903
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
    @@ -33,6 +39,406 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
     
       import RandomForestSuite.mapToVec
     
    +  test("Binary classification with continuous features: split and bin calculation") {
    --- End diff --
    
    done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-199090613
  
    **[Test build #53649 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53649/consoleFull)** for PR 11855 at commit [`bf6bcae`](https://github.com/apache/spark/commit/bf6bcae335e7a521e61e5f0e4241c80549d69026).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-200069056
  
    **[Test build #53830 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53830/consoleFull)** for PR 11855 at commit [`f82cda9`](https://github.com/apache/spark/commit/f82cda99359c278752014572dd42b234b8f1bab0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-199086732
  
    CC: @sethah @MLnick @yanboliang @yinxusen I'm CCing people who have looked at trees recently.  This is a long PR, but I hope the notes make it easy to review.  Thanks in advance!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-200069172
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53830/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-199343599
  
    This is not related to your changes, but I noticed that the return type in the comment [here](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L814) is incorrect. This seems like a reasonable PR to fix it in.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r57078137
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1047,6 +1076,12 @@ private[ml] object RandomForest extends Logging {
           numNodesInGroup += 1
           memUsage += nodeMemUsage
         }
    +    if (memUsage > maxMemoryUsage && maxMemoryUsage != 0) {
    --- End diff --
    
    Thanks, I think I left that in from an earlier solution.
    
    I agree it's a little odd to support 0, but honestly, I did it because it made writing one of the unit tests much easier.  The test for splitting a level of a tree in 1 group vs. multiple groups is pretty hard to write without it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r57079588
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1047,6 +1076,12 @@ private[ml] object RandomForest extends Logging {
           numNodesInGroup += 1
           memUsage += nodeMemUsage
         }
    +    if (memUsage > maxMemoryUsage && maxMemoryUsage != 0) {
    --- End diff --
    
    I saw that you had simplified it using that trick, very clever! I don't think it's a big issue either way.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r57075598
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
    @@ -539,18 +120,10 @@ class DecisionTreeSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(!metadata.isUnordered(featureIndex = 0))
         assert(!metadata.isUnordered(featureIndex = 1))
     
    -    val (splits, bins) = DecisionTree.findSplitsBins(rdd, metadata)
    -    assert(splits.length === 2)
    -    assert(splits(0).length === 99)
    -    assert(bins.length === 2)
    -    assert(bins(0).length === 100)
    -
         val rootNode = DecisionTree.train(rdd, strategy).topNode
     
    -    val stats = rootNode.stats.get
    -    assert(stats.gain === 0)
    -    assert(stats.leftImpurity === 0)
    -    assert(stats.rightImpurity === 0)
    +    assert(rootNode.impurity === 0)
    +    assert(rootNode.stats.isEmpty)
    --- End diff --
    
    done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-199087371
  
    **[Test build #53644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53644/consoleFull)** for PR 11855 at commit [`d6433da`](https://github.com/apache/spark/commit/d6433dabfe6835d7546e7f32d477ccd48bb833a7).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r56874894
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1047,6 +1076,12 @@ private[ml] object RandomForest extends Logging {
           numNodesInGroup += 1
           memUsage += nodeMemUsage
         }
    +    if (memUsage > maxMemoryUsage && maxMemoryUsage != 0) {
    --- End diff --
    
    I'm not sure why the check `&& maxMemoryUsage != 0` is needed. The way it is here, you can set `maxMemoryInMB` to 0, and it will train the RF one node at a time and issue no warnings. Then again, I'm not sure why you'd be allowed to set `maxMemoryInMB` to 0 at all. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r56869033
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1025,7 +1052,9 @@ private[ml] object RandomForest extends Logging {
           new mutable.HashMap[Int, mutable.HashMap[Int, NodeIndexInfo]]()
         var memUsage: Long = 0L
         var numNodesInGroup = 0
    -    while (nodeQueue.nonEmpty && memUsage < maxMemoryUsage) {
    +    // If maxMemoryInMB is set very small, we want to still try to split 1 node,
    --- End diff --
    
    Should we add a note to document this behavior in the Scala doc?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r56865956
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
    @@ -33,6 +39,406 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
     
       import RandomForestSuite.mapToVec
     
    +  test("Binary classification with continuous features: split and bin calculation") {
    --- End diff --
    
    "split and bin calculation" -> "split calculation"
    
    Same for next test case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-200621594
  
    **[Test build #53988 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53988/consoleFull)** for PR 11855 at commit [`7bc7e6d`](https://github.com/apache/spark/commit/7bc7e6dfb65e64a7a39c8154cc0707a2bfbdd5f2).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-199098548
  
    **[Test build #53649 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53649/consoleFull)** for PR 11855 at commit [`bf6bcae`](https://github.com/apache/spark/commit/bf6bcae335e7a521e61e5f0e4241c80549d69026).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-199087386
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53644/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-200607290
  
    **[Test build #53988 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53988/consoleFull)** for PR 11855 at commit [`7bc7e6d`](https://github.com/apache/spark/commit/7bc7e6dfb65e64a7a39c8154cc0707a2bfbdd5f2).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r57077853
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1025,7 +1052,9 @@ private[ml] object RandomForest extends Logging {
           new mutable.HashMap[Int, mutable.HashMap[Int, NodeIndexInfo]]()
         var memUsage: Long = 0L
         var numNodesInGroup = 0
    -    while (nodeQueue.nonEmpty && memUsage < maxMemoryUsage) {
    +    // If maxMemoryInMB is set very small, we want to still try to split 1 node,
    --- End diff --
    
    This is documented in the spark.ml Param Scala doc, as well as the spark.mllib Strategy Scala doc.  I'll add it to the Python doc too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-200054242
  
    **[Test build #53830 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53830/consoleFull)** for PR 11855 at commit [`f82cda9`](https://github.com/apache/spark/commit/f82cda99359c278752014572dd42b234b8f1bab0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r57073751
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
    @@ -33,6 +39,406 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
     
       import RandomForestSuite.mapToVec
     
    +  test("Binary classification with continuous features: split and bin calculation") {
    +    val arr = OldDTSuite.generateOrderedLabeledPointsWithLabel1()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, 3, 2, 100)
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    assert(splits(0).length === 99)
    +  }
    +
    +  test("Binary classification with binary (ordered) categorical features:" +
    +    " split and bin calculation") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 2,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 2, 1 -> 2))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("Binary classification with 3-ary (ordered) categorical features," +
    +    " with no samples for one category") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 2,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 3, 1 -> 3))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("extract categories from a number for multiclass classification") {
    +    val l = RandomForest.extractMultiClassCategories(13, 10)
    +    assert(l.length === 3)
    +    assert(List(3.0, 2.0, 0.0).toSeq === l.toSeq)
    +  }
    +
    +  test("find splits for a continuous feature") {
    +    // find splits for normal case
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(6), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array.fill(200000)(math.random)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 5)
    +      assert(fakeMetadata.numSplits(0) === 5)
    +      assert(fakeMetadata.numBins(0) === 6)
    +      // check returned splits are distinct
    +      assert(splits.distinct.length === splits.length)
    +    }
    +
    +    // find splits should not return identical splits
    +    // when there are not enough split candidates, reduce the number of splits in metadata
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(5), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 3)
    +      // check returned splits are distinct
    +      assert(splits.distinct.length === splits.length)
    +    }
    +
    +    // find splits when most samples close to the minimum
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(3), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 5).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 2)
    +      assert(splits(0) === 2.0)
    +      assert(splits(1) === 3.0)
    +    }
    +
    +    // find splits when most samples close to the maximum
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(3), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 1)
    +      assert(splits(0) === 1.0)
    +    }
    +  }
    +
    +  test("Multiclass classification with unordered categorical features:" +
    +    " split and bin calculations") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(
    +      OldAlgo.Classification,
    +      Gini,
    +      maxDepth = 2,
    +      numClasses = 100,
    +      maxBins = 100,
    +      categoricalFeaturesInfo = Map(0 -> 3, 1 -> 3))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(metadata.isUnordered(featureIndex = 0))
    +    assert(metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    assert(splits(0).length === 3)
    +    assert(metadata.numSplits(0) === 3)
    +    assert(metadata.numBins(0) === 3)
    +    assert(metadata.numSplits(1) === 3)
    +    assert(metadata.numBins(1) === 3)
    +
    +    // Expecting 2^2 - 1 = 3 splits per feature
    +    def checkCategoricalSplit(s: Split, featureIndex: Int, leftCategories: Array[Double]): Unit = {
    +      assert(s.featureIndex === featureIndex)
    +      assert(s.isInstanceOf[CategoricalSplit])
    +      val s0 = s.asInstanceOf[CategoricalSplit]
    +      assert(s0.leftCategories === leftCategories)
    +      assert(s0.numCategories === 3)  // for this unit test
    +    }
    +    // Feature 0
    +    checkCategoricalSplit(splits(0)(0), 0, Array(0.0))
    +    checkCategoricalSplit(splits(0)(1), 0, Array(1.0))
    +    checkCategoricalSplit(splits(0)(2), 0, Array(0.0, 1.0))
    +    // Feature 1
    +    checkCategoricalSplit(splits(1)(0), 1, Array(0.0))
    +    checkCategoricalSplit(splits(1)(1), 1, Array(1.0))
    +    checkCategoricalSplit(splits(1)(2), 1, Array(0.0, 1.0))
    +  }
    +
    +  test("Multiclass classification with ordered categorical features: split and bin calculations") {
    +    val arr = OldDTSuite.generateCategoricalDataPointsForMulticlassForOrderedFeatures()
    +    assert(arr.length === 3000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 100,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 10, 1 -> 10))
    +    // 2^(10-1) - 1 > 100, so categorical features will be ordered
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("Avoid aggregation on the last level") {
    +    val arr = Array(
    +      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.0)),
    +      LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0)))
    +    val input = sc.parallelize(arr)
    +
    +    val strategy = new OldStrategy(algo = OldAlgo.Classification, impurity = Gini, maxDepth = 1,
    +      numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3))
    +    val metadata = DecisionTreeMetadata.buildMetadata(input, strategy)
    +    val splits = RandomForest.findSplits(input, metadata, seed = 42)
    +
    +    val treeInput = TreePoint.convertToTreeRDD(input, splits, metadata)
    +    val baggedInput = BaggedPoint.convertToBaggedRDD(treeInput, 1.0, 1, withReplacement = false)
    +
    +    val topNode = LearningNode.emptyNode(nodeIndex = 1)
    +    assert(topNode.isLeaf === false)
    +
    +    val nodesForGroup = Map((0, Array(topNode)))
    +    val treeToNodeToIndexInfo = Map((0, Map(
    +      (topNode.id, new RandomForest.NodeIndexInfo(0, None))
    +    )))
    +    val nodeQueue = new mutable.Queue[(Int, LearningNode)]()
    +    RandomForest.findBestSplits(baggedInput, metadata, Array(topNode),
    +      nodesForGroup, treeToNodeToIndexInfo, splits, nodeQueue)
    +
    +    // don't enqueue leaf nodes into node queue
    +    assert(nodeQueue.isEmpty)
    +
    +    // set impurity and predict for topNode
    +    assert(topNode.toNode.prediction !== Double.MinValue)
    --- End diff --
    
    Thanks, good catch


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-200621876
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-199086954
  
    **[Test build #53644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53644/consoleFull)** for PR 11855 at commit [`d6433da`](https://github.com/apache/spark/commit/d6433dabfe6835d7546e7f32d477ccd48bb833a7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r57077278
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
    @@ -563,18 +136,10 @@ class DecisionTreeSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(!metadata.isUnordered(featureIndex = 0))
         assert(!metadata.isUnordered(featureIndex = 1))
     
    -    val (splits, bins) = DecisionTree.findSplitsBins(rdd, metadata)
    --- End diff --
    
    I don't think we're losing any coverage.  The choice of splits should have nothing to do with the labels or the impurity.  There are already tests in spark.ml RandomForestSuite testing split calculations for ordered categorical features and for binary classification with continuous features, so those should cover the tests which were removed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r56861596
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
    @@ -539,18 +120,10 @@ class DecisionTreeSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(!metadata.isUnordered(featureIndex = 0))
         assert(!metadata.isUnordered(featureIndex = 1))
     
    -    val (splits, bins) = DecisionTree.findSplitsBins(rdd, metadata)
    -    assert(splits.length === 2)
    -    assert(splits(0).length === 99)
    -    assert(bins.length === 2)
    -    assert(bins(0).length === 100)
    -
         val rootNode = DecisionTree.train(rdd, strategy).topNode
     
    -    val stats = rootNode.stats.get
    -    assert(stats.gain === 0)
    -    assert(stats.leftImpurity === 0)
    -    assert(stats.rightImpurity === 0)
    +    assert(rootNode.impurity === 0)
    +    assert(rootNode.stats.isEmpty)
    --- End diff --
    
    We should add `assert(rootNode.predict.predict === 0)` to be consistent with the test below.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r56855217
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
    @@ -33,6 +39,406 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
     
       import RandomForestSuite.mapToVec
     
    +  test("Binary classification with continuous features: split and bin calculation") {
    +    val arr = OldDTSuite.generateOrderedLabeledPointsWithLabel1()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, 3, 2, 100)
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    assert(splits(0).length === 99)
    +  }
    +
    +  test("Binary classification with binary (ordered) categorical features:" +
    +    " split and bin calculation") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 2,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 2, 1 -> 2))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("Binary classification with 3-ary (ordered) categorical features," +
    +    " with no samples for one category") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 2,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 3, 1 -> 3))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("extract categories from a number for multiclass classification") {
    +    val l = RandomForest.extractMultiClassCategories(13, 10)
    +    assert(l.length === 3)
    +    assert(List(3.0, 2.0, 0.0).toSeq === l.toSeq)
    +  }
    +
    +  test("find splits for a continuous feature") {
    +    // find splits for normal case
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(6), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array.fill(200000)(math.random)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 5)
    +      assert(fakeMetadata.numSplits(0) === 5)
    +      assert(fakeMetadata.numBins(0) === 6)
    +      // check returned splits are distinct
    +      assert(splits.distinct.length === splits.length)
    +    }
    +
    +    // find splits should not return identical splits
    +    // when there are not enough split candidates, reduce the number of splits in metadata
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(5), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 3)
    +      // check returned splits are distinct
    +      assert(splits.distinct.length === splits.length)
    +    }
    +
    +    // find splits when most samples close to the minimum
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(3), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 5).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 2)
    +      assert(splits(0) === 2.0)
    +      assert(splits(1) === 3.0)
    +    }
    +
    +    // find splits when most samples close to the maximum
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(3), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 1)
    +      assert(splits(0) === 1.0)
    +    }
    +  }
    +
    +  test("Multiclass classification with unordered categorical features:" +
    +    " split and bin calculations") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(
    +      OldAlgo.Classification,
    +      Gini,
    +      maxDepth = 2,
    +      numClasses = 100,
    +      maxBins = 100,
    +      categoricalFeaturesInfo = Map(0 -> 3, 1 -> 3))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(metadata.isUnordered(featureIndex = 0))
    +    assert(metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    assert(splits(0).length === 3)
    +    assert(metadata.numSplits(0) === 3)
    +    assert(metadata.numBins(0) === 3)
    +    assert(metadata.numSplits(1) === 3)
    +    assert(metadata.numBins(1) === 3)
    +
    +    // Expecting 2^2 - 1 = 3 splits per feature
    +    def checkCategoricalSplit(s: Split, featureIndex: Int, leftCategories: Array[Double]): Unit = {
    +      assert(s.featureIndex === featureIndex)
    +      assert(s.isInstanceOf[CategoricalSplit])
    +      val s0 = s.asInstanceOf[CategoricalSplit]
    +      assert(s0.leftCategories === leftCategories)
    +      assert(s0.numCategories === 3)  // for this unit test
    +    }
    +    // Feature 0
    +    checkCategoricalSplit(splits(0)(0), 0, Array(0.0))
    +    checkCategoricalSplit(splits(0)(1), 0, Array(1.0))
    +    checkCategoricalSplit(splits(0)(2), 0, Array(0.0, 1.0))
    +    // Feature 1
    +    checkCategoricalSplit(splits(1)(0), 1, Array(0.0))
    +    checkCategoricalSplit(splits(1)(1), 1, Array(1.0))
    +    checkCategoricalSplit(splits(1)(2), 1, Array(0.0, 1.0))
    +  }
    +
    +  test("Multiclass classification with ordered categorical features: split and bin calculations") {
    +    val arr = OldDTSuite.generateCategoricalDataPointsForMulticlassForOrderedFeatures()
    +    assert(arr.length === 3000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 100,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 10, 1 -> 10))
    +    // 2^(10-1) - 1 > 100, so categorical features will be ordered
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("Avoid aggregation on the last level") {
    +    val arr = Array(
    +      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.0)),
    +      LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0)))
    +    val input = sc.parallelize(arr)
    +
    +    val strategy = new OldStrategy(algo = OldAlgo.Classification, impurity = Gini, maxDepth = 1,
    +      numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3))
    +    val metadata = DecisionTreeMetadata.buildMetadata(input, strategy)
    +    val splits = RandomForest.findSplits(input, metadata, seed = 42)
    +
    +    val treeInput = TreePoint.convertToTreeRDD(input, splits, metadata)
    +    val baggedInput = BaggedPoint.convertToBaggedRDD(treeInput, 1.0, 1, withReplacement = false)
    +
    +    val topNode = LearningNode.emptyNode(nodeIndex = 1)
    +    assert(topNode.isLeaf === false)
    +
    +    val nodesForGroup = Map((0, Array(topNode)))
    +    val treeToNodeToIndexInfo = Map((0, Map(
    +      (topNode.id, new RandomForest.NodeIndexInfo(0, None))
    +    )))
    +    val nodeQueue = new mutable.Queue[(Int, LearningNode)]()
    +    RandomForest.findBestSplits(baggedInput, metadata, Array(topNode),
    +      nodesForGroup, treeToNodeToIndexInfo, splits, nodeQueue)
    +
    +    // don't enqueue leaf nodes into node queue
    +    assert(nodeQueue.isEmpty)
    +
    +    // set impurity and predict for topNode
    +    assert(topNode.toNode.prediction !== Double.MinValue)
    --- End diff --
    
    This check seems a bit incoherent to me. The original test basically created an empty node, checked that the values were the proper defaults, then trained the node, then checked that the values had changed. Checking `prediction !== Double.MinValue` doesn't make sense since prediction was never set to `Double.MinValue`. IMO the proper analogue of this would be to check that `topNode.stats === null`, train the node, and then check that `topNode.stats !== null`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-199087381
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-200473537
  
    @sethah Let me know when you think this is ready---thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r57073360
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala ---
    @@ -443,84 +272,4 @@ object RandomForest extends Serializable with Logging {
       @Since("1.2.0")
       val supportedFeatureSubsetStrategies: Array[String] =
         Array("auto", "all", "sqrt", "log2", "onethird")
    --- End diff --
    
    done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/11855


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-200621881
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53988/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-200069171
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-200656539
  
    I'll go ahead and merge this with master.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-199098590
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r57075419
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
    @@ -33,6 +39,406 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
     
       import RandomForestSuite.mapToVec
     
    +  test("Binary classification with continuous features: split and bin calculation") {
    +    val arr = OldDTSuite.generateOrderedLabeledPointsWithLabel1()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, 3, 2, 100)
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    assert(splits(0).length === 99)
    +  }
    +
    +  test("Binary classification with binary (ordered) categorical features:" +
    +    " split and bin calculation") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 2,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 2, 1 -> 2))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("Binary classification with 3-ary (ordered) categorical features," +
    +    " with no samples for one category") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 2,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 3, 1 -> 3))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("extract categories from a number for multiclass classification") {
    +    val l = RandomForest.extractMultiClassCategories(13, 10)
    +    assert(l.length === 3)
    +    assert(List(3.0, 2.0, 0.0).toSeq === l.toSeq)
    +  }
    +
    +  test("find splits for a continuous feature") {
    +    // find splits for normal case
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(6), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array.fill(200000)(math.random)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 5)
    +      assert(fakeMetadata.numSplits(0) === 5)
    +      assert(fakeMetadata.numBins(0) === 6)
    +      // check returned splits are distinct
    +      assert(splits.distinct.length === splits.length)
    +    }
    +
    +    // find splits should not return identical splits
    +    // when there are not enough split candidates, reduce the number of splits in metadata
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(5), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 3)
    +      // check returned splits are distinct
    +      assert(splits.distinct.length === splits.length)
    +    }
    +
    +    // find splits when most samples close to the minimum
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(3), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 5).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 2)
    +      assert(splits(0) === 2.0)
    +      assert(splits(1) === 3.0)
    +    }
    +
    +    // find splits when most samples close to the maximum
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(3), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 1)
    +      assert(splits(0) === 1.0)
    +    }
    +  }
    +
    +  test("Multiclass classification with unordered categorical features:" +
    +    " split and bin calculations") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(
    +      OldAlgo.Classification,
    +      Gini,
    +      maxDepth = 2,
    +      numClasses = 100,
    +      maxBins = 100,
    +      categoricalFeaturesInfo = Map(0 -> 3, 1 -> 3))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(metadata.isUnordered(featureIndex = 0))
    +    assert(metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    assert(splits(0).length === 3)
    +    assert(metadata.numSplits(0) === 3)
    +    assert(metadata.numBins(0) === 3)
    +    assert(metadata.numSplits(1) === 3)
    +    assert(metadata.numBins(1) === 3)
    +
    +    // Expecting 2^2 - 1 = 3 splits per feature
    +    def checkCategoricalSplit(s: Split, featureIndex: Int, leftCategories: Array[Double]): Unit = {
    +      assert(s.featureIndex === featureIndex)
    +      assert(s.isInstanceOf[CategoricalSplit])
    +      val s0 = s.asInstanceOf[CategoricalSplit]
    +      assert(s0.leftCategories === leftCategories)
    +      assert(s0.numCategories === 3)  // for this unit test
    +    }
    +    // Feature 0
    +    checkCategoricalSplit(splits(0)(0), 0, Array(0.0))
    +    checkCategoricalSplit(splits(0)(1), 0, Array(1.0))
    +    checkCategoricalSplit(splits(0)(2), 0, Array(0.0, 1.0))
    +    // Feature 1
    +    checkCategoricalSplit(splits(1)(0), 1, Array(0.0))
    +    checkCategoricalSplit(splits(1)(1), 1, Array(1.0))
    +    checkCategoricalSplit(splits(1)(2), 1, Array(0.0, 1.0))
    +  }
    +
    +  test("Multiclass classification with ordered categorical features: split and bin calculations") {
    +    val arr = OldDTSuite.generateCategoricalDataPointsForMulticlassForOrderedFeatures()
    +    assert(arr.length === 3000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 100,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 10, 1 -> 10))
    +    // 2^(10-1) - 1 > 100, so categorical features will be ordered
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("Avoid aggregation on the last level") {
    +    val arr = Array(
    +      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.0)),
    +      LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0)))
    +    val input = sc.parallelize(arr)
    +
    +    val strategy = new OldStrategy(algo = OldAlgo.Classification, impurity = Gini, maxDepth = 1,
    +      numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3))
    +    val metadata = DecisionTreeMetadata.buildMetadata(input, strategy)
    +    val splits = RandomForest.findSplits(input, metadata, seed = 42)
    +
    +    val treeInput = TreePoint.convertToTreeRDD(input, splits, metadata)
    +    val baggedInput = BaggedPoint.convertToBaggedRDD(treeInput, 1.0, 1, withReplacement = false)
    +
    +    val topNode = LearningNode.emptyNode(nodeIndex = 1)
    +    assert(topNode.isLeaf === false)
    +
    +    val nodesForGroup = Map((0, Array(topNode)))
    +    val treeToNodeToIndexInfo = Map((0, Map(
    +      (topNode.id, new RandomForest.NodeIndexInfo(0, None))
    +    )))
    +    val nodeQueue = new mutable.Queue[(Int, LearningNode)]()
    +    RandomForest.findBestSplits(baggedInput, metadata, Array(topNode),
    +      nodesForGroup, treeToNodeToIndexInfo, splits, nodeQueue)
    +
    +    // don't enqueue leaf nodes into node queue
    +    assert(nodeQueue.isEmpty)
    +
    +    // set impurity and predict for topNode
    +    assert(topNode.toNode.prediction !== Double.MinValue)
    +    assert(topNode.stats.impurity !== -1.0)
    +
    +    // set impurity and predict for child nodes
    +    assert(topNode.leftChild.get.toNode.prediction === 0.0)
    +    assert(topNode.rightChild.get.toNode.prediction === 1.0)
    +    assert(topNode.leftChild.get.stats.impurity === 0.0)
    +    assert(topNode.rightChild.get.stats.impurity === 0.0)
    +  }
    +
    +  test("Avoid aggregation if impurity is 0.0") {
    +    val arr = Array(
    +      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.0)),
    +      LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0)))
    +    val input = sc.parallelize(arr)
    +
    +    val strategy = new OldStrategy(algo = OldAlgo.Classification, impurity = Gini, maxDepth = 5,
    +      numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3))
    +    val metadata = DecisionTreeMetadata.buildMetadata(input, strategy)
    +    val splits = RandomForest.findSplits(input, metadata, seed = 42)
    +
    +    val treeInput = TreePoint.convertToTreeRDD(input, splits, metadata)
    +    val baggedInput = BaggedPoint.convertToBaggedRDD(treeInput, 1.0, 1, withReplacement = false)
    +
    +    val topNode = LearningNode.emptyNode(nodeIndex = 1)
    +    assert(topNode.isLeaf === false)
    +
    +    val nodesForGroup = Map((0, Array(topNode)))
    +    val treeToNodeToIndexInfo = Map((0, Map(
    +      (topNode.id, new RandomForest.NodeIndexInfo(0, None))
    +    )))
    +    val nodeQueue = new mutable.Queue[(Int, LearningNode)]()
    +    RandomForest.findBestSplits(baggedInput, metadata, Array(topNode),
    +      nodesForGroup, treeToNodeToIndexInfo, splits, nodeQueue)
    +
    +    // don't enqueue a node into node queue if its impurity is 0.0
    +    assert(nodeQueue.isEmpty)
    +
    +    // set impurity and predict for topNode
    +    assert(topNode.toNode.prediction !== Double.MinValue)
    --- End diff --
    
    done


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-200518240
  
    You can remove `InformationGainStats` companion object altogether now, I believe. It is not used after this PR. This LGTM other than that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-200054197
  
    @sethah Thanks a lot for reviewing this patch!  I think I addressed everything.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r56866777
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
    @@ -563,18 +136,10 @@ class DecisionTreeSuite extends SparkFunSuite with MLlibTestSparkContext {
         assert(!metadata.isUnordered(featureIndex = 0))
         assert(!metadata.isUnordered(featureIndex = 1))
     
    -    val (splits, bins) = DecisionTree.findSplitsBins(rdd, metadata)
    --- End diff --
    
    Seems we are losing coverage by removing the test for `findSplits` here and not adding them back in the ML test suites. Since we have other tests for `findSplits` I'm not sure if that's a problem, but if we are simply trying to _move_ all of these tests then these tests of `findSplitsBins` then they should be added in ML RandomForestTestSuite.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r56863232
  
    --- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala ---
    @@ -852,10 +399,7 @@ class DecisionTreeSuite extends SparkFunSuite with MLlibTestSparkContext {
         }
     
         // test when no valid split can be found
    -    val rootNode = model.topNode
    -
    -    val gain = rootNode.stats.get
    -    assert(gain == InformationGainStats.invalidInformationGainStats)
    --- End diff --
    
    Does the check on L380 make sense anymore? It looks like we never use `InformationGainStats.invalidInformationGainStats` anymore so we could remove it altogether?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r57218161
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1025,7 +1048,9 @@ private[ml] object RandomForest extends Logging {
           new mutable.HashMap[Int, mutable.HashMap[Int, NodeIndexInfo]]()
         var memUsage: Long = 0L
         var numNodesInGroup = 0
    -    while (nodeQueue.nonEmpty && memUsage < maxMemoryUsage) {
    +    // If maxMemoryInMB is set very small, we want to still try to split 1 node,
    +    // so we allow one iteration if memUsage == 0.
    +    while (nodeQueue.nonEmpty && (memUsage < maxMemoryUsage || memUsage == 0)) {
    --- End diff --
    
    The idea is to train at minimum one node, regardless of memory constraints, I believe. Checking if `memUsage == 0` works, but a more direct way of saying this might be `mutableNodesForGroup.isEmpty`. The only reason I point this out is because, as I read the code, I was wondering if you could ever add a node without increasing `memUsage` (can `nodeMemUsage` ever be zero -- it cannot). So, it might make the code more understandable/explicit, but I don't feel too strongly about it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11855#issuecomment-199098591
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53649/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r56857278
  
    --- Diff: mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala ---
    @@ -33,6 +39,406 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
     
       import RandomForestSuite.mapToVec
     
    +  test("Binary classification with continuous features: split and bin calculation") {
    +    val arr = OldDTSuite.generateOrderedLabeledPointsWithLabel1()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, 3, 2, 100)
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    assert(splits(0).length === 99)
    +  }
    +
    +  test("Binary classification with binary (ordered) categorical features:" +
    +    " split and bin calculation") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 2,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 2, 1 -> 2))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("Binary classification with 3-ary (ordered) categorical features," +
    +    " with no samples for one category") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 2,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 3, 1 -> 3))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("extract categories from a number for multiclass classification") {
    +    val l = RandomForest.extractMultiClassCategories(13, 10)
    +    assert(l.length === 3)
    +    assert(List(3.0, 2.0, 0.0).toSeq === l.toSeq)
    +  }
    +
    +  test("find splits for a continuous feature") {
    +    // find splits for normal case
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(6), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array.fill(200000)(math.random)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 5)
    +      assert(fakeMetadata.numSplits(0) === 5)
    +      assert(fakeMetadata.numBins(0) === 6)
    +      // check returned splits are distinct
    +      assert(splits.distinct.length === splits.length)
    +    }
    +
    +    // find splits should not return identical splits
    +    // when there are not enough split candidates, reduce the number of splits in metadata
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(5), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 3)
    +      // check returned splits are distinct
    +      assert(splits.distinct.length === splits.length)
    +    }
    +
    +    // find splits when most samples close to the minimum
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(3), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 5).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 2)
    +      assert(splits(0) === 2.0)
    +      assert(splits(1) === 3.0)
    +    }
    +
    +    // find splits when most samples close to the maximum
    +    {
    +      val fakeMetadata = new DecisionTreeMetadata(1, 0, 0, 0,
    +        Map(), Set(),
    +        Array(3), Gini, QuantileStrategy.Sort,
    +        0, 0, 0.0, 0, 0
    +      )
    +      val featureSamples = Array(0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2).map(_.toDouble)
    +      val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
    +      assert(splits.length === 1)
    +      assert(splits(0) === 1.0)
    +    }
    +  }
    +
    +  test("Multiclass classification with unordered categorical features:" +
    +    " split and bin calculations") {
    +    val arr = OldDTSuite.generateCategoricalDataPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(
    +      OldAlgo.Classification,
    +      Gini,
    +      maxDepth = 2,
    +      numClasses = 100,
    +      maxBins = 100,
    +      categoricalFeaturesInfo = Map(0 -> 3, 1 -> 3))
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(metadata.isUnordered(featureIndex = 0))
    +    assert(metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    assert(splits(0).length === 3)
    +    assert(metadata.numSplits(0) === 3)
    +    assert(metadata.numBins(0) === 3)
    +    assert(metadata.numSplits(1) === 3)
    +    assert(metadata.numBins(1) === 3)
    +
    +    // Expecting 2^2 - 1 = 3 splits per feature
    +    def checkCategoricalSplit(s: Split, featureIndex: Int, leftCategories: Array[Double]): Unit = {
    +      assert(s.featureIndex === featureIndex)
    +      assert(s.isInstanceOf[CategoricalSplit])
    +      val s0 = s.asInstanceOf[CategoricalSplit]
    +      assert(s0.leftCategories === leftCategories)
    +      assert(s0.numCategories === 3)  // for this unit test
    +    }
    +    // Feature 0
    +    checkCategoricalSplit(splits(0)(0), 0, Array(0.0))
    +    checkCategoricalSplit(splits(0)(1), 0, Array(1.0))
    +    checkCategoricalSplit(splits(0)(2), 0, Array(0.0, 1.0))
    +    // Feature 1
    +    checkCategoricalSplit(splits(1)(0), 1, Array(0.0))
    +    checkCategoricalSplit(splits(1)(1), 1, Array(1.0))
    +    checkCategoricalSplit(splits(1)(2), 1, Array(0.0, 1.0))
    +  }
    +
    +  test("Multiclass classification with ordered categorical features: split and bin calculations") {
    +    val arr = OldDTSuite.generateCategoricalDataPointsForMulticlassForOrderedFeatures()
    +    assert(arr.length === 3000)
    +    val rdd = sc.parallelize(arr)
    +    val strategy = new OldStrategy(OldAlgo.Classification, Gini, maxDepth = 2, numClasses = 100,
    +      maxBins = 100, categoricalFeaturesInfo = Map(0 -> 10, 1 -> 10))
    +    // 2^(10-1) - 1 > 100, so categorical features will be ordered
    +
    +    val metadata = DecisionTreeMetadata.buildMetadata(rdd, strategy)
    +    assert(!metadata.isUnordered(featureIndex = 0))
    +    assert(!metadata.isUnordered(featureIndex = 1))
    +    val splits = RandomForest.findSplits(rdd, metadata, seed = 42)
    +    assert(splits.length === 2)
    +    // no splits pre-computed for ordered categorical features
    +    assert(splits(0).length === 0)
    +  }
    +
    +  test("Avoid aggregation on the last level") {
    +    val arr = Array(
    +      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.0)),
    +      LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0)))
    +    val input = sc.parallelize(arr)
    +
    +    val strategy = new OldStrategy(algo = OldAlgo.Classification, impurity = Gini, maxDepth = 1,
    +      numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3))
    +    val metadata = DecisionTreeMetadata.buildMetadata(input, strategy)
    +    val splits = RandomForest.findSplits(input, metadata, seed = 42)
    +
    +    val treeInput = TreePoint.convertToTreeRDD(input, splits, metadata)
    +    val baggedInput = BaggedPoint.convertToBaggedRDD(treeInput, 1.0, 1, withReplacement = false)
    +
    +    val topNode = LearningNode.emptyNode(nodeIndex = 1)
    +    assert(topNode.isLeaf === false)
    +
    +    val nodesForGroup = Map((0, Array(topNode)))
    +    val treeToNodeToIndexInfo = Map((0, Map(
    +      (topNode.id, new RandomForest.NodeIndexInfo(0, None))
    +    )))
    +    val nodeQueue = new mutable.Queue[(Int, LearningNode)]()
    +    RandomForest.findBestSplits(baggedInput, metadata, Array(topNode),
    +      nodesForGroup, treeToNodeToIndexInfo, splits, nodeQueue)
    +
    +    // don't enqueue leaf nodes into node queue
    +    assert(nodeQueue.isEmpty)
    +
    +    // set impurity and predict for topNode
    +    assert(topNode.toNode.prediction !== Double.MinValue)
    +    assert(topNode.stats.impurity !== -1.0)
    +
    +    // set impurity and predict for child nodes
    +    assert(topNode.leftChild.get.toNode.prediction === 0.0)
    +    assert(topNode.rightChild.get.toNode.prediction === 1.0)
    +    assert(topNode.leftChild.get.stats.impurity === 0.0)
    +    assert(topNode.rightChild.get.stats.impurity === 0.0)
    +  }
    +
    +  test("Avoid aggregation if impurity is 0.0") {
    +    val arr = Array(
    +      LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 1.0)),
    +      LabeledPoint(0.0, Vectors.dense(2.0, 0.0, 0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0, 2.0, 1.0)))
    +    val input = sc.parallelize(arr)
    +
    +    val strategy = new OldStrategy(algo = OldAlgo.Classification, impurity = Gini, maxDepth = 5,
    +      numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3))
    +    val metadata = DecisionTreeMetadata.buildMetadata(input, strategy)
    +    val splits = RandomForest.findSplits(input, metadata, seed = 42)
    +
    +    val treeInput = TreePoint.convertToTreeRDD(input, splits, metadata)
    +    val baggedInput = BaggedPoint.convertToBaggedRDD(treeInput, 1.0, 1, withReplacement = false)
    +
    +    val topNode = LearningNode.emptyNode(nodeIndex = 1)
    +    assert(topNode.isLeaf === false)
    +
    +    val nodesForGroup = Map((0, Array(topNode)))
    +    val treeToNodeToIndexInfo = Map((0, Map(
    +      (topNode.id, new RandomForest.NodeIndexInfo(0, None))
    +    )))
    +    val nodeQueue = new mutable.Queue[(Int, LearningNode)]()
    +    RandomForest.findBestSplits(baggedInput, metadata, Array(topNode),
    +      nodesForGroup, treeToNodeToIndexInfo, splits, nodeQueue)
    +
    +    // don't enqueue a node into node queue if its impurity is 0.0
    +    assert(nodeQueue.isEmpty)
    +
    +    // set impurity and predict for topNode
    +    assert(topNode.toNode.prediction !== Double.MinValue)
    +    assert(topNode.stats.impurity !== -1.0)
    +
    +    // set impurity and predict for child nodes
    +    assert(topNode.leftChild.get.toNode.prediction === 0.0)
    +    assert(topNode.rightChild.get.toNode.prediction === 1.0)
    +    assert(topNode.leftChild.get.stats.impurity === 0.0)
    +    assert(topNode.rightChild.get.stats.impurity === 0.0)
    +  }
    +
    +  test("Use soft prediction for binary classification with ordered categorical features") {
    +    // The following dataset is set up such that the best split is {1} vs. {0, 2}.
    +    // If the hard prediction is used to order the categories, then {0} vs. {1, 2} is chosen.
    +    val arr = Array(
    +      LabeledPoint(0.0, Vectors.dense(0.0)),
    +      LabeledPoint(0.0, Vectors.dense(0.0)),
    +      LabeledPoint(0.0, Vectors.dense(0.0)),
    +      LabeledPoint(1.0, Vectors.dense(0.0)),
    +      LabeledPoint(0.0, Vectors.dense(1.0)),
    +      LabeledPoint(0.0, Vectors.dense(1.0)),
    +      LabeledPoint(0.0, Vectors.dense(1.0)),
    +      LabeledPoint(0.0, Vectors.dense(1.0)),
    +      LabeledPoint(0.0, Vectors.dense(2.0)),
    +      LabeledPoint(0.0, Vectors.dense(2.0)),
    +      LabeledPoint(0.0, Vectors.dense(2.0)),
    +      LabeledPoint(1.0, Vectors.dense(2.0)))
    +    val input = sc.parallelize(arr)
    +
    +    // Must set maxBins s.t. the feature will be treated as an ordered categorical feature.
    +    val strategy = new OldStrategy(algo = OldAlgo.Classification, impurity = Gini, maxDepth = 1,
    +      numClasses = 2, categoricalFeaturesInfo = Map(0 -> 3), maxBins = 3)
    +
    +    val model = RandomForest.run(input, strategy, numTrees = 1, featureSubsetStrategy = "all",
    +      seed = 42).head
    +    model.rootNode match {
    +      case n: InternalNode => n.split match {
    +        case s: CategoricalSplit =>
    +          assert(s.leftCategories === Array(1.0))
    +      }
    +    }
    +  }
    +
    +  test("Second level node building with vs. without groups") {
    +    val arr = OldDTSuite.generateOrderedLabeledPoints()
    +    assert(arr.length === 1000)
    +    val rdd = sc.parallelize(arr)
    +    // For tree with 1 group
    +    val strategy1 =
    +      new OldStrategy(OldAlgo.Classification, Entropy, 3, 2, 100, maxMemoryInMB = 1000)
    +    // For tree with multiple groups
    +    val strategy2 =
    +      new OldStrategy(OldAlgo.Classification, Entropy, 3, 2, 100, maxMemoryInMB = 0)
    +
    +    val tree1 = RandomForest.run(rdd, strategy1, numTrees = 1, featureSubsetStrategy = "all",
    +      seed = 42).head
    +    val tree2 = RandomForest.run(rdd, strategy2, numTrees = 1, featureSubsetStrategy = "all",
    +      seed = 42).head
    +
    +    def getChildren(rootNode: Node): Array[InternalNode] = rootNode match {
    +      case n: InternalNode =>
    +        assert(n.leftChild.isInstanceOf[InternalNode])
    +        assert(n.rightChild.isInstanceOf[InternalNode])
    +        Array(n.leftChild.asInstanceOf[InternalNode], n.rightChild.asInstanceOf[InternalNode])
    +    }
    +
    +    // Single group second level tree construction.
    +    val children1 = getChildren(tree1.rootNode)
    +    val children2 = getChildren(tree2.rootNode)
    +
    +    // Verify whether the splits obtained using single group and multiple group level
    +    // construction strategies are the same.
    +    for (i <- 0 until 2) {
    +      assert(children1(i).gain > 0)
    +      assert(children2(i).gain > 0)
    +      assert(children1(i).split === children2(i).split)
    +      assert(children1(i).impurity === children2(i).impurity)
    +      assert(children1(i).impurityStats.stats === children2(i).impurityStats.stats)
    +      assert(children1(i).leftChild.impurity === children2(i).leftChild.impurity)
    +      assert(children1(i).rightChild.impurity === children2(i).rightChild.impurity)
    +      assert(children1(i).prediction === children2(i).prediction)
    +    }
    +  }
    +
    +  def binaryClassificationTestWithContinuousFeaturesAndSubsampledFeatures(strategy: OldStrategy) {
    +    val numFeatures = 50
    +    val arr = EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures, 1000)
    +    val rdd = sc.parallelize(arr)
    +
    +    // Select feature subset for top nodes.  Return true if OK.
    +    def checkFeatureSubsetStrategy(
    +                                    numTrees: Int,
    --- End diff --
    
    indentation


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12183][ML][MLLIB] Remove mllib tree imp...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11855#discussion_r56844348
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala ---
    @@ -443,84 +272,4 @@ object RandomForest extends Serializable with Logging {
       @Since("1.2.0")
       val supportedFeatureSubsetStrategies: Array[String] =
         Array("auto", "all", "sqrt", "log2", "onethird")
    --- End diff --
    
    These are now hard coded in two places. I think it would be better to have this call `RandomForestParams.supportedFeatureSubsetStrategies` in ml.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org