Posted to reviews@spark.apache.org by smurching <gi...@git.apache.org> on 2017/10/04 22:30:59 UTC

[GitHub] spark pull request #19433: [SPARK-3162] [MLlib][WIP] Add local tree training...

GitHub user smurching opened a pull request:

    https://github.com/apache/spark/pull/19433

    [SPARK-3162] [MLlib][WIP] Add local tree training for decision tree regressors

    ## What changes were proposed in this pull request?
    #### WIP, DO NOT MERGE
    
    ### Overview
    This PR adds local tree training for decision tree regressors as a first step for addressing [SPARK-3162](https://issues.apache.org/jira/browse/SPARK-3162) (train decision trees locally when possible). See [this design doc](https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit) for a detailed description of the proposed changes.
    
    Distributed training logic has been refactored but only minimally modified; the local tree training implementation leverages existing distributed training logic for computing impurities and splits. This shared logic has been refactored into `...Utils` objects (e.g. `SplitUtils.scala`, `ImpurityUtils.scala`). 
    
    ### How to Review
    
    Each commit in this PR adds non-overlapping functionality, so the PR should be reviewable commit-by-commit.
    
    Changes introduced by each commit:
    1. Adds new data structures for local tree training (`FeatureVector`, `TrainingInfo`) & associated unit tests (`LocalTreeDataSuite`)
    2. Adds shared utility methods for computing splits/impurities (`SplitUtils`, `ImpurityUtils`, `AggUpdateUtils`), largely copied from existing distributed training code in `RandomForest.scala`.
    3. Unit tests for split/impurity utility methods (`TreeSplitUtilsSuite`)
    4. Updates distributed training code in `RandomForest.scala` to depend on the utility methods introduced in 2.
    5. Adds local tree training logic (`LocalDecisionTree`); see the usage sketch after this list
    6. Local tree unit/integration tests (`LocalTreeUnitSuite`, `LocalTreeIntegrationSuite`)
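    
    For illustration, a hedged sketch of how the new entry point might be invoked (the `fitNode` signature matches the diffs quoted later in this thread; `treePoints`, `weights`, `metadata`, and `splits` are assumed inputs, not part of this PR's text):
    
    ```scala
    // Hypothetical invocation of the local-training entry point (not a public API).
    // Assumes binned training data and precomputed metadata/splits, as produced by
    // the existing distributed-training preprocessing.
    val root: Node = LocalDecisionTree.fitNode(
      input = treePoints,                            // Array[TreePoint]: binned instances
      instanceWeights = weights,                     // Array[Double]: one weight per instance
      node = LearningNode.emptyNode(nodeIndex = 1),  // root of the subtree to fit
      metadata = metadata,                           // DecisionTreeMetadata for the dataset
      splits = splits)                               // splits(i) = candidate splits for feature i
    ```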
    
    ## How was this patch tested?
    No existing tests were modified. The following new tests were added (also described above):
    * Unit tests for new data structures specific to local tree training (`LocalTreeDataSuite`, `LocalTreeUtilsSuite`)
    * Unit tests for impurity/split utility methods (`TreeSplitUtilsSuite`)
    * Unit tests for local tree training logic (`LocalTreeUnitSuite`)
    * Integration tests verifying that local & distributed tree training produce the same trees (`LocalTreeIntegrationSuite`)
    
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/smurching/spark pr-splitup

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19433.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19433
    
----
commit 219a12001383017e70f10cd7c785272e70e64b28
Author: Sid Murching <si...@databricks.com>
Date:   2017-10-04T20:55:35Z

    Add data structures for local tree training & associated tests (in LocalTreeDataSuite):
        * TrainingInfo: primary local tree training data structure; contains all information required to describe
        the state of the algorithm at any point during learning
        * FeatureVector: Stores data for an individual feature as an Array[Int]
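
        A hedged sketch of the FeatureVector shape implied above (field names inferred from
        usages like `col.values` and `col.indices` in the diffs later in this thread; the
        actual class may differ):

        ```scala
        // Column-major storage for one feature; illustrative only.
        private[impl] case class FeatureVector(
            featureIndex: Int,    // global index of this feature
            featureArity: Int,    // 0 for continuous features, number of categories otherwise
            values: Array[Int],   // binned feature values, sorted within each node's range
            indices: Array[Int])  // indices(i) = row index of values(i) in the training data
        ```

        (The diffs also show a three-argument factory, `FeatureVector(featureIndex, featureArity, col)`,
        that presumably builds this sorted form from a raw column.)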

commit 710714395c966f664af7f7b62226336675ec2ea7
Author: Sid Murching <si...@databricks.com>
Date:   2017-10-04T20:57:30Z

    Add utility methods used for impurity and split calculations during both local & distributed training:
     * AggUpdateUtils: Helper methods for updating sufficient stats for a given node
     * ImpurityUtils: Helper methods for impurity-related calculations during node split decisions
     * SplitUtils: Helper methods for choosing splits given sufficient stats
    
    NOTE: Both ImpurityUtils and SplitUtils primarily contain code taken from RandomForest.scala, with slight modifications.
    Tests for SplitUtils are contained in the next commit.

commit 49bf0ae9b275264e757de573f81b816437be77e7
Author: Sid Murching <si...@databricks.com>
Date:   2017-10-04T21:36:15Z

    Add test suites for utility methods used during best-split computation:
     * TreeSplitUtilsSuite: Test suite for SplitUtils
     * TreeTests: Add utility method (getMetadata) for TreeSplitUtilsSuite
    
     Also add methods used by these tests in LocalDecisionTree.scala, RandomForest.scala

commit bc54b165849202269b80bbac1a84afb857e87e31
Author: Sid Murching <si...@databricks.com>
Date:   2017-10-04T21:48:33Z

     Update RandomForest.scala to use new utility methods for impurity/split calculations

commit 6a68a5cc6a6b7087163bbe5681ad41aef5e3fd0a
Author: Sid Murching <si...@databricks.com>
Date:   2017-10-04T21:51:39Z

    Add local decision tree training logic

commit 9a7174ed4a62033abfe2325dc1a8c5850e07f5f3
Author: Sid Murching <si...@databricks.com>
Date:   2017-10-04T21:52:06Z

    Add local decision tree unit/integration tests

----


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r147307553
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala ---
    @@ -0,0 +1,250 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree._
    +import org.apache.spark.mllib.tree.model.ImpurityStats
    +
    +/** Object exposing methods for local training of decision trees */
    +private[ml] object LocalDecisionTree {
    +
    +  /**
    +   * Fully splits the passed-in node on the provided local dataset, returning
    +   * an InternalNode/LeafNode corresponding to the root of the resulting tree.
    +   *
    +   * @param node LearningNode to use as the root of the subtree fit on the passed-in dataset
    +   * @param metadata learning and dataset metadata for DecisionTree
    +   * @param splits splits(i) = array of splits for feature i
    +   */
    +  private[ml] def fitNode(
    +      input: Array[TreePoint],
    +      instanceWeights: Array[Double],
    +      node: LearningNode,
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Node = {
    +
    +    // The case with 1 node (depth = 0) is handled separately.
    +    // This allows all iterations in the depth > 0 case to use the same code.
    +    // TODO: Check that learning works when maxDepth > 0 but learning stops at 1 node (because of
    +    //       other parameters).
    +    if (metadata.maxDepth == 0) {
    +      return node.toNode
    +    }
    +
    +    // Prepare column store.
    +    //   Note: rowToColumnStoreDense checks to make sure numRows < Int.MaxValue.
    +    val colStoreInit: Array[Array[Int]]
    +    = LocalDecisionTreeUtils.rowToColumnStoreDense(input.map(_.binnedFeatures))
    +    val labels = input.map(_.label)
    +
    +    // Fit a regression model on the dataset, throwing an error if metadata indicates that
    +    // we should train a classifier.
    +    // TODO: Add support for training classifiers
    +    if (metadata.numClasses > 1 && metadata.numClasses <= 32) {
    +      throw new UnsupportedOperationException("Local training of a decision tree classifier is " +
    +        "unsupported; currently, only regression is supported")
    +    } else {
    +      trainRegressor(node, colStoreInit, instanceWeights, labels, metadata, splits)
    +    }
    +  }
    +
    +  /**
    +   * Locally fits a decision tree regressor.
    +   * TODO(smurching): Logic for fitting a classifier & regressor is the same; only difference
    +   * is impurity metric. Use the same logic for fitting a classifier.
    +   *
    +   * @param rootNode Node to use as root of the tree fit on the passed-in dataset
    +   * @param colStoreInit Array of columns of training data
    +   * @param instanceWeights Array of weights for each training example
    +   * @param metadata learning and dataset metadata for DecisionTree
    +   * @param splits splits(i) = Array of possible splits for feature i
    +   * @return LeafNode or InternalNode representation of rootNode
    +   */
    +  private[ml] def trainRegressor(
    +      rootNode: LearningNode,
    +      colStoreInit: Array[Array[Int]],
    +      instanceWeights: Array[Double],
    +      labels: Array[Double],
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Node = {
    +
    +    // Sort each column by decision tree node.
    +    val colStore: Array[FeatureVector] = colStoreInit.zipWithIndex.map { case (col, featureIndex) =>
    +      val featureArity: Int = metadata.featureArity.getOrElse(featureIndex, 0)
    +      FeatureVector(featureIndex, featureArity, col)
    +    }
    +
    +    val numRows = colStore.headOption match {
    +      case None => 0
    +      case Some(column) => column.values.length
    +    }
    +
    +    // Create a new PartitionInfo describing the status of our partially-trained subtree
    +    // at each iteration of training
    +    var trainingInfo: TrainingInfo = TrainingInfo(colStore, instanceWeights,
    +      nodeOffsets = Array[(Int, Int)]((0, numRows)), activeNodes = Array(rootNode))
    +
    +    // Iteratively learn, one level of the tree at a time.
    +    // Note: We do not use node IDs.
    +    var currentLevel = 0
    +    var doneLearning = false
    +
    +    while (currentLevel < metadata.maxDepth && !doneLearning) {
    +      // Splits each active node if possible, returning an array of new active nodes
    +      val activeNodes: Array[LearningNode] =
    +        computeBestSplits(trainingInfo, labels, metadata, splits)
    +      // Filter active node periphery by impurity.
    +      val estimatedRemainingActive = activeNodes.count(_.stats.impurity > 0.0)
    --- End diff --
    
    Oh true -- I'll update the doc for `currentLevelActiveNodes` to say:
    ```
     * @param currentLevelActiveNodes  Nodes which are active (could still be split).
     *                                 Inactive nodes are known to be leaves in the final tree.
    ```


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r150159113
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/SplitUtils.scala ---
    @@ -0,0 +1,215 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree.{CategoricalSplit, Split}
    +import org.apache.spark.mllib.tree.impurity.ImpurityCalculator
    +import org.apache.spark.mllib.tree.model.ImpurityStats
    +
    +/** Utility methods for choosing splits during local & distributed tree training. */
    +private[impl] object SplitUtils {
    +
    +  /** Sorts ordered feature categories by label centroid, returning an ordered list of categories */
    +  private def sortByCentroid(
    +      binAggregates: DTStatsAggregator,
    +      featureIndex: Int,
    +      featureIndexIdx: Int): List[Int] = {
    +    /* Each bin is one category (feature value).
    +     * The bins are ordered based on centroidForCategories, and this ordering determines which
    +     * splits are considered.  (With K categories, we consider K - 1 possible splits.)
    +     *
    +     * centroidForCategories is a list: (category, centroid)
    +     */
    +    val numCategories = binAggregates.metadata.numBins(featureIndex)
    +    val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
    +
    +    val centroidForCategories = Range(0, numCategories).map { featureValue =>
    +      val categoryStats =
    +        binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue)
    +      val centroid = ImpurityUtils.getCentroid(binAggregates.metadata, categoryStats)
    +      (featureValue, centroid)
    +    }
    +    // TODO(smurching): How to handle logging statements like these?
    +    // logDebug("Centroids for categorical variable: " + centroidForCategories.mkString(","))
    +    // bins sorted by centroids
    +    val categoriesSortedByCentroid = centroidForCategories.toList.sortBy(_._2).map(_._1)
    +    // logDebug("Sorted centroids for categorical variable = " +
    +    //   categoriesSortedByCentroid.mkString(","))
    +    categoriesSortedByCentroid
    +  }
    +
    +  /**
    +   * Find the best split for an unordered categorical feature at a single node.
    +   *
    +   * Algorithm:
    +   *  - Considers all possible subsets (exponentially many)
    +   *
    +   * @param featureIndex  Global index of feature being split.
    +   * @param featureIndexIdx Index of feature being split within subset of features for current node.
    +   * @param featureSplits Array of splits for the current feature
    +   * @param parentCalculator Optional: ImpurityCalculator containing impurity stats for current node
    +   * @return  (best split, statistics for split)  If no valid split was found, the returned
    +   *          ImpurityStats instance will be invalid (have member valid = false).
    +   */
    +  private[impl] def chooseUnorderedCategoricalSplit(
    +      binAggregates: DTStatsAggregator,
    +      featureIndex: Int,
    +      featureIndexIdx: Int,
    +      featureSplits: Array[Split],
    +      parentCalculator: Option[ImpurityCalculator] = None): (Split, ImpurityStats) = {
    +    // Unordered categorical feature
    +    val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
    +    val numSplits = binAggregates.metadata.numSplits(featureIndex)
    +    var parentCalc = parentCalculator
    +    val (bestFeatureSplitIndex, bestFeatureGainStats) =
    +      Range(0, numSplits).map { splitIndex =>
    +        val leftChildStats = binAggregates.getImpurityCalculator(nodeFeatureOffset, splitIndex)
    +        val rightChildStats = binAggregates.getParentImpurityCalculator()
    +          .subtract(leftChildStats)
    +        val gainAndImpurityStats = ImpurityUtils.calculateImpurityStats(parentCalc,
    +          leftChildStats, rightChildStats, binAggregates.metadata)
    +        // Compute parent stats once, when considering first split for current feature
    +        if (parentCalc.isEmpty) {
    +          parentCalc = Some(gainAndImpurityStats.impurityCalculator)
    +        }
    +        (splitIndex, gainAndImpurityStats)
    +      }.maxBy(_._2.gain)
    +    (featureSplits(bestFeatureSplitIndex), bestFeatureGainStats)
    +
    +  }
    +
    +  /**
    +   * Choose splitting rule: feature value <= threshold
    +   *
    +   * @return  (best split, statistics for split)  If the best split actually puts all instances
    +   *          in one leaf node, then it will be set to None.  If no valid split was found, the
    +   *          returned ImpurityStats instance will be invalid (have member valid = false)
    +   */
    +  private[impl] def chooseContinuousSplit(
    +      binAggregates: DTStatsAggregator,
    +      featureIndex: Int,
    +      featureIndexIdx: Int,
    +      featureSplits: Array[Split],
    +      parentCalculator: Option[ImpurityCalculator] = None): (Split, ImpurityStats) = {
    +    // For a continuous feature, bins are already sorted for splitting
    +    // Number of "categories" = number of bins
    +    val sortedCategories = Range(0, binAggregates.metadata.numBins(featureIndex)).toList
    +    // Get & return best split info
    +    val (bestFeatureSplitIndex, bestFeatureGainStats) = orderedSplitHelper(binAggregates,
    +      featureIndex, featureIndexIdx, sortedCategories, parentCalculator)
    +    (featureSplits(bestFeatureSplitIndex), bestFeatureGainStats)
    +  }
    +
    +  /**
    +   * Computes the index of the best split for an ordered feature.
    +   * @param parentCalculator Optional: ImpurityCalculator containing impurity stats for current node
    +   */
    +  private def orderedSplitHelper(
    +      binAggregates: DTStatsAggregator,
    +      featureIndex: Int,
    +      featureIndexIdx: Int,
    +      categoriesSortedByCentroid: List[Int],
    +      parentCalculator: Option[ImpurityCalculator]): (Int, ImpurityStats) = {
    +    // Cumulative sum (scanLeft) of bin statistics.
    +    // Afterwards, binAggregates for a bin is the sum of aggregates for
    +    // that bin + all preceding bins.
    +    assert(!binAggregates.metadata.isUnordered(featureIndex))
    --- End diff --
    
    Remove this assert. (If there's any chance of it failing, then we should find a way to test it.)


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83093/
    Test PASSed.


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r146735946
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala ---
    @@ -0,0 +1,250 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree._
    +import org.apache.spark.mllib.tree.model.ImpurityStats
    +
    +/** Object exposing methods for local training of decision trees */
    +private[ml] object LocalDecisionTree {
    +
    +  /**
    +   * Fully splits the passed-in node on the provided local dataset, returning
    +   * an InternalNode/LeafNode corresponding to the root of the resulting tree.
    +   *
    +   * @param node LearningNode to use as the root of the subtree fit on the passed-in dataset
    +   * @param metadata learning and dataset metadata for DecisionTree
    +   * @param splits splits(i) = array of splits for feature i
    +   */
    +  private[ml] def fitNode(
    +      input: Array[TreePoint],
    +      instanceWeights: Array[Double],
    +      node: LearningNode,
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Node = {
    +
    +    // The case with 1 node (depth = 0) is handled separately.
    +    // This allows all iterations in the depth > 0 case to use the same code.
    +    // TODO: Check that learning works when maxDepth > 0 but learning stops at 1 node (because of
    +    //       other parameters).
    +    if (metadata.maxDepth == 0) {
    +      return node.toNode
    +    }
    +
    +    // Prepare column store.
    +    //   Note: rowToColumnStoreDense checks to make sure numRows < Int.MaxValue.
    +    val colStoreInit: Array[Array[Int]]
    +    = LocalDecisionTreeUtils.rowToColumnStoreDense(input.map(_.binnedFeatures))
    +    val labels = input.map(_.label)
    +
    +    // Fit a regression model on the dataset, throwing an error if metadata indicates that
    +    // we should train a classifier.
    +    // TODO: Add support for training classifiers
    +    if (metadata.numClasses > 1 && metadata.numClasses <= 32) {
    +      throw new UnsupportedOperationException("Local training of a decision tree classifier is " +
    +        "unsupported; currently, only regression is supported")
    +    } else {
    +      trainRegressor(node, colStoreInit, instanceWeights, labels, metadata, splits)
    +    }
    +  }
    +
    +  /**
    +   * Locally fits a decision tree regressor.
    +   * TODO(smurching): Logic for fitting a classifier & regressor is the same; only difference
    +   * is impurity metric. Use the same logic for fitting a classifier.
    +   *
    +   * @param rootNode Node to use as root of the tree fit on the passed-in dataset
    +   * @param colStoreInit Array of columns of training data
    +   * @param instanceWeights Array of weights for each training example
    +   * @param metadata learning and dataset metadata for DecisionTree
    +   * @param splits splits(i) = Array of possible splits for feature i
    +   * @return LeafNode or InternalNode representation of rootNode
    +   */
    +  private[ml] def trainRegressor(
    +      rootNode: LearningNode,
    +      colStoreInit: Array[Array[Int]],
    +      instanceWeights: Array[Double],
    +      labels: Array[Double],
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Node = {
    +
    +    // Sort each column by decision tree node.
    +    val colStore: Array[FeatureVector] = colStoreInit.zipWithIndex.map { case (col, featureIndex) =>
    +      val featureArity: Int = metadata.featureArity.getOrElse(featureIndex, 0)
    +      FeatureVector(featureIndex, featureArity, col)
    +    }
    +
    +    val numRows = colStore.headOption match {
    +      case None => 0
    +      case Some(column) => column.values.length
    +    }
    +
    +    // Create a new PartitionInfo describing the status of our partially-trained subtree
    +    // at each iteration of training
    +    var trainingInfo: TrainingInfo = TrainingInfo(colStore, instanceWeights,
    +      nodeOffsets = Array[(Int, Int)]((0, numRows)), activeNodes = Array(rootNode))
    +
    +    // Iteratively learn, one level of the tree at a time.
    +    // Note: We do not use node IDs.
    +    var currentLevel = 0
    +    var doneLearning = false
    +
    +    while (currentLevel < metadata.maxDepth && !doneLearning) {
    +      // Splits each active node if possible, returning an array of new active nodes
    +      val activeNodes: Array[LearningNode] =
    +        computeBestSplits(trainingInfo, labels, metadata, splits)
    +      // Filter active node periphery by impurity.
    +      val estimatedRemainingActive = activeNodes.count(_.stats.impurity > 0.0)
    --- End diff --
    
    Yes -- sorry for the confusion. The change I suggested was:
    ```
    val nextLevelNodes: Array[LearningNode] =
            computeBestSplits(trainingInfo, labels, metadata, splits)
    ```
    Does that look more reasonable?
    I'd also rename the member in `TrainingInfo`:
    `TrainingInfo.activeNodes` ==> `TrainingInfo.currentLevelNodes`


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83507 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83507/testReport)** for PR 19433 at commit [`b7e6e40`](https://github.com/apache/spark/commit/b7e6e40976612546b81d9775c194b274c146dc85).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83874 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83874/testReport)** for PR 19433 at commit [`d86dd18`](https://github.com/apache/spark/commit/d86dd18e47451c2e4463c68db441f92a898ac765).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #82570 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82570/testReport)** for PR 19433 at commit [`abc86b2`](https://github.com/apache/spark/commit/abc86b2042e0fd42cc0e9fe20cf79967b16e9779).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r146731070
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/AggUpdateUtils.scala ---
    @@ -0,0 +1,85 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree.Split
    +
    +/**
    + * Helpers for updating DTStatsAggregators during collection of sufficient stats for tree training.
    + */
    +private[impl] object AggUpdateUtils {
    +
    +  /**
    +   * Updates the parent node stats of the passed-in impurity aggregator with the labels
    +   * corresponding to the feature values at indices [from, to).
    +   */
    +  private[impl] def updateParentImpurity(
    +      statsAggregator: DTStatsAggregator,
    +      col: FeatureVector,
    +      from: Int,
    +      to: Int,
    +      instanceWeights: Array[Double],
    +      labels: Array[Double]): Unit = {
    +    from.until(to).foreach { idx =>
    +      val rowIndex = col.indices(idx)
    +      val label = labels(rowIndex)
    +      statsAggregator.updateParent(label, instanceWeights(rowIndex))
    +    }
    +  }
    +
    +  /**
    +   * Update aggregator for an (unordered feature, label) pair
    +   * @param splits Array of arrays of splits for each feature; splits(i) = splits for feature i.
    +   */
    +  private[impl] def updateUnorderedFeature(
    +      agg: DTStatsAggregator,
    +      featureValue: Int,
    +      label: Double,
    +      featureIndex: Int,
    +      featureIndexIdx: Int,
    +      splits: Array[Array[Split]],
    --- End diff --
    
    Good call, I'll make this change.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82557/
    Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83093 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83093/testReport)** for PR 19433 at commit [`ebade23`](https://github.com/apache/spark/commit/ebade2349153708c125fbeca900daec3d84c3513).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r150154413
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/AggUpdateUtils.scala ---
    @@ -0,0 +1,86 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree.Split
    +
    +/**
    + * Helpers for updating DTStatsAggregators during collection of sufficient stats for tree training.
    + */
    +private[impl] object AggUpdateUtils {
    +
    +  /**
    +   * Updates the parent node stats of the passed-in impurity aggregator with the labels
    +   * corresponding to the feature values at indices [from, to).
    +   * @param indices Array of row indices for feature values; indices(i) = row index of the ith
    +   *                feature value
    +   */
    +  private[impl] def updateParentImpurity(
    +      statsAggregator: DTStatsAggregator,
    +      indices: Array[Int],
    +      from: Int,
    +      to: Int,
    +      instanceWeights: Array[Double],
    +      labels: Array[Double]): Unit = {
    +    from.until(to).foreach { idx =>
    +      val rowIndex = indices(idx)
    +      val label = labels(rowIndex)
    +      statsAggregator.updateParent(label, instanceWeights(rowIndex))
    +    }
    +  }
    +
    +  /**
    +   * Update aggregator for an (unordered feature, label) pair
    +   * @param featureSplits Array of splits for the current feature
    +   */
    +  private[impl] def updateUnorderedFeature(
    +      agg: DTStatsAggregator,
    +      featureValue: Int,
    +      label: Double,
    +      featureIndex: Int,
    +      featureIndexIdx: Int,
    +      featureSplits: Array[Split],
    +      instanceWeight: Double): Unit = {
    +    val leftNodeFeatureOffset = agg.getFeatureOffset(featureIndexIdx)
    +    // Each unordered split has a corresponding bin for impurity stats of data points that fall
    +    // onto the left side of the split. For each unordered split, update left-side bin if applicable
    +    // for the current data point.
    +    val numSplits = agg.metadata.numSplits(featureIndex)
    +    var splitIndex = 0
    +    while (splitIndex < numSplits) {
    +      if (featureSplits(splitIndex).shouldGoLeft(featureValue, featureSplits)) {
    +        agg.featureUpdate(leftNodeFeatureOffset, splitIndex, label, instanceWeight)
    +      }
    +      splitIndex += 1
    +    }
    +  }
    +
    +  /** Update aggregator for an (ordered feature, label) pair */
    +  private[impl] def updateOrderedFeature(
    +      agg: DTStatsAggregator,
    +      featureValue: Int,
    +      label: Double,
    +      featureIndex: Int,
    --- End diff --
    
    featureIndex is not used


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83874/
    Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83503 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83503/testReport)** for PR 19433 at commit [`3f72cc0`](https://github.com/apache/spark/commit/3f72cc0d92132f850c5eebe2686473bff111199f).


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83503 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83503/testReport)** for PR 19433 at commit [`3f72cc0`](https://github.com/apache/spark/commit/3f72cc0d92132f850c5eebe2686473bff111199f).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    The tests in `DecisionTreeSuite` fail because we've historically handled
    
    a) splits that have 0 gain 
    
    differently from
    
    b) splits that fail to achieve user-specified minimum gain (`metadata.minInfoGain`) or don't meet minimum instance-counts per node (`metadata.minInstancesPerNode`).
    
    Previously we'd create a leaf node with valid impurity stats in case a) and invalid impurity stats in case b). This PR creates a leaf node with invalid impurity stats in both cases.
    
    As a fix I'd suggest creating a `LeafNode` with correct impurity stats in case a), but with the `stats.valid` member set to `false` to indicate that the node should not be split.
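    
    A rough sketch of that suggestion (illustrative only; the constructor below mirrors `org.apache.spark.mllib.tree.model.ImpurityStats`, and `parentStats` stands in for the stats already computed at the node):
    
    ```scala
    // Case a): a zero-gain split. Keep the correctly computed impurity stats,
    // but mark them invalid so the node becomes a leaf and is never re-split.
    val leafStats = new ImpurityStats(
      gain = 0.0,
      impurity = parentStats.impurity,                     // valid impurity value
      impurityCalculator = parentStats.impurityCalculator,
      leftImpurityCalculator = null,                       // a leaf has no children
      rightImpurityCalculator = null,
      valid = false)                                       // do not split further
    ```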
    



---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r144790384
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/AggUpdateUtils.scala ---
    @@ -0,0 +1,85 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree.Split
    +
    +/**
    + * Helpers for updating DTStatsAggregators during collection of sufficient stats for tree training.
    + */
    +private[impl] object AggUpdateUtils {
    +
    +  /**
    +   * Updates the parent node stats of the passed-in impurity aggregator with the labels
    +   * corresponding to the feature values at indices [from, to).
    +   */
    +  private[impl] def updateParentImpurity(
    +      statsAggregator: DTStatsAggregator,
    +      col: FeatureVector,
    +      from: Int,
    +      to: Int,
    +      instanceWeights: Array[Double],
    +      labels: Array[Double]): Unit = {
    +    from.until(to).foreach { idx =>
    +      val rowIndex = col.indices(idx)
    +      val label = labels(rowIndex)
    +      statsAggregator.updateParent(label, instanceWeights(rowIndex))
    +    }
    +  }
    +
    +  /**
    +   * Update aggregator for an (unordered feature, label) pair
    +   * @param splits Array of arrays of splits for each feature; splits(i) = splits for feature i.
    +   */
    +  private[impl] def updateUnorderedFeature(
    +      agg: DTStatsAggregator,
    +      featureValue: Int,
    +      label: Double,
    +      featureIndex: Int,
    +      featureIndexIdx: Int,
    +      splits: Array[Array[Split]],
    --- End diff --
    
    You only need to pass in the `featureSplit: Array[Split]`, don't pass all splits for all features.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83464/
    Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    > We'll actually only have to run an O(n log n) sort on continuous feature values once (i.e. in the FeatureVector constructor), since once the continuous features are sorted we can update them as we would for categorical features when splitting nodes (in O(n) time) and they'll remain sorted.
    
    Nice! So we only need one global sort up front, and then each split only needs an O(n) copy.
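    
    For anyone following along, a tiny illustration of why the sorted order survives each split (an assumed helper, not the PR's code): a stable O(n) partition of an already-sorted column leaves both children's ranges sorted, so the O(n log n) sort never needs repeating:
    
    ```scala
    // values:   binned feature values for one node's row range, already sorted
    // indices:  indices(i) = row index of values(i)
    // goesLeft: the chosen split's decision, keyed by row index
    def splitColumn(values: Array[Int], indices: Array[Int], goesLeft: Int => Boolean)
      : (Array[(Int, Int)], Array[(Int, Int)]) = {
      // Array.partition is stable and O(n), so each child's values stay sorted.
      values.zip(indices).partition { case (_, row) => goesLeft(row) }
    }
    ```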


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r144789015
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/TrainingInfo.scala ---
    @@ -0,0 +1,144 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.ml.tree.{LearningNode, Split}
    +import org.apache.spark.util.collection.BitSet
    +
    +/**
    + * Maintains intermediate state of data (columns) and tree during local tree training.
    + * Primary local tree training data structure; contains all information required to describe
    + * the state of the algorithm at any point during learning.
    + *
    + * Nodes are indexed left-to-right along the periphery of the tree, with 0-based indices.
    + * The "periphery" is the set of leaf nodes (active and inactive).
    + *
    + * @param columns  Array of columns.
    + *                 Each column is sorted first by nodes (left-to-right along the tree periphery);
    + *                 all columns share this first level of sorting.
    + *                 Within each node's group, each column is sorted based on feature value;
    + *                 this second level of sorting differs across columns.
    + * @param instanceWeights Array of weights for each training example
    + * @param nodeOffsets  Offsets into the columns indicating the first level of sorting (by node).
    + *                     The rows corresponding to the node activeNodes(i) are in the range
    + *                     [nodeOffsets(i)(0), nodeOffsets(i)(1)) .
    + * @param activeNodes  Nodes which are active (still being split).
    + *                     Inactive nodes are known to be leaves in the final tree.
    + */
    +private[impl] case class TrainingInfo(
    +    columns: Array[FeatureVector],
    +    instanceWeights: Array[Double],
    --- End diff --
    
    The `instanceWeights` are never updated between iterations, so why put them in the `TrainingInfo` structure?


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r150160368
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/SplitUtils.scala ---
    @@ -0,0 +1,215 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree.{CategoricalSplit, Split}
    +import org.apache.spark.mllib.tree.impurity.ImpurityCalculator
    +import org.apache.spark.mllib.tree.model.ImpurityStats
    +
    +/** Utility methods for choosing splits during local & distributed tree training. */
    +private[impl] object SplitUtils {
    +
    +  /** Sorts ordered feature categories by label centroid, returning an ordered list of categories */
    +  private def sortByCentroid(
    +      binAggregates: DTStatsAggregator,
    +      featureIndex: Int,
    +      featureIndexIdx: Int): List[Int] = {
    +    /* Each bin is one category (feature value).
    +     * The bins are ordered based on centroidForCategories, and this ordering determines which
    +     * splits are considered.  (With K categories, we consider K - 1 possible splits.)
    +     *
    +     * centroidForCategories is a list: (category, centroid)
    +     */
    +    val numCategories = binAggregates.metadata.numBins(featureIndex)
    +    val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
    +
    +    val centroidForCategories = Range(0, numCategories).map { featureValue =>
    +      val categoryStats =
    +        binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue)
    +      val centroid = ImpurityUtils.getCentroid(binAggregates.metadata, categoryStats)
    +      (featureValue, centroid)
    +    }
    +    // TODO(smurching): How to handle logging statements like these?
    +    // logDebug("Centroids for categorical variable: " + centroidForCategories.mkString(","))
    +    // bins sorted by centroids
    +    val categoriesSortedByCentroid = centroidForCategories.toList.sortBy(_._2).map(_._1)
    +    // logDebug("Sorted centroids for categorical variable = " +
    +    //   categoriesSortedByCentroid.mkString(","))
    +    categoriesSortedByCentroid
    +  }
    +
    +  /**
    +   * Find the best split for an unordered categorical feature at a single node.
    +   *
    +   * Algorithm:
    +   *  - Considers all possible subsets (exponentially many)
    +   *
    +   * @param featureIndex  Global index of feature being split.
    +   * @param featureIndexIdx Index of feature being split within subset of features for current node.
    +   * @param featureSplits Array of splits for the current feature
    +   * @param parentCalculator Optional: ImpurityCalculator containing impurity stats for current node
    +   * @return  (best split, statistics for split)  If no valid split was found, the returned
    +   *          ImpurityStats instance will be invalid (have member valid = false).
    +   */
    +  private[impl] def chooseUnorderedCategoricalSplit(
    +      binAggregates: DTStatsAggregator,
    +      featureIndex: Int,
    +      featureIndexIdx: Int,
    +      featureSplits: Array[Split],
    +      parentCalculator: Option[ImpurityCalculator] = None): (Split, ImpurityStats) = {
    +    // Unordered categorical feature
    +    val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
    +    val numSplits = binAggregates.metadata.numSplits(featureIndex)
    +    var parentCalc = parentCalculator
    --- End diff --
    
    It'd be nice to calculate the parentCalc right away here, if needed.  That seems possible just by taking the first candidate split.  Then we could simplify calculateImpurityStats by not passing in parentCalc as an option.
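    
    One way to read that suggestion as code (a sketch only; `getParentImpurityCalculator` appears in the diff above, but the exact refactor is up to the author):
    
    ```scala
    // Resolve the parent impurity calculator once, before the loop over
    // candidate splits, rather than filling it in lazily inside the loop.
    val parentCalc: ImpurityCalculator =
      parentCalculator.getOrElse(binAggregates.getParentImpurityCalculator())
    ```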


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83873 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83873/testReport)** for PR 19433 at commit [`0b27c56`](https://github.com/apache/spark/commit/0b27c56d1ea4e1108a62b77e9eca8ae160740756).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #82717 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82717/testReport)** for PR 19433 at commit [`c9a8e01`](https://github.com/apache/spark/commit/c9a8e01cead78d2ea6eb6a9ffa007314cbbcf60a).


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Build finished. Test PASSed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83874 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83874/testReport)** for PR 19433 at commit [`d86dd18`](https://github.com/apache/spark/commit/d86dd18e47451c2e4463c68db441f92a898ac765).


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #82721 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82721/testReport)** for PR 19433 at commit [`93e17fc`](https://github.com/apache/spark/commit/93e17fc74958d4fa8f3bea38731ecec662e4ca66).


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83464 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83464/testReport)** for PR 19433 at commit [`3f72cc0`](https://github.com/apache/spark/commit/3f72cc0d92132f850c5eebe2686473bff111199f).


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r146731083
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala ---
    @@ -0,0 +1,250 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree._
    +import org.apache.spark.mllib.tree.model.ImpurityStats
    +
    +/** Object exposing methods for local training of decision trees */
    +private[ml] object LocalDecisionTree {
    +
    +  /**
    +   * Fully splits the passed-in node on the provided local dataset, returning
    +   * an InternalNode/LeafNode corresponding to the root of the resulting tree.
    +   *
    +   * @param node LearningNode to use as the root of the subtree fit on the passed-in dataset
    +   * @param metadata learning and dataset metadata for DecisionTree
    +   * @param splits splits(i) = array of splits for feature i
    +   */
    +  private[ml] def fitNode(
    +      input: Array[TreePoint],
    +      instanceWeights: Array[Double],
    +      node: LearningNode,
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Node = {
    +
    +    // The case with 1 node (depth = 0) is handled separately.
    +    // This allows all iterations in the depth > 0 case to use the same code.
    +    // TODO: Check that learning works when maxDepth > 0 but learning stops at 1 node (because of
    +    //       other parameters).
    +    if (metadata.maxDepth == 0) {
    +      return node.toNode
    +    }
    +
    +    // Prepare column store.
    +    //   Note: rowToColumnStoreDense checks to make sure numRows < Int.MaxValue.
    +    val colStoreInit: Array[Array[Int]] =
    +      LocalDecisionTreeUtils.rowToColumnStoreDense(input.map(_.binnedFeatures))
    +    val labels = input.map(_.label)
    +
    +    // Fit a regression model on the dataset, throwing an error if metadata indicates that
    +    // we should train a classifier.
    +    // TODO: Add support for training classifiers
    +    if (metadata.numClasses > 1 && metadata.numClasses <= 32) {
    +      throw new UnsupportedOperationException("Local training of a decision tree classifier is " +
    +        "unsupported; currently, only regression is supported")
    +    } else {
    +      trainRegressor(node, colStoreInit, instanceWeights, labels, metadata, splits)
    +    }
    +  }
    +
    +  /**
    +   * Locally fits a decision tree regressor.
    +   * TODO(smurching): Logic for fitting a classifier & regressor is the same; only difference
    +   * is impurity metric. Use the same logic for fitting a classifier.
    +   *
    +   * @param rootNode Node to use as root of the tree fit on the passed-in dataset
    +   * @param colStoreInit Array of columns of training data
    +   * @param instanceWeights Array of weights for each training example
    +   * @param metadata learning and dataset metadata for DecisionTree
    +   * @param splits splits(i) = Array of possible splits for feature i
    +   * @return LeafNode or InternalNode representation of rootNode
    +   */
    +  private[ml] def trainRegressor(
    +      rootNode: LearningNode,
    +      colStoreInit: Array[Array[Int]],
    +      instanceWeights: Array[Double],
    +      labels: Array[Double],
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Node = {
    +
    +    // Sort each column by decision tree node.
    +    val colStore: Array[FeatureVector] = colStoreInit.zipWithIndex.map { case (col, featureIndex) =>
    +      val featureArity: Int = metadata.featureArity.getOrElse(featureIndex, 0)
    +      FeatureVector(featureIndex, featureArity, col)
    +    }
    +
    +    val numRows = colStore.headOption match {
    +      case None => 0
    +      case Some(column) => column.values.length
    +    }
    +
    +    // Create a new TrainingInfo describing the status of our partially-trained subtree
    +    // at each iteration of training
    +    var trainingInfo: TrainingInfo = TrainingInfo(colStore, instanceWeights,
    +      nodeOffsets = Array[(Int, Int)]((0, numRows)), activeNodes = Array(rootNode))
    +
    +    // Iteratively learn, one level of the tree at a time.
    +    // Note: We do not use node IDs.
    +    var currentLevel = 0
    +    var doneLearning = false
    +
    +    while (currentLevel < metadata.maxDepth && !doneLearning) {
    +      // Splits each active node if possible, returning an array of new active nodes
    +      val activeNodes: Array[LearningNode] =
    +        computeBestSplits(trainingInfo, labels, metadata, splits)
    +      // Filter active node periphery by impurity.
    +      val estimatedRemainingActive = activeNodes.count(_.stats.impurity > 0.0)
    --- End diff --
    
    Agreed on using `isLeaf` instead of checking for positive impurity, thanks for the suggestion.
    
    AFAICT at this point in the code `activeNodes` actually does refer to the nodes in the current level; the children of the nodes in `activeNodes` form the next level and are returned by `computeBestSplits`. I forgot to include the return type of `computeBestSplits` in its method signature, which probably made this more confusing - my mistake.
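    For reference, a sketch of that signature with the return type spelled out (parameter list taken from the call site in the quoted diff; a sketch, not the final code):

        private[impl] def computeBestSplits(
            trainingInfo: TrainingInfo,
            labels: Array[Double],
            metadata: DecisionTreeMetadata,
            splits: Array[Array[Split]]): Array[LearningNode] = {
          // Split each active node if possible; return the resulting children,
          // which form the next level's active periphery.
          ???
        }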


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    add to whitelist


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib][WIP] Add local tree training for de...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Thanks! I'll remove the WIP. To clear things up for the future, I'd thought [WIP] was the appropriate tag for a PR that's ready for review but not ready to be merged (based on https://spark.apache.org/contributing.html) -- have we stopped using the WIP tag?


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #3983 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3983/testReport)** for PR 19433 at commit [`b7e6e40`](https://github.com/apache/spark/commit/b7e6e40976612546b81d9775c194b274c146dc85).


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83310/
    Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib][WIP] Add local tree training for de...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    After discussion and modifications, I approve this PR overall. Ping @jkbradley, can you take a look now?


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    @WeichenXu123 Thanks for the comments! I'll respond inline:
    
    > In your doc, you said "Specifically, we only need to store sufficient stats for each bin of a single feature, as opposed to each bin of every feature", but the current implementation still allocates space for all features when computing: see the DTStatsAggregator implementation, where you pass featureSubset = None, so DTStatsAggregator allocates space for every feature. To match your stated purpose, you should pass featureSubset = Some(Array(currentFeatureIndex)).
    
    I like your proposed solution (pass `featureSubset = Some(Array(currentFeatureIndex))`). I'll go ahead and implement it.
    
    > The current implementation still uses binnedFeatures. You said it will be improved in the future to sort feature values for continuous features (for more precise tree training); if you want to consider every possible threshold, you need to hold rawFeatures instead of binnedFeatures in the columnar feature array, and within each split range offset you need to sort every continuous feature. Is this what you want to do in the future? This will increase the amount of calculation.
    
    Yep, we'll have to pass raw (`Double`) continuous features to the local tree training methods, which will require them to accept an `Array[LabeledPoint]` instead of an `Array[TreePoint]` as input and will increase memory usage (along with requiring us to store additional indices). 
    
    We'll actually only have to run an `O(n log n)` sort on continuous feature values once (in the `FeatureVector` constructor): once the continuous features are sorted, we can update them during node splits just as we do for categorical features (in `O(n)` time), and they'll remain sorted (see the sketch at the end of this message).
    
    > For the current implementation (using binnedFeatures), there is no need to sort continuous features inside each split offset, so the indices for each feature are exactly the same. To save memory, I think these indices should be shared; there is no need to create a separate indices array for each feature. Even if you add the improvements for continuous features mentioned above, you can create separate indices arrays for only the continuous features; the categorical features can still share the same indices array.
    
    Agreed, I'll make this change.
    
    > About the locality advantage of the columnar format, I have some doubts. In the current implementation you do not reorder the label and weight arrays, so accessing a label or weight value requires going through indices; when calculating DTStats, this breaks locality. (But I'm not sure how much impact on perf this will bring.)
    
    Yeah, I'm not sure whether it'd be better to reorder the labels/weights arrays to achieve improved locality.
    I think we could experiment with both, but I'd prefer to save that for a follow-up PR unless you or another reviewer thinks it'll make a big perf difference.
    
    > About the overhead of the columnar format: when reordering (after we get a new split, we need to move the left sub-tree's samples to the front), you need to reorder each column and update the indices array at the same time. But if we use a row format, like Array[(features, label, weight)], reordering is much easier and needs no indices. So I am considering whether we can use the row format, but, at the time we need the DTStatsAggregator computation, copy the data we need from the row format into a columnar array (we only need to copy rows between the sub-node offsets, and only the sampled features if using feature subsampling).
    
    This is an interesting idea; my main concern is that on the first iteration of local tree training we'd need to copy the entire training data matrix from row to columnar format, which negates any memory savings we get from not using indices. I'm also concerned about the overhead of repeatedly copying data from row to columnar format.
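    Regarding the sort-once point above, here's a self-contained sketch (simplified types: plain arrays rather than the PR's data structures, and `goesLeft` standing in for the split decision) of sorting a continuous column once and then keeping it sorted with a stable `O(n)` partition:

        object SortOnceSketch {
          /** Sort a continuous column once, O(n log n); indices(i) = row index of the ith-smallest value. */
          def sortColumn(values: Array[Double]): (Array[Double], Array[Int]) = {
            val indices = values.indices.toArray.sortBy(i => values(i))
            (indices.map(i => values(i)), indices)
          }

          /**
           * Stable O(n) partition of the sorted range [from, to): rows that split left
           * keep their relative order, as do rows that split right, so both child
           * ranges stay sorted and never need re-sorting.
           */
          def updateForSplit(
              values: Array[Double],
              indices: Array[Int],
              from: Int,
              to: Int,
              goesLeft: Int => Boolean): Unit = {
            val leftV = Array.newBuilder[Double]; val leftI = Array.newBuilder[Int]
            val rightV = Array.newBuilder[Double]; val rightI = Array.newBuilder[Int]
            var i = from
            while (i < to) {
              if (goesLeft(indices(i))) { leftV += values(i); leftI += indices(i) }
              else { rightV += values(i); rightI += indices(i) }
              i += 1
            }
            (leftV.result() ++ rightV.result()).copyToArray(values, from)
            (leftI.result() ++ rightI.result()).copyToArray(indices, from)
          }
        }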



---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r144790803
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/TrainingInfo.scala ---
    @@ -0,0 +1,144 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.ml.tree.{LearningNode, Split}
    +import org.apache.spark.util.collection.BitSet
    +
    +/**
    + * Maintains intermediate state of data (columns) and tree during local tree training.
    + * Primary local tree training data structure; contains all information required to describe
    + * the state of the algorithm at any point during learning.
    + *
    + * Nodes are indexed left-to-right along the periphery of the tree, with 0-based indices.
    + * The "periphery" is the set of leaf nodes (active and inactive).
    + *
    + * @param columns  Array of columns.
    + *                 Each column is sorted first by nodes (left-to-right along the tree periphery);
    + *                 all columns share this first level of sorting.
    + *                 Within each node's group, each column is sorted based on feature value;
    + *                 this second level of sorting differs across columns.
    + * @param instanceWeights Array of weights for each training example
    + * @param nodeOffsets  Offsets into the columns indicating the first level of sorting (by node).
    + *                     The rows corresponding to the node activeNodes(i) are in the range
    + *                     [nodeOffsets(i)(0), nodeOffsets(i)(1)) .
    + * @param activeNodes  Nodes which are active (still being split).
    + *                     Inactive nodes are known to be leaves in the final tree.
    + */
    +private[impl] case class TrainingInfo(
    +    columns: Array[FeatureVector],
    +    instanceWeights: Array[Double],
    +    nodeOffsets: Array[(Int, Int)],
    +    activeNodes: Array[LearningNode]) extends Serializable {
    +
    +  // pre-allocated temporary buffers that we use to sort
    +  // instances in left and right children during update
    +  val tempVals: Array[Int] = new Array[Int](columns(0).values.length)
    +  val tempIndices: Array[Int] = new Array[Int](columns(0).values.length)
    +
    +  /** For debugging */
    +  override def toString: String = {
    +    "PartitionInfo(" +
    +      "  columns: {\n" +
    +      columns.mkString(",\n") +
    +      "  },\n" +
    +      s"  nodeOffsets: ${nodeOffsets.mkString(", ")},\n" +
    +      s"  activeNodes: ${activeNodes.iterator.mkString(", ")},\n" +
    +      ")\n"
    +  }
    +
    +  /**
    +   * Update columns and nodeOffsets for the next level of the tree.
    +   *
    +   * Update columns:
    +   *   For each (previously) active node,
    +   *     Compute bitset indicating whether each training instance under the node splits left/right
    +   *     For each column,
    +   *       Sort corresponding range of instances based on bitset.
    +   * Update nodeOffsets, activeNodes:
    +   *   Split offsets for nodes which split (which can be identified using the bitset).
    +   *
    +   * @return Updated training info
    +   */
    +  def update(splits: Array[Array[Split]], newActiveNodes: Array[LearningNode]): TrainingInfo = {
    +    // Create a buffer for storing our new array of node offsets
    +    val newNodeOffsets = new ArrayBuffer[(Int, Int)]()
    +    // Update (per-node) sorting of each column to account for creation of new nodes
    +    var nodeIdx = 0
    +    while (nodeIdx < activeNodes.length) {
    +      val node = activeNodes(nodeIdx)
    +      // Get new active node offsets from active nodes that were split
    +      if (node.split.isDefined) {
    +        // Get split and FeatureVector corresponding to feature for split
    +        val split = node.split.get
    +        val col = columns(split.featureIndex)
    +        val (from, to) = nodeOffsets(nodeIdx)
    +        // Compute bitset indicating whether each training example splits left/right
    +        val bitset = TrainingInfo.bitSetFromSplit(col, from, to, split, splits)
    +        // Update each column according to the bitset
    +        val numRows = to - from
    +        // Allocate shared temp buffers (shared across all columns) for reordering
    +        // feature values/indices for current node.
    +        val tempVals = new Array[Int](numRows)
    +        val tempIndices = new Array[Int](numRows)
    +        val numLeftRows = numRows - bitset.cardinality()
    +        columns.foreach(_.updateForSplit(from, to, numLeftRows, tempVals, tempIndices, bitset))
    +
    +        // Add new node offsets to array
    +        val leftIndices = (from, from + numLeftRows)
    +        val rightIndices = (from + numLeftRows, to)
    +        newNodeOffsets ++= Array(leftIndices, rightIndices)
    +      }
    +      nodeIdx += 1
    +    }
    +    TrainingInfo(columns, instanceWeights, newNodeOffsets.toArray, newActiveNodes)
    +  }
    +
    +}
    +
    +/** Training-info specific utility methods. */
    +private[impl] object TrainingInfo {
    +  /**
    +   * For a given feature, for a given node, apply a split and return a bitset indicating the
    +   * outcome of the split for each instance at that node.
    +   *
    +   * @param col  Column for feature
    +   * @param from  Start offset in col for the node
    +   * @param to  End offset in col for the node
    +   * @param split  Split to apply to instances at this node.
    +   * @return  Bitset indicating splits for instances at this node.
    +   *          These bits are sorted by the row indices.
    +   *          bitset(i) = true if ith example for current node splits right, false otherwise.
    +   */
    +  private[impl] def bitSetFromSplit(
    +      col: FeatureVector,
    +      from: Int,
    +      to: Int,
    +      split: Split,
    +      allSplits: Array[Array[Split]]): BitSet = {
    --- End diff --
    
    Ditto: you only need to pass in `featureSplits: Array[Split]`; don't pass all splits for all features.
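    A sketch of the narrowed signature (against the quoted diff, not a drop-in patch):

        private[impl] def bitSetFromSplit(
            col: FeatureVector,
            from: Int,
            to: Int,
            split: Split,
            featureSplits: Array[Split]): BitSet = {
          // Look up candidate splits in featureSplits directly instead of
          // indexing allSplits(split.featureIndex).
          ???
        }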


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83353/
    Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82721/
    Test PASSed.


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r147036693
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala ---
    @@ -0,0 +1,250 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree._
    +import org.apache.spark.mllib.tree.model.ImpurityStats
    +
    +/** Object exposing methods for local training of decision trees */
    +private[ml] object LocalDecisionTree {
    +
    +  /**
    +   * Fully splits the passed-in node on the provided local dataset, returning
    +   * an InternalNode/LeafNode corresponding to the root of the resulting tree.
    +   *
    +   * @param node LearningNode to use as the root of the subtree fit on the passed-in dataset
    +   * @param metadata learning and dataset metadata for DecisionTree
    +   * @param splits splits(i) = array of splits for feature i
    +   */
    +  private[ml] def fitNode(
    +      input: Array[TreePoint],
    +      instanceWeights: Array[Double],
    +      node: LearningNode,
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Node = {
    +
    +    // The case with 1 node (depth = 0) is handled separately.
    +    // This allows all iterations in the depth > 0 case to use the same code.
    +    // TODO: Check that learning works when maxDepth > 0 but learning stops at 1 node (because of
    +    //       other parameters).
    +    if (metadata.maxDepth == 0) {
    +      return node.toNode
    +    }
    +
    +    // Prepare column store.
    +    //   Note: rowToColumnStoreDense checks to make sure numRows < Int.MaxValue.
    +    val colStoreInit: Array[Array[Int]] =
    +      LocalDecisionTreeUtils.rowToColumnStoreDense(input.map(_.binnedFeatures))
    +    val labels = input.map(_.label)
    +
    +    // Fit a regression model on the dataset, throwing an error if metadata indicates that
    +    // we should train a classifier.
    +    // TODO: Add support for training classifiers
    +    if (metadata.numClasses > 1 && metadata.numClasses <= 32) {
    +      throw new UnsupportedOperationException("Local training of a decision tree classifier is " +
    +        "unsupported; currently, only regression is supported")
    +    } else {
    +      trainRegressor(node, colStoreInit, instanceWeights, labels, metadata, splits)
    +    }
    +  }
    +
    +  /**
    +   * Locally fits a decision tree regressor.
    +   * TODO(smurching): Logic for fitting a classifier & regressor is the same; only difference
    +   * is impurity metric. Use the same logic for fitting a classifier.
    +   *
    +   * @param rootNode Node to use as root of the tree fit on the passed-in dataset
    +   * @param colStoreInit Array of columns of training data
    +   * @param instanceWeights Array of weights for each training example
    +   * @param metadata learning and dataset metadata for DecisionTree
    +   * @param splits splits(i) = Array of possible splits for feature i
    +   * @return LeafNode or InternalNode representation of rootNode
    +   */
    +  private[ml] def trainRegressor(
    +      rootNode: LearningNode,
    +      colStoreInit: Array[Array[Int]],
    +      instanceWeights: Array[Double],
    +      labels: Array[Double],
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Node = {
    +
    +    // Sort each column by decision tree node.
    +    val colStore: Array[FeatureVector] = colStoreInit.zipWithIndex.map { case (col, featureIndex) =>
    +      val featureArity: Int = metadata.featureArity.getOrElse(featureIndex, 0)
    +      FeatureVector(featureIndex, featureArity, col)
    +    }
    +
    +    val numRows = colStore.headOption match {
    +      case None => 0
    +      case Some(column) => column.values.length
    +    }
    +
    +    // Create a new TrainingInfo describing the status of our partially-trained subtree
    +    // at each iteration of training
    +    var trainingInfo: TrainingInfo = TrainingInfo(colStore, instanceWeights,
    +      nodeOffsets = Array[(Int, Int)]((0, numRows)), activeNodes = Array(rootNode))
    +
    +    // Iteratively learn, one level of the tree at a time.
    +    // Note: We do not use node IDs.
    +    var currentLevel = 0
    +    var doneLearning = false
    +
    +    while (currentLevel < metadata.maxDepth && !doneLearning) {
    +      // Splits each active node if possible, returning an array of new active nodes
    +      val activeNodes: Array[LearningNode] =
    +        computeBestSplits(trainingInfo, labels, metadata, splits)
    +      // Filter active node periphery by impurity.
    +      val estimatedRemainingActive = activeNodes.count(_.stats.impurity > 0.0)
    --- End diff --
    
    Wait... I checked the code here: `trainingInfo = trainingInfo.update(splits, activeNodes)`. It seems you do not filter the leaf nodes out of `activeNodes` (which is actually the `nextLevelNode` I mentioned above), so I think `TrainingInfo.activeNodes` can still contain leaf nodes.
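    A sketch of the fix this implies (assuming the node's `isLeaf` flag is set when its stats are computed):

        // Keep only non-leaf children as the next level's active periphery
        // before updating TrainingInfo.
        val nextActiveNodes = activeNodes.filterNot(_.isLeaf)
        doneLearning = currentLevel + 1 >= metadata.maxDepth || nextActiveNodes.isEmpty
        if (!doneLearning) {
          trainingInfo = trainingInfo.update(splits, nextActiveNodes)
        }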



---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83219/
    Test PASSed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Made a few updates; here's a quick summary and what I'd propose moving forward:
    
    Right now:
    * Shared row indices for all (categorical & continuous) features are stored & updated in `TrainingInfo`
    * `LocalDecisionTree.computeBestSplits` computes best splits/sufficient stats for a single feature at a time
    * A utility method (`LocalDecisionTreeUtils.updateArrayForSplit`) is used to sort both feature values and shared row indices
    
    When we add support for raw continuous feature values:
    * Add a subclass of `FeatureColumn` (e.g. `ContinuousFeatureColumn`) that stores and sorts its own array of row indices, and pass these row indices to methods requiring them (see the sketch below).
    
    I also renamed `FeatureVector` to `FeatureColumn` since the former seemed like it’d confuse developers (`FeatureVector` sounds like a single data point)
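    A rough sketch of what that subclass might look like (hypothetical; `FeatureColumn`'s exact interface is assumed):

        private[impl] class ContinuousFeatureColumn(
            val featureIndex: Int,
            rawValues: Array[Double]) {
          // Unlike categorical columns, which can share one row-index array, a
          // continuous column owns its indices so it can keep its values sorted:
          // indices(i) = row index of the ith-smallest feature value.
          val indices: Array[Int] = rawValues.indices.toArray.sortBy(i => rawValues(i))
          val values: Array[Double] = indices.map(i => rawValues(i))
        }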


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r144789990
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/AggUpdateUtils.scala ---
    @@ -0,0 +1,85 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree.Split
    +
    +/**
    + * Helpers for updating DTStatsAggregators during collection of sufficient stats for tree training.
    + */
    +private[impl] object AggUpdateUtils {
    +
    +  /**
    +   * Updates the parent node stats of the passed-in impurity aggregator with the labels
    +   * corresponding to the feature values at indices [from, to).
    +   */
    +  private[impl] def updateParentImpurity(
    +      statsAggregator: DTStatsAggregator,
    +      col: FeatureVector,
    --- End diff --
    
    Actually, `updateParentImpurity` has no relation to any particular feature column; you pass in the feature column only to use its `indices` array, so passing any feature column would work. But this looks weird; maybe it could be better designed.
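    One possible cleanup (a sketch; the exact aggregator hook is an assumption): have the method depend on the shared row indices directly rather than on an arbitrary feature column:

        private[impl] def updateParentImpurity(
            statsAggregator: DTStatsAggregator,
            indices: Array[Int],
            from: Int,
            to: Int,
            instanceWeights: Array[Double],
            labels: Array[Double]): Unit = {
          var i = from
          while (i < to) {
            val rowIndex = indices(i)
            statsAggregator.updateParent(labels(rowIndex), instanceWeights(rowIndex))
            i += 1
          }
        }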


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r151011879
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala ---
    @@ -112,7 +113,7 @@ private[spark] object ImpurityStats {
        * minimum number of instances per node.
        */
       def getInvalidImpurityStats(impurityCalculator: ImpurityCalculator): ImpurityStats = {
    -    new ImpurityStats(Double.MinValue, impurityCalculator.calculate(),
    +    new ImpurityStats(Double.MinValue, impurity = -1,
    --- End diff --
    
    I changed this to -1 here since node impurity would eventually get set to -1 anyway when `LearningNodes` with invalid `ImpurityStats` were converted into decision tree leaf nodes (see [`LearningNode.toNode`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala#L279))


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83025/
    Test PASSed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    @smurching I found some issues and have some thoughts on the columnar feature format:
    
    - In your doc, you said "Specifically, we only need to store sufficient stats for each bin of a single feature, as opposed to each bin of every feature", but the current implementation still allocates space for all features when computing: see the `DTStatsAggregator` implementation, where you pass `featureSubset = None`, so `DTStatsAggregator` allocates space for every feature. To match your stated purpose, you should pass `featureSubset = Some(Array(currentFeatureIndex))` (see the sketch at the end of this message).
    
    - The current implementation still uses binnedFeatures. You said it will be improved in the future to sort feature values for continuous features (for more precise tree training); if you want to consider every possible threshold, you need to hold rawFeatures instead of binnedFeatures in the columnar feature array, and within each split range offset you need to sort every continuous feature. Is this what you want to do in the future? This will increase the amount of calculation.
    
    - For the current implementation (using binnedFeatures), there is no need to sort continuous features inside each split offset, so the `indices` for each feature are exactly the same. To save memory, I think these `indices` should be shared; there is no need to create a separate indices array for each feature. Even if you add the improvements for continuous features mentioned above, you can create separate `indices` arrays for **only** the continuous features; the categorical features can still share the same `indices` array.
    
    - About the locality advantage of the columnar format, I have some doubts. In the current implementation you do not reorder the `label` and `weight` arrays, so accessing a `label` or `weight` value requires going through `indices`; when calculating `DTStat`, this breaks locality. (But I'm not sure how much impact on perf this will bring.)
    
    - About the overhead of the columnar format: when reordering (after we get a new split, we need to move the left sub-tree's samples to the front), you need to reorder each column and update the `indices` array at the same time. But if we use a row format, like `Array[(features, label, weight)]`, reordering is much easier and needs no indices. So I am considering whether we can use the row format, but, at the time we need the `DTStatsAggregator` computation, copy the data we need from the row format into a columnar array (we only need to copy rows between the sub-node offsets, and only the sampled features if using feature subsampling).
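    For the first point, a sketch of the intended allocation (assuming the `DTStatsAggregator(metadata, featureSubset)` constructor described above):

        // Allocate sufficient-stat bins for only the feature currently being
        // processed, instead of for every feature.
        val agg = new DTStatsAggregator(metadata, Some(Array(currentFeatureIndex)))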



---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #82652 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82652/testReport)** for PR 19433 at commit [`5c29d3d`](https://github.com/apache/spark/commit/5c29d3d1e899c8d311633c4d763b57e42a26c660).
     * This patch **fails SparkR unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r151019591
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -852,6 +662,41 @@ private[spark] object RandomForest extends Logging {
       }
     
       /**
    +   * Find the best split for a node.
    +   *
    +   * @param binAggregates Bin statistics.
    +   * @return tuple for best split: (Split, information gain, prediction at node)
    +   */
    +  private[tree] def binsToBestSplit(
    +      binAggregates: DTStatsAggregator,
    +      splits: Array[Array[Split]],
    +      featuresForNode: Option[Array[Int]],
    +      node: LearningNode): (Split, ImpurityStats) = {
    +    val validFeatureSplits = getNonConstantFeatures(binAggregates.metadata, featuresForNode)
    +    // For each (feature, split), calculate the gain, and select the best (feature, split).
    +    val parentImpurityCalc = if (node.stats == null) None else Some(node.stats.impurityCalculator)
    --- End diff --
    
    I believe so; the nodes at the top level are created ([RandomForest.scala:178](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L178)) with [`LearningNode.emptyNode`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala#L341), which sets `node.stats = null`.
    
    I could change this to check node depth (via node index), but if we're planning on deprecating node indices in the future it might be best not to.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83464 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83464/testReport)** for PR 19433 at commit [`3f72cc0`](https://github.com/apache/spark/commit/3f72cc0d92132f850c5eebe2686473bff111199f).
     * This patch **fails to generate documentation**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82652/
    Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Thanks for the comments!
    
    - Yep, feature subsampling is necessary for using local tree training in distributed training. I was thinking of adding subsampling in a follow-up PR. You're right that we don't need to pass an array of `BaggedPoints` to local tree training; we should just pass an array of `subsampleWeights` (weights for the current tree) and an array of `TreePoints`. I'll push an update for this.
    
    - Agreed that the logic for classification will be the same but with a different impurity metric. I can add support for classification & associated tests in a follow-up PR.
    
    - IMO the primary advantage of the columnar storage format is that it'll eventually enable improvements to best split calculations; specifically, for continuous features we could sort the unbinned feature values and consider every possible threshold. There are also the locality & memory advantages described in the design doc. In brief, `DTStatsAggregator` stores a flat array partitioned by (feature x bin); if we can iterate through all values for a single feature at once, most updates to `DTStatsAggregator` will occur within the same subarray (see the sketch after this list).
    
    - Multithreading could be a nice way to increase parallelism since we don't use Spark during local tree training. I think we could add it in a follow-up PR.
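    To illustrate the locality point above, a simplified stand-alone sketch of the flat (feature x bin) layout (not the real `DTStatsAggregator`, which also handles feature subsets and unordered features):

        class FlatStatsSketch(numFeatures: Int, numBins: Int, statsSize: Int) {
          // One flat array with features laid out side by side:
          // [feature 0: bin 0..B | feature 1: bin 0..B | ...]
          private val stats = new Array[Double](numFeatures * numBins * statsSize)

          private def offset(feature: Int, bin: Int): Int =
            (feature * numBins + bin) * statsSize

          // A pass over all values of a single feature touches only the
          // contiguous subarray [offset(f, 0), offset(f + 1, 0)).
          def update(feature: Int, bin: Int, stat: Int, delta: Double): Unit =
            stats(offset(feature, bin) + stat) += delta
        }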


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r151017375
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/SplitUtils.scala ---
    @@ -0,0 +1,215 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree.{CategoricalSplit, Split}
    +import org.apache.spark.mllib.tree.impurity.ImpurityCalculator
    +import org.apache.spark.mllib.tree.model.ImpurityStats
    +
    +/** Utility methods for choosing splits during local & distributed tree training. */
    +private[impl] object SplitUtils {
    +
    +  /** Sorts ordered feature categories by label centroid, returning an ordered list of categories */
    +  private def sortByCentroid(
    +      binAggregates: DTStatsAggregator,
    +      featureIndex: Int,
    +      featureIndexIdx: Int): List[Int] = {
    +    /* Each bin is one category (feature value).
    +     * The bins are ordered based on centroidForCategories, and this ordering determines which
    +     * splits are considered.  (With K categories, we consider K - 1 possible splits.)
    +     *
    +     * centroidForCategories is a list: (category, centroid)
    +     */
    +    val numCategories = binAggregates.metadata.numBins(featureIndex)
    +    val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
    +
    +    val centroidForCategories = Range(0, numCategories).map { featureValue =>
    +      val categoryStats =
    +        binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue)
    +      val centroid = ImpurityUtils.getCentroid(binAggregates.metadata, categoryStats)
    +      (featureValue, centroid)
    +    }
    +    // TODO(smurching): How to handle logging statements like these?
    +    // logDebug("Centroids for categorical variable: " + centroidForCategories.mkString(","))
    +    // bins sorted by centroids
    +    val categoriesSortedByCentroid = centroidForCategories.toList.sortBy(_._2).map(_._1)
    +    // logDebug("Sorted centroids for categorical variable = " +
    +    //   categoriesSortedByCentroid.mkString(","))
    +    categoriesSortedByCentroid
    +  }
    +
    +  /**
    +   * Find the best split for an unordered categorical feature at a single node.
    +   *
    +   * Algorithm:
    +   *  - Considers all possible subsets (exponentially many)
    +   *
    +   * @param featureIndex  Global index of feature being split.
    +   * @param featureIndexIdx Index of feature being split within subset of features for current node.
    +   * @param featureSplits Array of splits for the current feature
    +   * @param parentCalculator Optional: ImpurityCalculator containing impurity stats for current node
    +   * @return  (best split, statistics for split)  If no valid split was found, the returned
    +   *          ImpurityStats instance will be invalid (have member valid = false).
    +   */
    +  private[impl] def chooseUnorderedCategoricalSplit(
    +      binAggregates: DTStatsAggregator,
    +      featureIndex: Int,
    +      featureIndexIdx: Int,
    +      featureSplits: Array[Split],
    +      parentCalculator: Option[ImpurityCalculator] = None): (Split, ImpurityStats) = {
    +    // Unordered categorical feature
    +    val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
    +    val numSplits = binAggregates.metadata.numSplits(featureIndex)
    +    var parentCalc = parentCalculator
    +    val (bestFeatureSplitIndex, bestFeatureGainStats) =
    +      Range(0, numSplits).map { splitIndex =>
    +        val leftChildStats = binAggregates.getImpurityCalculator(nodeFeatureOffset, splitIndex)
    +        val rightChildStats = binAggregates.getParentImpurityCalculator()
    +          .subtract(leftChildStats)
    +        val gainAndImpurityStats = ImpurityUtils.calculateImpurityStats(parentCalc,
    +          leftChildStats, rightChildStats, binAggregates.metadata)
    +        // Compute parent stats once, when considering first split for current feature
    +        if (parentCalc.isEmpty) {
    +          parentCalc = Some(gainAndImpurityStats.impurityCalculator)
    +        }
    +        (splitIndex, gainAndImpurityStats)
    +      }.maxBy(_._2.gain)
    +    (featureSplits(bestFeatureSplitIndex), bestFeatureGainStats)
    +
    +  }
    +
    +  /**
    +   * Choose splitting rule: feature value <= threshold
    +   *
    +   * @return  (best split, statistics for split)  If the best split actually puts all instances
    +   *          in one leaf node, then it will be set to None.  If no valid split was found, the
    +   *          returned ImpurityStats instance will be invalid (have member valid = false)
    +   */
    +  private[impl] def chooseContinuousSplit(
    +      binAggregates: DTStatsAggregator,
    +      featureIndex: Int,
    +      featureIndexIdx: Int,
    +      featureSplits: Array[Split],
    +      parentCalculator: Option[ImpurityCalculator] = None): (Split, ImpurityStats) = {
    +    // For a continuous feature, bins are already sorted for splitting
    +    // Number of "categories" = number of bins
    +    val sortedCategories = Range(0, binAggregates.metadata.numBins(featureIndex)).toList
    +    // Get & return best split info
    +    val (bestFeatureSplitIndex, bestFeatureGainStats) = orderedSplitHelper(binAggregates,
    +      featureIndex, featureIndexIdx, sortedCategories, parentCalculator)
    +    (featureSplits(bestFeatureSplitIndex), bestFeatureGainStats)
    +  }
    +
    +  /**
    +   * Computes the index of the best split for an ordered feature.
    +   * @param parentCalculator Optional: ImpurityCalculator containing impurity stats for current node
    +   */
    +  private def orderedSplitHelper(
    +      binAggregates: DTStatsAggregator,
    +      featureIndex: Int,
    +      featureIndexIdx: Int,
    +      categoriesSortedByCentroid: List[Int],
    +      parentCalculator: Option[ImpurityCalculator]): (Int, ImpurityStats) = {
    +    // Cumulative sum (scanLeft) of bin statistics.
    +    // Afterwards, binAggregates for a bin is the sum of aggregates for
    +    // that bin + all preceding bins.
    +    assert(!binAggregates.metadata.isUnordered(featureIndex))
    +    val numSplits = binAggregates.metadata.numSplits(featureIndex)
    +    val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
    +    var splitIndex = 0
    +    while (splitIndex < numSplits) {
    +      val currentCategory = categoriesSortedByCentroid(splitIndex)
    +      val nextCategory = categoriesSortedByCentroid(splitIndex + 1)
    +      binAggregates.mergeForFeature(nodeFeatureOffset, nextCategory, currentCategory)
    +      splitIndex += 1
    +    }
    +    // lastCategory = index of bin with total aggregates for this (node, feature)
    +    val lastCategory = categoriesSortedByCentroid.last
    +
    +    // Find best split.
    +    var parentCalc = parentCalculator
    +    Range(0, numSplits).map { splitIndex =>
    +      val featureValue = categoriesSortedByCentroid(splitIndex)
    +      val leftChildStats =
    +        binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue)
    +      val rightChildStats =
    --- End diff --
    
    Exactly: it's the parentCalc minus the left child stats. Since `ImpurityCalculator.subtract()` updates the impurity calculator in place, we call `binAggregates.getParentImpurityCalculator()` to get a copy of the parent impurity calculator, then subtract the left child stats.
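    In code form (completing the truncated diff above, following the same pattern used in the unordered case earlier in this thread):

        // getParentImpurityCalculator() returns a fresh copy of the parent's
        // stats; subtract() then mutates that copy in place, leaving
        // parent - left = right child stats.
        val rightChildStats =
          binAggregates.getParentImpurityCalculator().subtract(leftChildStats)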


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Merged build finished. Test PASSed.


---





[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r146731072
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/FeatureVector.scala ---
    @@ -0,0 +1,148 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.util.collection.BitSet
    +
    +/**
    + * Stores values for a single training data column (a single continuous or categorical feature).
    + *
    + * Values are currently stored in a dense representation only.
    + * TODO: Support sparse storage (to optimize deeper levels of the tree), and maybe compressed
    + *       storage (to optimize upper levels of the tree).
    + *
    + * TODO: Sort feature values to support more complicated splitting logic (e.g. considering every
    + *       possible continuous split instead of discretizing continuous features).
    + *
    + * NOTE: We could add sorting of feature values in this PR; the only change required would be to
    + * sort feature values at construction-time. Sorting might improve locality during stats
    + * aggregation (we'd frequently update the same O(statsSize) array for a (feature, bin),
    + * instead of frequently updating for the same feature).
    + *
    + * @param featureArity  For categorical features, this gives the number of categories.
    + *                      For continuous features, this should be set to 0.
    + * @param rowIndices Optional: rowIndices(i) is the row index of the ith feature value (values(i))
    + *                   If unspecified, feature values are assumed to be ordered by row (i.e. values(i)
    + *                   is a feature value from the ith row).
    + */
    +private[impl] class FeatureVector(
    +    val featureIndex: Int,
    +    val featureArity: Int,
    +    val values: Array[Int],
    +    private val rowIndices: Option[Array[Int]])
    +  extends Serializable {
    +  // Associates feature values with training point rows. indices(i) = training point index
    +  // (row index) of ith feature value
    +  val indices = rowIndices.getOrElse(values.indices.toArray)
    +
    +  def isCategorical: Boolean = featureArity > 0
    +
    +  /** For debugging */
    +  override def toString: String = {
    +    "  FeatureVector(" +
    +      s"    featureIndex: $featureIndex,\n" +
    +      s"    featureType: ${if (featureArity == 0) "Continuous" else "Categorical"},\n" +
    +      s"    featureArity: $featureArity,\n" +
    +      s"    values: ${values.mkString(", ")},\n" +
    +      s"    indices: ${indices.mkString(", ")},\n" +
    +      "  )"
    +  }
    +
    +  def deepCopy(): FeatureVector =
    +    new FeatureVector(featureIndex, featureArity, values.clone(), Some(indices.clone()))
    +
    +  override def equals(other: Any): Boolean = {
    +    other match {
    +      case o: FeatureVector =>
    +        featureIndex == o.featureIndex && featureArity == o.featureArity &&
    +          values.sameElements(o.values) && indices.sameElements(o.indices)
    +      case _ => false
    +    }
    +  }
    +
    +  /**
    +   * Reorders the subset of feature values at indices [from, to) in the passed-in column
    +   * according to the split information encoded in instanceBitVector (feature values for rows
    +   * that split left appear before feature values for rows that split right).
    +   *
    +   * @param numLeftRows Number of rows on the left side of the split
    +   * @param tempVals Destination buffer for reordered feature values
    +   * @param tempIndices Destination buffer for row indices corresponding to reordered feature values
    +   * @param instanceBitVector instanceBitVector(i) = true if the row for the ith feature
    +   *                          value splits right, false otherwise
    +   */
    +  private[ml] def updateForSplit(
    +      from: Int,
    +      to: Int,
    +      numLeftRows: Int,
    +      tempVals: Array[Int],
    +      tempIndices: Array[Int],
    +      instanceBitVector: BitSet): Unit = {
    +
    +    // BEGIN SORTING
    +    // We sort the [from, to) slice of col based on instance bit.
    +    // All instances going "left" in the split (which are false)
    +    // should be ordered before the instances going "right". The instanceBitVector
    +    // gives us the split bit value for each instance based on the instance's index.
    +    // We copy our feature values into @tempVals and @tempIndices either:
    +    // 1) in the [from, numLeftRows) range if the bit is false, or
    +    // 2) in the [numBitsNotSet, to) range if the bit is true.
    --- End diff --
    
    Will change this, thanks for the catch!
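
    For context, the reordering described in the quoted comment is a stable
    partition of the [from, to) slice by split bit. A minimal standalone
    sketch of the copy loop, using the fields documented above (illustrative
    only):

        var leftDest = from                 // next slot for a row going left
        var rightDest = from + numLeftRows  // next slot for a row going right
        var idx = from
        while (idx < to) {
          // The bit vector is indexed relative to the start of the slice.
          val dest = if (instanceBitVector.get(idx - from)) {
            val d = rightDest; rightDest += 1; d
          } else {
            val d = leftDest; leftDest += 1; d
          }
          tempVals(dest) = values(idx)
          tempIndices(dest) = indices(idx)
          idx += 1
        }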


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #82557 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82557/testReport)** for PR 19433 at commit [`9a7174e`](https://github.com/apache/spark/commit/9a7174ed4a62033abfe2325dc1a8c5850e07f5f3).


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r150158027
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -627,221 +621,37 @@ private[spark] object RandomForest extends Logging {
       }
     
       /**
    -   * Calculate the impurity statistics for a given (feature, split) based upon left/right
    -   * aggregates.
    -   *
    -   * @param stats the recycle impurity statistics for this feature's all splits,
    -   *              only 'impurity' and 'impurityCalculator' are valid between each iteration
    -   * @param leftImpurityCalculator left node aggregates for this (feature, split)
    -   * @param rightImpurityCalculator right node aggregate for this (feature, split)
    -   * @param metadata learning and dataset metadata for DecisionTree
    -   * @return Impurity statistics for this (feature, split)
    +   * Return a list of pairs (featureIndexIdx, featureIndex) where featureIndex is the global
    +   * (across all trees) index of a feature and featureIndexIdx is the index of a feature within the
    +   * list of features for a given node. Filters out constant features (features with 0 splits)
        */
    -  private def calculateImpurityStats(
    -      stats: ImpurityStats,
    -      leftImpurityCalculator: ImpurityCalculator,
    -      rightImpurityCalculator: ImpurityCalculator,
    -      metadata: DecisionTreeMetadata): ImpurityStats = {
    -
    -    val parentImpurityCalculator: ImpurityCalculator = if (stats == null) {
    -      leftImpurityCalculator.copy.add(rightImpurityCalculator)
    -    } else {
    -      stats.impurityCalculator
    -    }
    -
    -    val impurity: Double = if (stats == null) {
    -      parentImpurityCalculator.calculate()
    -    } else {
    -      stats.impurity
    -    }
    -
    -    val leftCount = leftImpurityCalculator.count
    -    val rightCount = rightImpurityCalculator.count
    -
    -    val totalCount = leftCount + rightCount
    -
    -    // If left child or right child doesn't satisfy minimum instances per node,
    -    // then this split is invalid, return invalid information gain stats.
    -    if ((leftCount < metadata.minInstancesPerNode) ||
    -      (rightCount < metadata.minInstancesPerNode)) {
    -      return ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator)
    -    }
    -
    -    val leftImpurity = leftImpurityCalculator.calculate() // Note: This equals 0 if count = 0
    -    val rightImpurity = rightImpurityCalculator.calculate()
    -
    -    val leftWeight = leftCount / totalCount.toDouble
    -    val rightWeight = rightCount / totalCount.toDouble
    -
    -    val gain = impurity - leftWeight * leftImpurity - rightWeight * rightImpurity
    -
    -    // if information gain doesn't satisfy minimum information gain,
    -    // then this split is invalid, return invalid information gain stats.
    -    if (gain < metadata.minInfoGain) {
    -      return ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator)
    +  private[impl] def getNonConstantFeatures(
    +      metadata: DecisionTreeMetadata,
    +      featuresForNode: Option[Array[Int]]): Seq[(Int, Int)] = {
    +    Range(0, metadata.numFeaturesPerNode).map { featureIndexIdx =>
    --- End diff --
    
    Was there a reason to remove the use of view and withFilter here?  With the output of this method going through further Seq operations, I would expect the previous implementation to be more efficient.
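
    A toy illustration of the difference (hypothetical names, not the PR's
    code): a strict `map` evaluates every element up front, while a `view`
    defers the work until a downstream operation consumes the elements.

        def expensiveCheck(i: Int): Int = { Thread.sleep(1); i * i } // stand-in
        val strict   = (0 until 100).map(expensiveCheck)      // evaluates all 100 now
        val deferred = (0 until 100).view.map(expensiveCheck) // evaluates on demand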


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r146731101
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/TrainingInfo.scala ---
    @@ -0,0 +1,144 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.ml.tree.{LearningNode, Split}
    +import org.apache.spark.util.collection.BitSet
    +
    +/**
    + * Maintains intermediate state of data (columns) and tree during local tree training.
    + * Primary local tree training data structure; contains all information required to describe
    + * the state of the algorithm at any point during learning.
    + *
    + * Nodes are indexed left-to-right along the periphery of the tree, with 0-based indices.
    + * The "periphery" is the set of leaf nodes (active and inactive).
    + *
    + * @param columns  Array of columns.
    + *                 Each column is sorted first by nodes (left-to-right along the tree periphery);
    + *                 all columns share this first level of sorting.
    + *                 Within each node's group, each column is sorted based on feature value;
    + *                 this second level of sorting differs across columns.
    + * @param instanceWeights Array of weights for each training example
    + * @param nodeOffsets  Offsets into the columns indicating the first level of sorting (by node).
    + *                     The rows corresponding to the node activeNodes(i) are in the range
    + *                     [nodeOffsets(i)(0), nodeOffsets(i)(1)) .
    + * @param activeNodes  Nodes which are active (still being split).
    + *                     Inactive nodes are known to be leaves in the final tree.
    + */
    +private[impl] case class TrainingInfo(
    +    columns: Array[FeatureVector],
    +    instanceWeights: Array[Double],
    --- End diff --
    
    Good call, I'll move `instanceWeights` outside `TrainingInfo`
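
    To make the indexing contract above concrete (hypothetical values, not
    from the PR):

        // Two active nodes: rows 0-2 of every column belong to activeNodes(0)
        // and rows 3-4 to activeNodes(1). Within each range, a column is
        // sorted by its own feature values.
        val nodeOffsets = Array((0, 3), (3, 5))
        val (from, to) = nodeOffsets(1) // row range for activeNodes(1)
        // columns(f).indices(i) recovers the global row index of the feature
        // value columns(f).values(i) for any i in [from, to).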


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #82721 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82721/testReport)** for PR 19433 at commit [`93e17fc`](https://github.com/apache/spark/commit/93e17fc74958d4fa8f3bea38731ecec662e4ca66).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82570/
    Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82717/
    Test PASSed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    jenkins retest this please


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83104/
    Test PASSed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Merged build finished. Test FAILed.


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r144786233
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/FeatureVector.scala ---
    @@ -0,0 +1,148 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.util.collection.BitSet
    +
    +/**
    + * Stores values for a single training data column (a single continuous or categorical feature).
    + *
    + * Values are currently stored in a dense representation only.
    + * TODO: Support sparse storage (to optimize deeper levels of the tree), and maybe compressed
    + *       storage (to optimize upper levels of the tree).
    + *
    + * TODO: Sort feature values to support more complicated splitting logic (e.g. considering every
    + *       possible continuous split instead of discretizing continuous features).
    + *
    + * NOTE: We could add sorting of feature values in this PR; the only change required would be to
    + * sort feature values at construction-time. Sorting might improve locality during stats
    + * aggregation (we'd frequently update the same O(statsSize) array for a (feature, bin),
    + * instead of frequently updating for the same feature).
    + *
    + * @param featureArity  For categorical features, this gives the number of categories.
    + *                      For continuous features, this should be set to 0.
    + * @param rowIndices Optional: rowIndices(i) is the row index of the ith feature value (values(i))
    + *                   If unspecified, feature values are assumed to be ordered by row (i.e. values(i)
    + *                   is a feature value from the ith row).
    + */
    +private[impl] class FeatureVector(
    +    val featureIndex: Int,
    +    val featureArity: Int,
    +    val values: Array[Int],
    +    private val rowIndices: Option[Array[Int]])
    +  extends Serializable {
    +  // Associates feature values with training point rows. indices(i) = training point index
    +  // (row index) of ith feature value
    +  val indices = rowIndices.getOrElse(values.indices.toArray)
    +
    +  def isCategorical: Boolean = featureArity > 0
    +
    +  /** For debugging */
    +  override def toString: String = {
    +    "  FeatureVector(" +
    +      s"    featureIndex: $featureIndex,\n" +
    +      s"    featureType: ${if (featureArity == 0) "Continuous" else "Categorical"},\n" +
    +      s"    featureArity: $featureArity,\n" +
    +      s"    values: ${values.mkString(", ")},\n" +
    +      s"    indices: ${indices.mkString(", ")},\n" +
    +      "  )"
    +  }
    +
    +  def deepCopy(): FeatureVector =
    +    new FeatureVector(featureIndex, featureArity, values.clone(), Some(indices.clone()))
    +
    +  override def equals(other: Any): Boolean = {
    +    other match {
    +      case o: FeatureVector =>
    +        featureIndex == o.featureIndex && featureArity == o.featureArity &&
    +          values.sameElements(o.values) && indices.sameElements(o.indices)
    +      case _ => false
    +    }
    +  }
    +
    +  /**
    +   * Reorders the subset of feature values at indices [from, to) in the passed-in column
    +   * according to the split information encoded in instanceBitVector (feature values for rows
    +   * that split left appear before feature values for rows that split right).
    +   *
    +   * @param numLeftRows Number of rows on the left side of the split
    +   * @param tempVals Destination buffer for reordered feature values
    +   * @param tempIndices Destination buffer for row indices corresponding to reordered feature values
    +   * @param instanceBitVector instanceBitVector(i) = true if the row for the ith feature
    +   *                          value splits right, false otherwise
    +   */
    +  private[ml] def updateForSplit(
    +      from: Int,
    +      to: Int,
    +      numLeftRows: Int,
    +      tempVals: Array[Int],
    +      tempIndices: Array[Int],
    +      instanceBitVector: BitSet): Unit = {
    +
    +    // BEGIN SORTING
    +    // We sort the [from, to) slice of col based on instance bit.
    +    // All instances going "left" in the split (which are false)
    +    // should be ordered before the instances going "right". The instanceBitVector
    +    // gives us the split bit value for each instance based on the instance's index.
    +    // We copy our feature values into @tempVals and @tempIndices either:
    +    // 1) in the [from, numLeftRows) range if the bit is false, or
    +    // 2) in the [numBitsNotSet, to) range if the bit is true.
    --- End diff --
    
    Although `numLeftRows` == `numBitsNotSet`, it is better to use the same name consistently in the doc.


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r146981434
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala ---
    @@ -0,0 +1,250 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree._
    +import org.apache.spark.mllib.tree.model.ImpurityStats
    +
    +/** Object exposing methods for local training of decision trees */
    +private[ml] object LocalDecisionTree {
    +
    +  /**
    +   * Fully splits the passed-in node on the provided local dataset, returning
    +   * an InternalNode/LeafNode corresponding to the root of the resulting tree.
    +   *
    +   * @param node LearningNode to use as the root of the subtree fit on the passed-in dataset
    +   * @param metadata learning and dataset metadata for DecisionTree
    +   * @param splits splits(i) = array of splits for feature i
    +   */
    +  private[ml] def fitNode(
    +      input: Array[TreePoint],
    +      instanceWeights: Array[Double],
    +      node: LearningNode,
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Node = {
    +
    +    // The case with 1 node (depth = 0) is handled separately.
    +    // This allows all iterations in the depth > 0 case to use the same code.
    +    // TODO: Check that learning works when maxDepth > 0 but learning stops at 1 node (because of
    +    //       other parameters).
    +    if (metadata.maxDepth == 0) {
    +      return node.toNode
    +    }
    +
    +    // Prepare column store.
    +    //   Note: rowToColumnStoreDense checks to make sure numRows < Int.MaxValue.
    +    val colStoreInit: Array[Array[Int]] =
    +      LocalDecisionTreeUtils.rowToColumnStoreDense(input.map(_.binnedFeatures))
    +    val labels = input.map(_.label)
    +
    +    // Fit a regression model on the dataset, throwing an error if metadata indicates that
    +    // we should train a classifier.
    +    // TODO: Add support for training classifiers
    +    if (metadata.numClasses > 1 && metadata.numClasses <= 32) {
    +      throw new UnsupportedOperationException("Local training of a decision tree classifier is " +
    +        "unsupported; currently, only regression is supported")
    +    } else {
    +      trainRegressor(node, colStoreInit, instanceWeights, labels, metadata, splits)
    +    }
    +  }
    +
    +  /**
    +   * Locally fits a decision tree regressor.
    +   * TODO(smurching): Logic for fitting a classifier & regressor is the same; only difference
    +   * is impurity metric. Use the same logic for fitting a classifier.
    +   *
    +   * @param rootNode Node to use as root of the tree fit on the passed-in dataset
    +   * @param colStoreInit Array of columns of training data
    +   * @param instanceWeights Array of weights for each training example
    +   * @param metadata learning and dataset metadata for DecisionTree
    +   * @param splits splits(i) = Array of possible splits for feature i
    +   * @return LeafNode or InternalNode representation of rootNode
    +   */
    +  private[ml] def trainRegressor(
    +      rootNode: LearningNode,
    +      colStoreInit: Array[Array[Int]],
    +      instanceWeights: Array[Double],
    +      labels: Array[Double],
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Node = {
    +
    +    // Sort each column by decision tree node.
    +    val colStore: Array[FeatureVector] = colStoreInit.zipWithIndex.map { case (col, featureIndex) =>
    +      val featureArity: Int = metadata.featureArity.getOrElse(featureIndex, 0)
    +      FeatureVector(featureIndex, featureArity, col)
    +    }
    +
    +    val numRows = colStore.headOption match {
    +      case None => 0
    +      case Some(column) => column.values.length
    +    }
    +
    +    // Create a new PartitionInfo describing the status of our partially-trained subtree
    +    // at each iteration of training
    +    var trainingInfo: TrainingInfo = TrainingInfo(colStore, instanceWeights,
    +      nodeOffsets = Array[(Int, Int)]((0, numRows)), activeNodes = Array(rootNode))
    +
    +    // Iteratively learn, one level of the tree at a time.
    +    // Note: We do not use node IDs.
    +    var currentLevel = 0
    +    var doneLearning = false
    +
    +    while (currentLevel < metadata.maxDepth && !doneLearning) {
    +      // Splits each active node if possible, returning an array of new active nodes
    +      val activeNodes: Array[LearningNode] =
    +        computeBestSplits(trainingInfo, labels, metadata, splits)
    +      // Filter active node periphery by impurity.
    +      val estimatedRemainingActive = activeNodes.count(_.stats.impurity > 0.0)
    --- End diff --
    
    Gotcha, agreed on the naming change. How about `currentLevelActiveNodes`, since only the non-leaf nodes from the current level are included?


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83025 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83025/testReport)** for PR 19433 at commit [`fd6cdbb`](https://github.com/apache/spark/commit/fd6cdbb66fd8beae944d0bd2029e8d62096747a8).


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #97977 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97977/testReport)** for PR 19433 at commit [`d86dd18`](https://github.com/apache/spark/commit/d86dd18e47451c2e4463c68db441f92a898ac765).
     * This patch **fails to generate documentation**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r144787368
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala ---
    @@ -0,0 +1,250 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree._
    +import org.apache.spark.mllib.tree.model.ImpurityStats
    +
    +/** Object exposing methods for local training of decision trees */
    +private[ml] object LocalDecisionTree {
    +
    +  /**
    +   * Fully splits the passed-in node on the provided local dataset, returning
    +   * an InternalNode/LeafNode corresponding to the root of the resulting tree.
    +   *
    +   * @param node LearningNode to use as the root of the subtree fit on the passed-in dataset
    +   * @param metadata learning and dataset metadata for DecisionTree
    +   * @param splits splits(i) = array of splits for feature i
    +   */
    +  private[ml] def fitNode(
    +      input: Array[TreePoint],
    +      instanceWeights: Array[Double],
    +      node: LearningNode,
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Node = {
    +
    +    // The case with 1 node (depth = 0) is handled separately.
    +    // This allows all iterations in the depth > 0 case to use the same code.
    +    // TODO: Check that learning works when maxDepth > 0 but learning stops at 1 node (because of
    +    //       other parameters).
    +    if (metadata.maxDepth == 0) {
    +      return node.toNode
    +    }
    +
    +    // Prepare column store.
    +    //   Note: rowToColumnStoreDense checks to make sure numRows < Int.MaxValue.
    +    val colStoreInit: Array[Array[Int]] =
    +      LocalDecisionTreeUtils.rowToColumnStoreDense(input.map(_.binnedFeatures))
    +    val labels = input.map(_.label)
    +
    +    // Fit a regression model on the dataset, throwing an error if metadata indicates that
    +    // we should train a classifier.
    +    // TODO: Add support for training classifiers
    +    if (metadata.numClasses > 1 && metadata.numClasses <= 32) {
    +      throw new UnsupportedOperationException("Local training of a decision tree classifier is " +
    +        "unsupported; currently, only regression is supported")
    +    } else {
    +      trainRegressor(node, colStoreInit, instanceWeights, labels, metadata, splits)
    +    }
    +  }
    +
    +  /**
    +   * Locally fits a decision tree regressor.
    +   * TODO(smurching): Logic for fitting a classifier & regressor is the same; only difference
    +   * is impurity metric. Use the same logic for fitting a classifier.
    +   *
    +   * @param rootNode Node to use as root of the tree fit on the passed-in dataset
    +   * @param colStoreInit Array of columns of training data
    +   * @param instanceWeights Array of weights for each training example
    +   * @param metadata learning and dataset metadata for DecisionTree
    +   * @param splits splits(i) = Array of possible splits for feature i
    +   * @return LeafNode or InternalNode representation of rootNode
    +   */
    +  private[ml] def trainRegressor(
    +      rootNode: LearningNode,
    +      colStoreInit: Array[Array[Int]],
    +      instanceWeights: Array[Double],
    +      labels: Array[Double],
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Node = {
    +
    +    // Sort each column by decision tree node.
    +    val colStore: Array[FeatureVector] = colStoreInit.zipWithIndex.map { case (col, featureIndex) =>
    +      val featureArity: Int = metadata.featureArity.getOrElse(featureIndex, 0)
    +      FeatureVector(featureIndex, featureArity, col)
    +    }
    +
    +    val numRows = colStore.headOption match {
    +      case None => 0
    +      case Some(column) => column.values.length
    +    }
    +
    +    // Create a new PartitionInfo describing the status of our partially-trained subtree
    +    // at each iteration of training
    +    var trainingInfo: TrainingInfo = TrainingInfo(colStore, instanceWeights,
    +      nodeOffsets = Array[(Int, Int)]((0, numRows)), activeNodes = Array(rootNode))
    +
    +    // Iteratively learn, one level of the tree at a time.
    +    // Note: We do not use node IDs.
    +    var currentLevel = 0
    +    var doneLearning = false
    +
    +    while (currentLevel < metadata.maxDepth && !doneLearning) {
    +      // Splits each active node if possible, returning an array of new active nodes
    +      val activeNodes: Array[LearningNode] =
    +        computeBestSplits(trainingInfo, labels, metadata, splits)
    +      // Filter active node periphery by impurity.
    +      val estimatedRemainingActive = activeNodes.count(_.stats.impurity > 0.0)
    +      // TODO: Check to make sure we split something, and stop otherwise.
    +      doneLearning = currentLevel + 1 >= metadata.maxDepth || estimatedRemainingActive == 0
    +      if (!doneLearning) {
    +        // Obtain a new trainingInfo instance describing our current training status
    +        trainingInfo = trainingInfo.update(splits, activeNodes)
    +      }
    +      currentLevel += 1
    +    }
    +
    +    // Done with learning
    +    rootNode.toNode
    +  }
    +
    +  /**
    +   * Iterate over feature values and labels for a specific (node, feature), updating stats
    +   * aggregator for the current node.
    +   */
    +  private[impl] def updateAggregator(
    +      statsAggregator: DTStatsAggregator,
    +      col: FeatureVector,
    +      instanceWeights: Array[Double],
    +      labels: Array[Double],
    +      from: Int,
    +      to: Int,
    +      featureIndexIdx: Int,
    +      splits: Array[Array[Split]]): Unit = {
    +    val metadata = statsAggregator.metadata
    +    if (metadata.isUnordered(col.featureIndex)) {
    +      from.until(to).foreach { idx =>
    +        val rowIndex = col.indices(idx)
    +        AggUpdateUtils.updateUnorderedFeature(statsAggregator, col.values(idx), labels(rowIndex),
    +          featureIndex = col.featureIndex, featureIndexIdx, splits,
    +          instanceWeight = instanceWeights(rowIndex))
    +      }
    +    } else {
    +      from.until(to).foreach { idx =>
    +        val rowIndex = col.indices(idx)
    +        AggUpdateUtils.updateOrderedFeature(statsAggregator, col.values(idx), labels(rowIndex),
    +          featureIndex = col.featureIndex, featureIndexIdx,
    +          instanceWeight = instanceWeights(rowIndex))
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Find the best splits for all active nodes
    +   *
    +   * @param trainingInfo Contains node offset info for current set of active nodes
    +   * @return  Array of new active nodes formed by splitting the current set of active nodes.
    +   */
    +  private def computeBestSplits(
    +      trainingInfo: TrainingInfo,
    +      labels: Array[Double],
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]) = {
    +    // For each node, select the best split across all features
    +    trainingInfo match {
    +      case TrainingInfo(columns: Array[FeatureVector], instanceWeights: Array[Double],
    +      nodeOffsets: Array[(Int, Int)], activeNodes: Array[LearningNode]) => {
    +        // Filter out leaf nodes from the previous iteration
    +        val activeNonLeafs = activeNodes.zipWithIndex.filterNot(_._1.isLeaf)
    +        // Iterate over the active nodes in the current level.
    +        activeNonLeafs.flatMap { case (node: LearningNode, nodeIndex: Int) =>
    --- End diff --
    
    I think the var names `activeNodes` and `activeNonLeafs` are not accurate.
    Here the `activeNodes` are actually the "next level nodes", including both
    "probably splittable nodes" (active nodes) and "leaf nodes".


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83503/
    Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83219 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83219/testReport)** for PR 19433 at commit [`7efb1e0`](https://github.com/apache/spark/commit/7efb1e0b551235036d7e24dfe16f6fb9e517e503).


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #82570 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82570/testReport)** for PR 19433 at commit [`abc86b2`](https://github.com/apache/spark/commit/abc86b2042e0fd42cc0e9fe20cf79967b16e9779).


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib][WIP] Add local tree training for de...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    @WeichenXu123 would you be able to take an initial look at this?


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83104 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83104/testReport)** for PR 19433 at commit [`9cc6333`](https://github.com/apache/spark/commit/9cc63334900d9928f58e3bab5bd1be452d042a53).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    I made a rough pass. I have only a few issues for now; I haven't gone into the code details:
    
    - `colStoreInit` currently ignores the `subsampleWeights`; they should be used, shouldn't they? From your doc, at the higher level the local training will be used to train sub-trees as part of the global distributed training, so `subsampleWeights` is important info. Also, since local training here fits only a single tree, `subsampleWeights` contains just one element; do we still need the `BaggedPoint` structure?
    
    - The training logic for regression and for classification should be the same, I think; only the impurity measure differs, and that does not affect the code logic.
    
    - The key idea is to use a columnar storage format for the features. Is the purpose to improve memory cost and cache locality when finding the best splits (see the sketch after this list)? I see the code does some reordering of feature values and uses indices, but I haven't gone into the details; it's a complex part and I need more time to review it.
    
    - Maybe we can support multithreading in local training; what do you think?
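
    A toy sketch of the locality argument in the third point above
    (illustrative only, not the PR's code): with columnar storage, the
    per-feature stats pass scans one contiguous array.

        def weightedSum(featureCol: Array[Int], weights: Array[Double]): Double = {
          var acc = 0.0
          var i = 0
          while (i < featureCol.length) {
            acc += featureCol(i) * weights(i) // sequential, cache-friendly scan
            i += 1
          }
          acc
        }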
    



---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib][WIP] Add local tree training for de...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    @smurching Is this still WIP? If it's done, remove "[WIP]" and I will begin the review, thanks!


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib][WIP] Add local tree training...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r143398990
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTreeUtils.scala ---
    @@ -0,0 +1,59 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.internal.Logging
    +
    +/**
    + * Utility methods specific to local decision tree training.
    + */
    +private[ml] object LocalDecisionTreeUtils extends Logging {
    +
    +  /**
    +   * Convert a dataset of binned feature values from row storage to column storage.
    +   * Stores data as [[org.apache.spark.ml.linalg.DenseVector]].
    +   *
    +   *
    +   * @param rowStore  An array of input data rows, each represented as an
    +   *                  int array of binned feature values
    +   * @return Transpose of rowStore as an array of columns consisting of binned feature values.
    +   *
    +   * TODO: Add implementation for sparse data.
    +   *       For sparse data, distribute more evenly based on number of non-zeros.
    +   *       (First collect stats to decide how to partition.)
    +   */
    +  private[impl] def rowToColumnStoreDense(rowStore: Array[Array[Int]]): Array[Array[Int]] = {
    +    // Compute the number of rows in the data
    +    val numRows = {
    +      val longNumRows: Long = rowStore.length
    +      require(longNumRows < Int.MaxValue, s"rowToColumnStore given RDD with $longNumRows rows," +
    +        s" but can handle at most ${Int.MaxValue} rows")
    +      longNumRows.toInt
    +    }
    +
    +    // Check that the input dataset isn't empty (0 rows) or featureless (rows with 0 features)
    +    require(numRows > 0, "Local decision tree training requires numRows > 0.")
    +    val numFeatures = rowStore(0).length
    +    require(numFeatures > 0, "Local decision tree training requires numFeatures > 0.")
    +    // Return the transpose of the rowStore matrix
    +    0.until(numFeatures).map { colIdx =>
    --- End diff --
    
    TODO: replace this with `rowStore.transpose`, which is more memory efficient (iterates over each row once, allowing for rows of the original matrix to be GC'd during the transpose operation).
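
    In sketch form, assuming rectangular input (which the `require` checks
    above guarantee):

        // Scala's built-in transpose walks each row once, whereas the
        // column-major loop above revisits every row numFeatures times.
        val colStore: Array[Array[Int]] = rowStore.transpose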



---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83873/
    Test FAILed.


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r151011913
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -627,221 +621,37 @@ private[spark] object RandomForest extends Logging {
       }
     
       /**
    -   * Calculate the impurity statistics for a given (feature, split) based upon left/right
    -   * aggregates.
    -   *
    -   * @param stats the recycle impurity statistics for this feature's all splits,
    -   *              only 'impurity' and 'impurityCalculator' are valid between each iteration
    -   * @param leftImpurityCalculator left node aggregates for this (feature, split)
    -   * @param rightImpurityCalculator right node aggregate for this (feature, split)
    -   * @param metadata learning and dataset metadata for DecisionTree
    -   * @return Impurity statistics for this (feature, split)
    +   * Return a list of pairs (featureIndexIdx, featureIndex) where featureIndex is the global
    +   * (across all trees) index of a feature and featureIndexIdx is the index of a feature within the
    +   * list of features for a given node. Filters out constant features (features with 0 splits)
        */
    -  private def calculateImpurityStats(
    -      stats: ImpurityStats,
    -      leftImpurityCalculator: ImpurityCalculator,
    -      rightImpurityCalculator: ImpurityCalculator,
    -      metadata: DecisionTreeMetadata): ImpurityStats = {
    -
    -    val parentImpurityCalculator: ImpurityCalculator = if (stats == null) {
    -      leftImpurityCalculator.copy.add(rightImpurityCalculator)
    -    } else {
    -      stats.impurityCalculator
    -    }
    -
    -    val impurity: Double = if (stats == null) {
    -      parentImpurityCalculator.calculate()
    -    } else {
    -      stats.impurity
    -    }
    -
    -    val leftCount = leftImpurityCalculator.count
    -    val rightCount = rightImpurityCalculator.count
    -
    -    val totalCount = leftCount + rightCount
    -
    -    // If left child or right child doesn't satisfy minimum instances per node,
    -    // then this split is invalid, return invalid information gain stats.
    -    if ((leftCount < metadata.minInstancesPerNode) ||
    -      (rightCount < metadata.minInstancesPerNode)) {
    -      return ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator)
    -    }
    -
    -    val leftImpurity = leftImpurityCalculator.calculate() // Note: This equals 0 if count = 0
    -    val rightImpurity = rightImpurityCalculator.calculate()
    -
    -    val leftWeight = leftCount / totalCount.toDouble
    -    val rightWeight = rightCount / totalCount.toDouble
    -
    -    val gain = impurity - leftWeight * leftImpurity - rightWeight * rightImpurity
    -
    -    // if information gain doesn't satisfy minimum information gain,
    -    // then this split is invalid, return invalid information gain stats.
    -    if (gain < metadata.minInfoGain) {
    -      return ImpurityStats.getInvalidImpurityStats(parentImpurityCalculator)
    +  private[impl] def getNonConstantFeatures(
    +      metadata: DecisionTreeMetadata,
    +      featuresForNode: Option[Array[Int]]): Seq[(Int, Int)] = {
    +    Range(0, metadata.numFeaturesPerNode).map { featureIndexIdx =>
    --- End diff --
    
    At some point during the refactoring I was hitting errors caused by a stateful operation within a `map` over the output of this method (IIRC the result of the `map` was traversed repeatedly, causing the stateful operation to inadvertently run multiple times).
    
    However, using `withFilter` and `view` now seems to work, so I'll change it back :)
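
    For reference, a minimal sketch (hypothetical, not code from this PR) of the re-evaluation gotcha described above: a side-effecting `map` over a lazy `view` re-runs its body every time the result is traversed, while forcing the result once avoids the repeated evaluation.

    ```scala
    object ViewSideEffectDemo extends App {
      var evaluations = 0 // the "stateful operation" hidden inside the map

      // Lazy: the map body has not run yet.
      val lazyMapped = (0 until 3).view.map { i => evaluations += 1; i * 2 }
      lazyMapped.sum // traversal #1 runs the body 3 times
      lazyMapped.sum // traversal #2 runs it 3 more times
      println(evaluations) // 6 -- the stateful op ran once per element, per traversal

      // Forcing the view once evaluates the body exactly once per element.
      evaluations = 0
      val forced = (0 until 3).view.map { i => evaluations += 1; i * 2 }.toIndexedSeq
      forced.sum
      forced.sum
      println(evaluations) // 3
    }
    ```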


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97977/
    Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    jenkins retest this please


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83873/testReport)** for PR 19433 at commit [`0b27c56`](https://github.com/apache/spark/commit/0b27c56d1ea4e1108a62b77e9eca8ae160740756).


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r150309552
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/SplitUtils.scala ---
    @@ -0,0 +1,215 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree.{CategoricalSplit, Split}
    +import org.apache.spark.mllib.tree.impurity.ImpurityCalculator
    +import org.apache.spark.mllib.tree.model.ImpurityStats
    +
    +/** Utility methods for choosing splits during local & distributed tree training. */
    +private[impl] object SplitUtils {
    +
    +  /** Sorts ordered feature categories by label centroid, returning an ordered list of categories */
    +  private def sortByCentroid(
    +      binAggregates: DTStatsAggregator,
    +      featureIndex: Int,
    +      featureIndexIdx: Int): List[Int] = {
    +    /* Each bin is one category (feature value).
    +     * The bins are ordered based on centroidForCategories, and this ordering determines which
    +     * splits are considered.  (With K categories, we consider K - 1 possible splits.)
    +     *
    +     * centroidForCategories is a list: (category, centroid)
    +     */
    +    val numCategories = binAggregates.metadata.numBins(featureIndex)
    +    val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
    +
    +    val centroidForCategories = Range(0, numCategories).map { featureValue =>
    +      val categoryStats =
    +        binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue)
    +      val centroid = ImpurityUtils.getCentroid(binAggregates.metadata, categoryStats)
    +      (featureValue, centroid)
    +    }
    +    // TODO(smurching): How to handle logging statements like these?
    --- End diff --
    
    What's the issue? You should be able to call `logDebug` if this object inherits from org.apache.spark.internal.Logging.
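
    For example, a minimal sketch (with a simplified signature, for illustration only; this compiles only inside the Spark source tree, since `Logging` is Spark-internal):

    ```scala
    package org.apache.spark.ml.tree.impl

    import org.apache.spark.internal.Logging

    // Mixing in Logging makes logDebug/logInfo/logWarning available directly.
    private[impl] object SplitUtils extends Logging {
      private def sortByCentroid(centroidForCategories: Seq[(Int, Double)]): List[Int] = {
        logDebug("Centroids for categorical variable: " + centroidForCategories.mkString(","))
        val categoriesSortedByCentroid = centroidForCategories.toList.sortBy(_._2).map(_._1)
        logDebug("Sorted centroids for categorical variable = " +
          categoriesSortedByCentroid.mkString(","))
        categoriesSortedByCentroid
      }
    }
    ```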


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r150349566
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -852,6 +662,41 @@ private[spark] object RandomForest extends Logging {
       }
     
       /**
    +   * Find the best split for a node.
    +   *
    +   * @param binAggregates Bin statistics.
    +   * @return tuple for best split: (Split, information gain, prediction at node)
    +   */
    +  private[tree] def binsToBestSplit(
    +      binAggregates: DTStatsAggregator,
    +      splits: Array[Array[Split]],
    +      featuresForNode: Option[Array[Int]],
    +      node: LearningNode): (Split, ImpurityStats) = {
    +    val validFeatureSplits = getNonConstantFeatures(binAggregates.metadata, featuresForNode)
    +    // For each (feature, split), calculate the gain, and select the best (feature, split).
    +    val parentImpurityCalc = if (node.stats == null) None else Some(node.stats.impurityCalculator)
    --- End diff --
    
    Note to check: is `node.stats` guaranteed to be null for the top-level node?


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r150159513
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/SplitUtils.scala ---
    @@ -0,0 +1,215 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree.{CategoricalSplit, Split}
    +import org.apache.spark.mllib.tree.impurity.ImpurityCalculator
    +import org.apache.spark.mllib.tree.model.ImpurityStats
    +
    +/** Utility methods for choosing splits during local & distributed tree training. */
    +private[impl] object SplitUtils {
    +
    +  /** Sorts ordered feature categories by label centroid, returning an ordered list of categories */
    +  private def sortByCentroid(
    +      binAggregates: DTStatsAggregator,
    +      featureIndex: Int,
    +      featureIndexIdx: Int): List[Int] = {
    +    /* Each bin is one category (feature value).
    +     * The bins are ordered based on centroidForCategories, and this ordering determines which
    +     * splits are considered.  (With K categories, we consider K - 1 possible splits.)
    +     *
    +     * centroidForCategories is a list: (category, centroid)
    +     */
    +    val numCategories = binAggregates.metadata.numBins(featureIndex)
    +    val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
    +
    +    val centroidForCategories = Range(0, numCategories).map { featureValue =>
    +      val categoryStats =
    +        binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue)
    +      val centroid = ImpurityUtils.getCentroid(binAggregates.metadata, categoryStats)
    +      (featureValue, centroid)
    +    }
    +    // TODO(smurching): How to handle logging statements like these?
    +    // logDebug("Centroids for categorical variable: " + centroidForCategories.mkString(","))
    +    // bins sorted by centroids
    +    val categoriesSortedByCentroid = centroidForCategories.toList.sortBy(_._2).map(_._1)
    +    // logDebug("Sorted centroids for categorical variable = " +
    +    //   categoriesSortedByCentroid.mkString(","))
    +    categoriesSortedByCentroid
    +  }
    +
    +  /**
    +   * Find the best split for an unordered categorical feature at a single node.
    +   *
    +   * Algorithm:
    +   *  - Considers all possible subsets (exponentially many)
    +   *
    +   * @param featureIndex  Global index of feature being split.
    +   * @param featureIndexIdx Index of feature being split within subset of features for current node.
    +   * @param featureSplits Array of splits for the current feature
    +   * @param parentCalculator Optional: ImpurityCalculator containing impurity stats for current node
    +   * @return  (best split, statistics for split)  If no valid split was found, the returned
    +   *          ImpurityStats instance will be invalid (have member valid = false).
    +   */
    +  private[impl] def chooseUnorderedCategoricalSplit(
    +      binAggregates: DTStatsAggregator,
    +      featureIndex: Int,
    +      featureIndexIdx: Int,
    +      featureSplits: Array[Split],
    +      parentCalculator: Option[ImpurityCalculator] = None): (Split, ImpurityStats) = {
    +    // Unordered categorical feature
    +    val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
    +    val numSplits = binAggregates.metadata.numSplits(featureIndex)
    +    var parentCalc = parentCalculator
    +    val (bestFeatureSplitIndex, bestFeatureGainStats) =
    +      Range(0, numSplits).map { splitIndex =>
    +        val leftChildStats = binAggregates.getImpurityCalculator(nodeFeatureOffset, splitIndex)
    +        val rightChildStats = binAggregates.getParentImpurityCalculator()
    +          .subtract(leftChildStats)
    +        val gainAndImpurityStats = ImpurityUtils.calculateImpurityStats(parentCalc,
    +          leftChildStats, rightChildStats, binAggregates.metadata)
    +        // Compute parent stats once, when considering first split for current feature
    +        if (parentCalc.isEmpty) {
    +          parentCalc = Some(gainAndImpurityStats.impurityCalculator)
    +        }
    +        (splitIndex, gainAndImpurityStats)
    +      }.maxBy(_._2.gain)
    +    (featureSplits(bestFeatureSplitIndex), bestFeatureGainStats)
    +
    +  }
    +
    +  /**
    +   * Choose splitting rule: feature value <= threshold
    +   *
    +   * @return  (best split, statistics for split)  If the best split actually puts all instances
    +   *          in one leaf node, then it will be set to None.  If no valid split was found, the
    +   *          returned ImpurityStats instance will be invalid (have member valid = false)
    +   */
    +  private[impl] def chooseContinuousSplit(
    +      binAggregates: DTStatsAggregator,
    +      featureIndex: Int,
    +      featureIndexIdx: Int,
    +      featureSplits: Array[Split],
    +      parentCalculator: Option[ImpurityCalculator] = None): (Split, ImpurityStats) = {
    +    // For a continuous feature, bins are already sorted for splitting
    +    // Number of "categories" = number of bins
    +    val sortedCategories = Range(0, binAggregates.metadata.numBins(featureIndex)).toList
    +    // Get & return best split info
    +    val (bestFeatureSplitIndex, bestFeatureGainStats) = orderedSplitHelper(binAggregates,
    +      featureIndex, featureIndexIdx, sortedCategories, parentCalculator)
    +    (featureSplits(bestFeatureSplitIndex), bestFeatureGainStats)
    +  }
    +
    +  /**
    +   * Computes the index of the best split for an ordered feature.
    +   * @param parentCalculator Optional: ImpurityCalculator containing impurity stats for current node
    +   */
    +  private def orderedSplitHelper(
    +      binAggregates: DTStatsAggregator,
    +      featureIndex: Int,
    +      featureIndexIdx: Int,
    +      categoriesSortedByCentroid: List[Int],
    +      parentCalculator: Option[ImpurityCalculator]): (Int, ImpurityStats) = {
    +    // Cumulative sum (scanLeft) of bin statistics.
    +    // Afterwards, binAggregates for a bin is the sum of aggregates for
    +    // that bin + all preceding bins.
    +    assert(!binAggregates.metadata.isUnordered(featureIndex))
    +    val numSplits = binAggregates.metadata.numSplits(featureIndex)
    +    val nodeFeatureOffset = binAggregates.getFeatureOffset(featureIndexIdx)
    +    var splitIndex = 0
    +    while (splitIndex < numSplits) {
    +      val currentCategory = categoriesSortedByCentroid(splitIndex)
    +      val nextCategory = categoriesSortedByCentroid(splitIndex + 1)
    +      binAggregates.mergeForFeature(nodeFeatureOffset, nextCategory, currentCategory)
    +      splitIndex += 1
    +    }
    +    // lastCategory = index of bin with total aggregates for this (node, feature)
    +    val lastCategory = categoriesSortedByCentroid.last
    +
    +    // Find best split.
    +    var parentCalc = parentCalculator
    +    Range(0, numSplits).map { splitIndex =>
    +      val featureValue = categoriesSortedByCentroid(splitIndex)
    +      val leftChildStats =
    +        binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue)
    +      val rightChildStats =
    --- End diff --
    
    This line can be moved outside of the map. Actually, this is the parentCalc, right? So if parentCalc isn't passed in, it can be computed beforehand, outside of the map.
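
    A sketch of that restructuring (hypothetical, reusing the names from the diff above; after the cumulative-sum pass, the bin for `lastCategory` holds the total stats for this (node, feature), i.e. the parent stats):

    ```scala
    // Fetch (or reuse) the parent stats once, before scanning splits.
    val parentCalc = parentCalculator.getOrElse(
      binAggregates.getImpurityCalculator(nodeFeatureOffset, lastCategory))

    val (bestFeatureSplitIndex, bestFeatureGainStats) =
      Range(0, numSplits).map { splitIndex =>
        val featureValue = categoriesSortedByCentroid(splitIndex)
        val leftChildStats = binAggregates.getImpurityCalculator(nodeFeatureOffset, featureValue)
        // Right child = parent - left child; copy first so parentCalc is not mutated.
        val rightChildStats = parentCalc.copy.subtract(leftChildStats)
        val gainAndImpurityStats = ImpurityUtils.calculateImpurityStats(
          Some(parentCalc), leftChildStats, rightChildStats, binAggregates.metadata)
        (splitIndex, gainAndImpurityStats)
      }.maxBy(_._2.gain)
    ```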


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r144788264
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala ---
    @@ -0,0 +1,250 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree._
    +import org.apache.spark.mllib.tree.model.ImpurityStats
    +
    +/** Object exposing methods for local training of decision trees */
    +private[ml] object LocalDecisionTree {
    +
    +  /**
    +   * Fully splits the passed-in node on the provided local dataset, returning
    +   * an InternalNode/LeafNode corresponding to the root of the resulting tree.
    +   *
    +   * @param node LearningNode to use as the root of the subtree fit on the passed-in dataset
    +   * @param metadata learning and dataset metadata for DecisionTree
    +   * @param splits splits(i) = array of splits for feature i
    +   */
    +  private[ml] def fitNode(
    +      input: Array[TreePoint],
    +      instanceWeights: Array[Double],
    +      node: LearningNode,
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Node = {
    +
    +    // The case with 1 node (depth = 0) is handled separately.
    +    // This allows all iterations in the depth > 0 case to use the same code.
    +    // TODO: Check that learning works when maxDepth > 0 but learning stops at 1 node (because of
    +    //       other parameters).
    +    if (metadata.maxDepth == 0) {
    +      return node.toNode
    +    }
    +
    +    // Prepare column store.
    +    //   Note: rowToColumnStoreDense checks to make sure numRows < Int.MaxValue.
    +    val colStoreInit: Array[Array[Int]]
    +    = LocalDecisionTreeUtils.rowToColumnStoreDense(input.map(_.binnedFeatures))
    +    val labels = input.map(_.label)
    +
    +    // Fit a regression model on the dataset, throwing an error if metadata indicates that
    +    // we should train a classifier.
    +    // TODO: Add support for training classifiers
    +    if (metadata.numClasses > 1 && metadata.numClasses <= 32) {
    +      throw new UnsupportedOperationException("Local training of a decision tree classifier is " +
    +        "unsupported; currently, only regression is supported")
    +    } else {
    +      trainRegressor(node, colStoreInit, instanceWeights, labels, metadata, splits)
    +    }
    +  }
    +
    +  /**
    +   * Locally fits a decision tree regressor.
    +   * TODO(smurching): Logic for fitting a classifier & regressor is the same; only difference
    +   * is impurity metric. Use the same logic for fitting a classifier.
    +   *
    +   * @param rootNode Node to use as root of the tree fit on the passed-in dataset
    +   * @param colStoreInit Array of columns of training data
    +   * @param instanceWeights Array of weights for each training example
    +   * @param metadata learning and dataset metadata for DecisionTree
    +   * @param splits splits(i) = Array of possible splits for feature i
    +   * @return LeafNode or InternalNode representation of rootNode
    +   */
    +  private[ml] def trainRegressor(
    +      rootNode: LearningNode,
    +      colStoreInit: Array[Array[Int]],
    +      instanceWeights: Array[Double],
    +      labels: Array[Double],
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Node = {
    +
    +    // Sort each column by decision tree node.
    +    val colStore: Array[FeatureVector] = colStoreInit.zipWithIndex.map { case (col, featureIndex) =>
    +      val featureArity: Int = metadata.featureArity.getOrElse(featureIndex, 0)
    +      FeatureVector(featureIndex, featureArity, col)
    +    }
    +
    +    val numRows = colStore.headOption match {
    +      case None => 0
    +      case Some(column) => column.values.length
    +    }
    +
    +    // Create a new PartitionInfo describing the status of our partially-trained subtree
    +    // at each iteration of training
    +    var trainingInfo: TrainingInfo = TrainingInfo(colStore, instanceWeights,
    +      nodeOffsets = Array[(Int, Int)]((0, numRows)), activeNodes = Array(rootNode))
    +
    +    // Iteratively learn, one level of the tree at a time.
    +    // Note: We do not use node IDs.
    +    var currentLevel = 0
    +    var doneLearning = false
    +
    +    while (currentLevel < metadata.maxDepth && !doneLearning) {
    +      // Splits each active node if possible, returning an array of new active nodes
    +      val activeNodes: Array[LearningNode] =
    +        computeBestSplits(trainingInfo, labels, metadata, splits)
    +      // Filter active node periphery by impurity.
    +      val estimatedRemainingActive = activeNodes.count(_.stats.impurity > 0.0)
    --- End diff --
    
    Use `activeNodes.count(!_.isLeaf)` instead; it makes the code simpler.
    Also, as mentioned above, `activeNodes` would be better renamed to `nextLevelNodes`.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83104 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83104/testReport)** for PR 19433 at commit [`9cc6333`](https://github.com/apache/spark/commit/9cc63334900d9928f58e3bab5bd1be452d042a53).


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r147317401
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/LocalDecisionTree.scala ---
    @@ -0,0 +1,255 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml.tree.impl
    +
    +import org.apache.spark.ml.tree._
    +import org.apache.spark.mllib.tree.model.ImpurityStats
    +
    +/** Object exposing methods for local training of decision trees */
    +private[ml] object LocalDecisionTree {
    +
    +  /**
    +   * Fully splits the passed-in node on the provided local dataset, returning
    +   * an InternalNode/LeafNode corresponding to the root of the resulting tree.
    +   *
    +   * @param node LearningNode to use as the root of the subtree fit on the passed-in dataset
    +   * @param metadata learning and dataset metadata for DecisionTree
    +   * @param splits splits(i) = array of splits for feature i
    +   */
    +  private[ml] def fitNode(
    +      input: Array[TreePoint],
    +      instanceWeights: Array[Double],
    +      node: LearningNode,
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Node = {
    +
    +    // The case with 1 node (depth = 0) is handled separately.
    +    // This allows all iterations in the depth > 0 case to use the same code.
    +    // TODO: Check that learning works when maxDepth > 0 but learning stops at 1 node (because of
    +    //       other parameters).
    +    if (metadata.maxDepth == 0) {
    +      return node.toNode
    +    }
    +
    +    // Prepare column store.
    +    //   Note: rowToColumnStoreDense checks to make sure numRows < Int.MaxValue.
    +    val colStoreInit: Array[Array[Int]]
    +    = LocalDecisionTreeUtils.rowToColumnStoreDense(input.map(_.binnedFeatures))
    +    val labels = input.map(_.label)
    +
    +    // Fit a regression model on the dataset, throwing an error if metadata indicates that
    +    // we should train a classifier.
    +    // TODO: Add support for training classifiers
    +    if (metadata.numClasses > 1 && metadata.numClasses <= 32) {
    +      throw new UnsupportedOperationException("Local training of a decision tree classifier is " +
    +        "unsupported; currently, only regression is supported")
    +    } else {
    +      trainRegressor(node, colStoreInit, instanceWeights, labels, metadata, splits)
    +    }
    +  }
    +
    +  /**
    +   * Locally fits a decision tree regressor.
    +   * TODO(smurching): Logic for fitting a classifier & regressor is the same; only difference
    +   * is impurity metric. Use the same logic for fitting a classifier.
    +   *
    +   * @param rootNode Node to use as root of the tree fit on the passed-in dataset
    +   * @param colStoreInit Array of columns of training data
    +   * @param instanceWeights Array of weights for each training example
    +   * @param metadata learning and dataset metadata for DecisionTree
    +   * @param splits splits(i) = Array of possible splits for feature i
    +   * @return LeafNode or InternalNode representation of rootNode
    +   */
    +  private[ml] def trainRegressor(
    +      rootNode: LearningNode,
    +      colStoreInit: Array[Array[Int]],
    +      instanceWeights: Array[Double],
    +      labels: Array[Double],
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Node = {
    +
    +    // Sort each column by decision tree node.
    +    val colStore: Array[FeatureVector] = colStoreInit.zipWithIndex.map { case (col, featureIndex) =>
    +      val featureArity: Int = metadata.featureArity.getOrElse(featureIndex, 0)
    +      FeatureVector(featureIndex, featureArity, col)
    +    }
    +
    +    val numRows = colStore.headOption match {
    +      case None => 0
    +      case Some(column) => column.values.length
    +    }
    +
    +    // Create a new TrainingInfo describing the status of our partially-trained subtree
    +    // at each iteration of training
    +    var trainingInfo: TrainingInfo = TrainingInfo(colStore,
    +      nodeOffsets = Array[(Int, Int)]((0, numRows)), currentLevelActiveNodes = Array(rootNode))
    +
    +    // Iteratively learn, one level of the tree at a time.
    +    // Note: We do not use node IDs.
    +    var currentLevel = 0
    +    var doneLearning = false
    +
    +    while (currentLevel < metadata.maxDepth && !doneLearning) {
    +      // Splits each active node if possible, returning an array of new active nodes
    +      val nextLevelNodes: Array[LearningNode] =
    +        computeBestSplits(trainingInfo, instanceWeights, labels, metadata, splits)
    +      // Count number of non-leaf nodes in the next level
    +      val estimatedRemainingActive = nextLevelNodes.count(!_.isLeaf)
    +      // TODO: Check to make sure we split something, and stop otherwise.
    +      doneLearning = currentLevel + 1 >= metadata.maxDepth || estimatedRemainingActive == 0
    +      if (!doneLearning) {
    +        // Obtain a new trainingInfo instance describing our current training status
    +        trainingInfo = trainingInfo.update(splits, nextLevelNodes)
    +      }
    +      currentLevel += 1
    +    }
    +
    +    // Done with learning
    +    rootNode.toNode
    +  }
    +
    +  /**
    +   * Iterate over feature values and labels for a specific (node, feature), updating stats
    +   * aggregator for the current node.
    +   */
    +  private[impl] def updateAggregator(
    +      statsAggregator: DTStatsAggregator,
    +      col: FeatureVector,
    +      instanceWeights: Array[Double],
    +      labels: Array[Double],
    +      from: Int,
    +      to: Int,
    +      featureIndexIdx: Int,
    +      featureSplits: Array[Split]): Unit = {
    +    val metadata = statsAggregator.metadata
    +    if (metadata.isUnordered(col.featureIndex)) {
    +      from.until(to).foreach { idx =>
    +        val rowIndex = col.indices(idx)
    +        AggUpdateUtils.updateUnorderedFeature(statsAggregator, col.values(idx), labels(rowIndex),
    +          featureIndex = col.featureIndex, featureIndexIdx, featureSplits,
    +          instanceWeight = instanceWeights(rowIndex))
    +      }
    +    } else {
    +      from.until(to).foreach { idx =>
    +        val rowIndex = col.indices(idx)
    +        AggUpdateUtils.updateOrderedFeature(statsAggregator, col.values(idx), labels(rowIndex),
    +          featureIndex = col.featureIndex, featureIndexIdx,
    +          instanceWeight = instanceWeights(rowIndex))
    +      }
    +    }
    +  }
    +
    +  /**
    +   * Find the best splits for all active nodes
    +   *
    +   * @param trainingInfo Contains node offset info for current set of active nodes
    +   * @return  Array of new active nodes formed by splitting the current set of active nodes.
    +   */
    +  private def computeBestSplits(
    +      trainingInfo: TrainingInfo,
    +      instanceWeights: Array[Double],
    +      labels: Array[Double],
    +      metadata: DecisionTreeMetadata,
    +      splits: Array[Array[Split]]): Array[LearningNode] = {
    +    // For each node, select the best split across all features
    +    trainingInfo match {
    +      case TrainingInfo(columns: Array[FeatureVector],
    +      nodeOffsets: Array[(Int, Int)], activeNodes: Array[LearningNode]) => {
    --- End diff --
    
    `activeNodes` ==> `currentLevelNodes`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83093 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83093/testReport)** for PR 19433 at commit [`ebade23`](https://github.com/apache/spark/commit/ebade2349153708c125fbeca900daec3d84c3513).


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #3983 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3983/testReport)** for PR 19433 at commit [`b7e6e40`](https://github.com/apache/spark/commit/b7e6e40976612546b81d9775c194b274c146dc85).
     * This patch **fails to generate documentation**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #82717 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82717/testReport)** for PR 19433 at commit [`c9a8e01`](https://github.com/apache/spark/commit/c9a8e01cead78d2ea6eb6a9ffa007314cbbcf60a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #19433: [SPARK-3162] [MLlib] Add local tree training for ...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19433#discussion_r150352747
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/InformationGainStats.scala ---
    @@ -112,7 +113,7 @@ private[spark] object ImpurityStats {
        * minimum number of instances per node.
        */
       def getInvalidImpurityStats(impurityCalculator: ImpurityCalculator): ImpurityStats = {
    -    new ImpurityStats(Double.MinValue, impurityCalculator.calculate(),
    +    new ImpurityStats(Double.MinValue, impurity = -1,
    --- End diff --
    
    Q: Why -1 here?


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    CC @dbtsai in case you're interested, b/c of Sequoia forests


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83219 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83219/testReport)** for PR 19433 at commit [`7efb1e0`](https://github.com/apache/spark/commit/7efb1e0b551235036d7e24dfe16f6fb9e517e503).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    The failing SparkR test (which compares `RandomForest` predictions to hardcoded values) breaks not because of a correctness issue but (AFAICT) because of an implementation change in best-split selection.
    
    In this PR we recompute parent node impurity stats when considering each split for a feature, instead of computing parent impurity stats once per feature (compare `RandomForest.calculateImpurityStats` in Spark master with `ImpurityUtils.calculateImpurityStats` in this PR).
    
    Repeatedly computing parent impurity stats yields slightly different values at each iteration due to Double precision limitations. This in turn can cause different splits to be selected (e.g. if two splits have mathematically equal gains, Double precision limitations can give one split a slightly higher or lower gain than the other, changing the tiebreak).
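
    A toy illustration (not code from the PR) of the tiebreaking effect: Double addition is not associative, so accumulating the same parent stats in a different order can nudge two mathematically equal gains apart.

    ```scala
    object FloatTieDemo extends App {
      // The same mathematical sum, accumulated in two different orders:
      val leftToRight = (0.1 + 0.2) + 0.3 // 0.6000000000000001
      val rightToLeft = 0.1 + (0.2 + 0.3) // 0.6
      println(leftToRight == rightToLeft) // false

      // With mathematically equal gains, a perturbation this small decides
      // which split maxBy(_._2.gain) selects.
      val gains = Seq(("splitA", leftToRight), ("splitB", rightToLeft))
      println(gains.maxBy(_._2)._1) // splitA -- wins only by rounding error
    }
    ```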


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83507/
    Test FAILed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by smurching <gi...@git.apache.org>.
Github user smurching commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Sorry, I realized I conflated feature subsampling with `subsampleWeights` (instance weights for training examples). IMO feature subsampling can be added in a follow-up PR, but `subsampleWeights` should go into this PR.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #82557 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82557/testReport)** for PR 19433 at commit [`9a7174e`](https://github.com/apache/spark/commit/9a7174ed4a62033abfe2325dc1a8c5850e07f5f3).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class LocalTreeIntegrationSuite extends SparkFunSuite with MLlibTestSparkContext `
      * `class LocalTreeUtilsSuite extends SparkFunSuite `


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3156/
    Test PASSed.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83025 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83025/testReport)** for PR 19433 at commit [`fd6cdbb`](https://github.com/apache/spark/commit/fd6cdbb66fd8beae944d0bd2029e8d62096747a8).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #83507 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83507/testReport)** for PR 19433 at commit [`b7e6e40`](https://github.com/apache/spark/commit/b7e6e40976612546b81d9775c194b274c146dc85).


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    jenkins retest this please


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    **[Test build #82652 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82652/testReport)** for PR 19433 at commit [`5c29d3d`](https://github.com/apache/spark/commit/5c29d3d1e899c8d311633c4d763b57e42a26c660).


---



[GitHub] spark issue #19433: [SPARK-3162] [MLlib] Add local tree training for decisio...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19433
  
    Build finished. Test FAILed.


---
