You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by sethah <gi...@git.apache.org> on 2015/12/09 23:23:25 UTC

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

GitHub user sethah opened a pull request:

    https://github.com/apache/spark/pull/10231

    [SPARK-12182][ML] Distributed binning for trees in spark.ml

    This PR changes the `findSplits` method in spark.ml to perform split calculations on the workers. This PR is meant to copy [PR-8246](https://github.com/apache/spark/pull/8246) which added the same feature for MLlib. 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sethah/spark SPARK-12182

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10231.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10231
    
----
commit 16d936eafa3a93ea6d22fdee60441eabb86a3c65
Author: sethah <se...@gmail.com>
Date:   2015-12-08T23:08:32Z

    distributed findsplits for spark ml

commit 2a4383332ff4670500a9ce55c79e88c0c9d067ec
Author: sethah <se...@gmail.com>
Date:   2015-12-09T00:17:34Z

    removing helper function

commit 8f06b3415553b78bf3d73259750439e9319d5fdb
Author: sethah <se...@gmail.com>
Date:   2015-12-09T22:18:15Z

    type ascription for splits

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10231#discussion_r47164981
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -842,60 +842,63 @@ private[ml] object RandomForest extends Logging {
             1.0
           }
           logDebug("fraction of data used for calculating quantiles = " + fraction)
    -      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt()).collect()
    +      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt())
         } else {
    -      new Array[LabeledPoint](0)
    +      input.sparkContext.emptyRDD[LabeledPoint]
         }
     
    -    val splits = new Array[Array[Split]](numFeatures)
    -
    -    // Find all splits.
    -    // Iterate over all features.
    -    var featureIndex = 0
    -    while (featureIndex < numFeatures) {
    -      if (metadata.isContinuous(featureIndex)) {
    -        val featureSamples = sampledInput.map(_.features(featureIndex))
    -        val featureSplits = findSplitsForContinuousFeature(featureSamples, metadata, featureIndex)
    +    findSplitsBinsBySorting(sampledInput, metadata, continuousFeatures)
    +  }
     
    -        val numSplits = featureSplits.length
    -        logDebug(s"featureIndex = $featureIndex, numSplits = $numSplits")
    -        splits(featureIndex) = new Array[Split](numSplits)
    +  private def findSplitsBinsBySorting(
    +      input: RDD[LabeledPoint],
    +      metadata: DecisionTreeMetadata,
    +      continuousFeatures: IndexedSeq[Int]): Array[Array[Split]] = {
    +
    +    val continuousSplits = {
    +      // reduce the parallelism for split computations when there are less
    +      // continuous features than input partitions. this prevents tasks from
    +      // being spun up that will definitely do no work.
    +      val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
    +
    +      input
    +        .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
    +        .groupByKey(numPartitions)
    +        .map { case (idx, samples) =>
    +          val thresholds = findSplitsForContinuousFeature(samples.toArray, metadata, idx)
    +          val splits: Array[Split] = thresholds.map(thresh => new ContinuousSplit(idx, thresh))
    --- End diff --
    
    (as mentioned in jenkins): scala style long line


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198622871
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198997927
  
    LGTM
    Thanks @sethah for the PR and @NathanHowell for reviewing!
    Merging with master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10231#discussion_r56441087
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -842,60 +842,59 @@ private[ml] object RandomForest extends Logging {
             1.0
           }
           logDebug("fraction of data used for calculating quantiles = " + fraction)
    -      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt()).collect()
    +      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt())
         } else {
    -      new Array[LabeledPoint](0)
    +      input.sparkContext.emptyRDD[LabeledPoint]
         }
     
    -    val splits = new Array[Array[Split]](numFeatures)
    -
    -    // Find all splits.
    -    // Iterate over all features.
    -    var featureIndex = 0
    -    while (featureIndex < numFeatures) {
    -      if (metadata.isContinuous(featureIndex)) {
    -        val featureSamples = sampledInput.map(_.features(featureIndex))
    -        val featureSplits = findSplitsForContinuousFeature(featureSamples, metadata, featureIndex)
    +    findSplitsBinsBySorting(sampledInput, metadata, continuousFeatures)
    +  }
     
    -        val numSplits = featureSplits.length
    -        logDebug(s"featureIndex = $featureIndex, numSplits = $numSplits")
    -        splits(featureIndex) = new Array[Split](numSplits)
    +  private def findSplitsBinsBySorting(
    +      input: RDD[LabeledPoint],
    +      metadata: DecisionTreeMetadata,
    +      continuousFeatures: IndexedSeq[Int]): Array[Array[Split]] = {
    +
    +    val continuousSplits = {
    +      // reduce the parallelism for split computations when there are less
    +      // continuous features than input partitions. this prevents tasks from
    +      // being spun up that will definitely do no work.
    +      val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
    +
    +      input
    +        .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
    +        .groupByKey(numPartitions)
    +        .map { case (idx, samples) =>
    +          val thresholds = findSplitsForContinuousFeature(samples.toArray, metadata, idx)
    +          val splits: Array[Split] = thresholds.map(thresh => new ContinuousSplit(idx, thresh))
    +          logDebug(s"featureIndex = $idx, numSplits = ${splits.length}")
    +          (idx, splits)
    +        }.collectAsMap()
    +    }
     
    -        var splitIndex = 0
    -        while (splitIndex < numSplits) {
    -          val threshold = featureSplits(splitIndex)
    -          splits(featureIndex)(splitIndex) = new ContinuousSplit(featureIndex, threshold)
    -          splitIndex += 1
    -        }
    -      } else {
    -        // Categorical feature
    -        if (metadata.isUnordered(featureIndex)) {
    -          val numSplits = metadata.numSplits(featureIndex)
    -          val featureArity = metadata.featureArity(featureIndex)
    -          // TODO: Use an implicit representation mapping each category to a subset of indices.
    -          //       I.e., track indices such that we can calculate the set of bins for which
    -          //       feature value x splits to the left.
    -          // Unordered features
    -          // 2^(maxFeatureValue - 1) - 1 combinations
    -          splits(featureIndex) = new Array[Split](numSplits)
    -          var splitIndex = 0
    -          while (splitIndex < numSplits) {
    -            val categories: List[Double] =
    -              extractMultiClassCategories(splitIndex + 1, featureArity)
    -            splits(featureIndex)(splitIndex) =
    -              new CategoricalSplit(featureIndex, categories.toArray, featureArity)
    -            splitIndex += 1
    -          }
    -        } else {
    -          // Ordered features
    -          //   Bins correspond to feature values, so we do not need to compute splits or bins
    -          //   beforehand.  Splits are constructed as needed during training.
    -          splits(featureIndex) = new Array[Split](0)
    +    val numFeatures = metadata.numFeatures
    +    val splits = Array.tabulate(numFeatures) {
    --- End diff --
    
    Put explicit type for clarity


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10231#discussion_r47165377
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -842,60 +842,63 @@ private[ml] object RandomForest extends Logging {
             1.0
           }
           logDebug("fraction of data used for calculating quantiles = " + fraction)
    -      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt()).collect()
    +      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt())
         } else {
    -      new Array[LabeledPoint](0)
    +      input.sparkContext.emptyRDD[LabeledPoint]
         }
     
    -    val splits = new Array[Array[Split]](numFeatures)
    -
    -    // Find all splits.
    -    // Iterate over all features.
    -    var featureIndex = 0
    -    while (featureIndex < numFeatures) {
    -      if (metadata.isContinuous(featureIndex)) {
    -        val featureSamples = sampledInput.map(_.features(featureIndex))
    -        val featureSplits = findSplitsForContinuousFeature(featureSamples, metadata, featureIndex)
    +    findSplitsBinsBySorting(sampledInput, metadata, continuousFeatures)
    +  }
     
    -        val numSplits = featureSplits.length
    -        logDebug(s"featureIndex = $featureIndex, numSplits = $numSplits")
    -        splits(featureIndex) = new Array[Split](numSplits)
    +  private def findSplitsBinsBySorting(
    +      input: RDD[LabeledPoint],
    +      metadata: DecisionTreeMetadata,
    +      continuousFeatures: IndexedSeq[Int]): Array[Array[Split]] = {
    +
    +    val continuousSplits = {
    +      // reduce the parallelism for split computations when there are less
    +      // continuous features than input partitions. this prevents tasks from
    +      // being spun up that will definitely do no work.
    +      val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
    +
    +      input
    +        .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
    +        .groupByKey(numPartitions)
    +        .map { case (idx, samples) =>
    +          val thresholds = findSplitsForContinuousFeature(samples.toArray, metadata, idx)
    +          val splits: Array[Split] = thresholds.map(thresh => new ContinuousSplit(idx, thresh))
    --- End diff --
    
    Fixed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163998876
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10231#discussion_r56707558
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -842,60 +842,59 @@ private[ml] object RandomForest extends Logging {
             1.0
           }
           logDebug("fraction of data used for calculating quantiles = " + fraction)
    -      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt()).collect()
    +      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt())
         } else {
    -      new Array[LabeledPoint](0)
    +      input.sparkContext.emptyRDD[LabeledPoint]
         }
     
    -    val splits = new Array[Array[Split]](numFeatures)
    -
    -    // Find all splits.
    -    // Iterate over all features.
    -    var featureIndex = 0
    -    while (featureIndex < numFeatures) {
    -      if (metadata.isContinuous(featureIndex)) {
    -        val featureSamples = sampledInput.map(_.features(featureIndex))
    -        val featureSplits = findSplitsForContinuousFeature(featureSamples, metadata, featureIndex)
    +    findSplitsBinsBySorting(sampledInput, metadata, continuousFeatures)
    +  }
     
    -        val numSplits = featureSplits.length
    -        logDebug(s"featureIndex = $featureIndex, numSplits = $numSplits")
    -        splits(featureIndex) = new Array[Split](numSplits)
    +  private def findSplitsBinsBySorting(
    +      input: RDD[LabeledPoint],
    +      metadata: DecisionTreeMetadata,
    +      continuousFeatures: IndexedSeq[Int]): Array[Array[Split]] = {
    +
    +    val continuousSplits = {
    +      // reduce the parallelism for split computations when there are less
    +      // continuous features than input partitions. this prevents tasks from
    +      // being spun up that will definitely do no work.
    +      val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
    +
    +      input
    +        .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
    +        .groupByKey(numPartitions)
    +        .map { case (idx, samples) =>
    +          val thresholds = findSplitsForContinuousFeature(samples.toArray, metadata, idx)
    --- End diff --
    
    Are you suggesting modifying `findSplitsForContinousFeature` to accept an `Iterator[Double]` instead of `Array[Double]`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10231#discussion_r47382165
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -842,60 +842,63 @@ private[ml] object RandomForest extends Logging {
             1.0
           }
           logDebug("fraction of data used for calculating quantiles = " + fraction)
    -      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt()).collect()
    +      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt())
         } else {
    -      new Array[LabeledPoint](0)
    +      input.sparkContext.emptyRDD[LabeledPoint]
         }
     
    -    val splits = new Array[Array[Split]](numFeatures)
    -
    -    // Find all splits.
    -    // Iterate over all features.
    -    var featureIndex = 0
    -    while (featureIndex < numFeatures) {
    -      if (metadata.isContinuous(featureIndex)) {
    -        val featureSamples = sampledInput.map(_.features(featureIndex))
    -        val featureSplits = findSplitsForContinuousFeature(featureSamples, metadata, featureIndex)
    +    findSplitsBinsBySorting(sampledInput, metadata, continuousFeatures)
    +  }
     
    -        val numSplits = featureSplits.length
    -        logDebug(s"featureIndex = $featureIndex, numSplits = $numSplits")
    -        splits(featureIndex) = new Array[Split](numSplits)
    +  private def findSplitsBinsBySorting(
    +      input: RDD[LabeledPoint],
    +      metadata: DecisionTreeMetadata,
    +      continuousFeatures: IndexedSeq[Int]): Array[Array[Split]] = {
    +
    +    val continuousSplits = {
    +      // reduce the parallelism for split computations when there are less
    +      // continuous features than input partitions. this prevents tasks from
    +      // being spun up that will definitely do no work.
    +      val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
    +
    +      input
    +        .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
    +        .groupByKey(numPartitions)
    +        .map { case (idx, samples) =>
    +          val thresholds = findSplitsForContinuousFeature(samples.toArray, metadata, idx)
    +          val splits: Array[Split] = thresholds.map(thresh => new ContinuousSplit(idx, thresh))
    +          logDebug(s"featureIndex = $idx, numSplits = ${splits.length}")
    +          (idx, splits)
    +        }.collectAsMap()
    +    }
     
    -        var splitIndex = 0
    -        while (splitIndex < numSplits) {
    -          val threshold = featureSplits(splitIndex)
    -          splits(featureIndex)(splitIndex) = new ContinuousSplit(featureIndex, threshold)
    -          splitIndex += 1
    -        }
    -      } else {
    -        // Categorical feature
    -        if (metadata.isUnordered(featureIndex)) {
    -          val numSplits = metadata.numSplits(featureIndex)
    -          val featureArity = metadata.featureArity(featureIndex)
    -          // TODO: Use an implicit representation mapping each category to a subset of indices.
    -          //       I.e., track indices such that we can calculate the set of bins for which
    -          //       feature value x splits to the left.
    -          // Unordered features
    -          // 2^(maxFeatureValue - 1) - 1 combinations
    -          splits(featureIndex) = new Array[Split](numSplits)
    -          var splitIndex = 0
    -          while (splitIndex < numSplits) {
    -            val categories: List[Double] =
    -              extractMultiClassCategories(splitIndex + 1, featureArity)
    -            splits(featureIndex)(splitIndex) =
    -              new CategoricalSplit(featureIndex, categories.toArray, featureArity)
    -            splitIndex += 1
    -          }
    -        } else {
    -          // Ordered features
    -          //   Bins correspond to feature values, so we do not need to compute splits or bins
    -          //   beforehand.  Splits are constructed as needed during training.
    -          splits(featureIndex) = new Array[Split](0)
    +    val numFeatures = metadata.numFeatures
    +    val splits = Range(0, numFeatures).map {
    +      case i if metadata.isContinuous(i) =>
    +        val split = continuousSplits(i)
    +        metadata.setNumSplits(i, split.length)
    +        split
    +
    +      case i if metadata.isCategorical(i) && metadata.isUnordered(i) =>
    +        // Unordered features
    +        // 2^(maxFeatureValue - 1) - 1 combinations
    +        val featureArity = metadata.featureArity(i)
    +        val split: IndexedSeq[Split] = Range(0, metadata.numSplits(i)).map { splitIndex =>
    --- End diff --
    
    Done. Thanks for the suggestion!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10231#discussion_r56440922
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -842,60 +842,59 @@ private[ml] object RandomForest extends Logging {
             1.0
           }
           logDebug("fraction of data used for calculating quantiles = " + fraction)
    -      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt()).collect()
    +      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt())
         } else {
    -      new Array[LabeledPoint](0)
    +      input.sparkContext.emptyRDD[LabeledPoint]
         }
     
    -    val splits = new Array[Array[Split]](numFeatures)
    -
    -    // Find all splits.
    -    // Iterate over all features.
    -    var featureIndex = 0
    -    while (featureIndex < numFeatures) {
    -      if (metadata.isContinuous(featureIndex)) {
    -        val featureSamples = sampledInput.map(_.features(featureIndex))
    -        val featureSplits = findSplitsForContinuousFeature(featureSamples, metadata, featureIndex)
    +    findSplitsBinsBySorting(sampledInput, metadata, continuousFeatures)
    +  }
     
    -        val numSplits = featureSplits.length
    -        logDebug(s"featureIndex = $featureIndex, numSplits = $numSplits")
    -        splits(featureIndex) = new Array[Split](numSplits)
    +  private def findSplitsBinsBySorting(
    +      input: RDD[LabeledPoint],
    +      metadata: DecisionTreeMetadata,
    +      continuousFeatures: IndexedSeq[Int]): Array[Array[Split]] = {
    +
    +    val continuousSplits = {
    --- End diff --
    
    Put type here for code clarity


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by jodersky <gi...@git.apache.org>.

Github user jodersky commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10231#discussion_r47274795
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -842,60 +842,63 @@ private[ml] object RandomForest extends Logging {
             1.0
           }
           logDebug("fraction of data used for calculating quantiles = " + fraction)
    -      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt()).collect()
    +      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt())
         } else {
    -      new Array[LabeledPoint](0)
    +      input.sparkContext.emptyRDD[LabeledPoint]
         }
     
    -    val splits = new Array[Array[Split]](numFeatures)
    -
    -    // Find all splits.
    -    // Iterate over all features.
    -    var featureIndex = 0
    -    while (featureIndex < numFeatures) {
    -      if (metadata.isContinuous(featureIndex)) {
    -        val featureSamples = sampledInput.map(_.features(featureIndex))
    -        val featureSplits = findSplitsForContinuousFeature(featureSamples, metadata, featureIndex)
    +    findSplitsBinsBySorting(sampledInput, metadata, continuousFeatures)
    +  }
     
    -        val numSplits = featureSplits.length
    -        logDebug(s"featureIndex = $featureIndex, numSplits = $numSplits")
    -        splits(featureIndex) = new Array[Split](numSplits)
    +  private def findSplitsBinsBySorting(
    +      input: RDD[LabeledPoint],
    +      metadata: DecisionTreeMetadata,
    +      continuousFeatures: IndexedSeq[Int]): Array[Array[Split]] = {
    +
    +    val continuousSplits = {
    +      // reduce the parallelism for split computations when there are less
    +      // continuous features than input partitions. this prevents tasks from
    +      // being spun up that will definitely do no work.
    +      val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
    +
    +      input
    +        .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
    +        .groupByKey(numPartitions)
    +        .map { case (idx, samples) =>
    +          val thresholds = findSplitsForContinuousFeature(samples.toArray, metadata, idx)
    +          val splits: Array[Split] = thresholds.map(thresh => new ContinuousSplit(idx, thresh))
    +          logDebug(s"featureIndex = $idx, numSplits = ${splits.length}")
    +          (idx, splits)
    +        }.collectAsMap()
    +    }
     
    -        var splitIndex = 0
    -        while (splitIndex < numSplits) {
    -          val threshold = featureSplits(splitIndex)
    -          splits(featureIndex)(splitIndex) = new ContinuousSplit(featureIndex, threshold)
    -          splitIndex += 1
    -        }
    -      } else {
    -        // Categorical feature
    -        if (metadata.isUnordered(featureIndex)) {
    -          val numSplits = metadata.numSplits(featureIndex)
    -          val featureArity = metadata.featureArity(featureIndex)
    -          // TODO: Use an implicit representation mapping each category to a subset of indices.
    -          //       I.e., track indices such that we can calculate the set of bins for which
    -          //       feature value x splits to the left.
    -          // Unordered features
    -          // 2^(maxFeatureValue - 1) - 1 combinations
    -          splits(featureIndex) = new Array[Split](numSplits)
    -          var splitIndex = 0
    -          while (splitIndex < numSplits) {
    -            val categories: List[Double] =
    -              extractMultiClassCategories(splitIndex + 1, featureArity)
    -            splits(featureIndex)(splitIndex) =
    -              new CategoricalSplit(featureIndex, categories.toArray, featureArity)
    -            splitIndex += 1
    -          }
    -        } else {
    -          // Ordered features
    -          //   Bins correspond to feature values, so we do not need to compute splits or bins
    -          //   beforehand.  Splits are constructed as needed during training.
    -          splits(featureIndex) = new Array[Split](0)
    +    val numFeatures = metadata.numFeatures
    +    val splits = Range(0, numFeatures).map {
    +      case i if metadata.isContinuous(i) =>
    +        val split = continuousSplits(i)
    +        metadata.setNumSplits(i, split.length)
    +        split
    +
    +      case i if metadata.isCategorical(i) && metadata.isUnordered(i) =>
    +        // Unordered features
    +        // 2^(maxFeatureValue - 1) - 1 combinations
    +        val featureArity = metadata.featureArity(i)
    +        val split: IndexedSeq[Split] = Range(0, metadata.numSplits(i)).map { splitIndex =>
    --- End diff --
    
    You could use an Array.tablulate here. Something like
    ```scala
    Array.tabulate[Split](numSplits(i)){splitIndex =>
    ...
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198615653
  
    **[Test build #53591 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53591/consoleFull)** for PR 10231 at commit [`c9bec20`](https://github.com/apache/spark/commit/c9bec2050e273c205456d50028c13494407b50ab).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-197620380
  
    No specific dataset size.  I was thinking of something in this ballpark:
    * 1K-10K rows
    * 10-100 columns
    * maxDepth 1 - 2 (shallow tree to avoid amortizing cost of choosing splits)
    
    Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198502199
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53552/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163427555
  
    **[Test build #47453 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47453/consoleFull)** for PR 10231 at commit [`6c4ba6f`](https://github.com/apache/spark/commit/6c4ba6fff8813b931f3e38a79879f6d06ec371fb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198519802
  
    **[Test build #53555 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53555/consoleFull)** for PR 10231 at commit [`d8a4c77`](https://github.com/apache/spark/commit/d8a4c776b668ceb999a8af53c5e71e09a2ba9de1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163437393
  
    **[Test build #47453 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47453/consoleFull)** for PR 10231 at commit [`6c4ba6f`](https://github.com/apache/spark/commit/6c4ba6fff8813b931f3e38a79879f6d06ec371fb).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198482885
  
    @jkbradley I ran some local timings comparing before/after this change. I used `RandomForestRegressor` with all continuous features. It looks like there is a small performance impact on micro datasets, but no noticeable performance hit on larger in-memory datasets. What do you think?
    
    I just ran five trials each, but I can set up something more robust if needed.
    
    ```
    options = {'numRows': 10k, 'numCols': 100, 'maxDepth': 2}
       with_patch  without_patch
    0    0.991490       0.778417
    1    0.867575       0.862355
    2    0.894913       0.987718
    3    0.920691       0.790363
    4    0.933628       0.951237
    ```
    
    ```
    options = {'numRows': 1k, 'numCols': 10, 'maxDepth': 2}
       with_patch  without_patch
    0    0.038660       0.015930
    1    0.051568       0.015814
    2    0.039481       0.018386
    3    0.044415       0.016335
    4    0.049889       0.017497
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by NathanHowell <gi...@git.apache.org>.

Github user NathanHowell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10231#discussion_r56742884
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -956,7 +956,7 @@ private[ml] object RandomForest extends Logging {
             valueCounts.map(_._1)
           } else {
             // stride between splits
    -        val stride: Double = featureSamples.length.toDouble / (numSplits + 1)
    +        val stride: Double = featureSamples.size.toDouble / (numSplits + 1)
    --- End diff --
    
    This will do a second pass over the `Iterable`. Would it be preferable to combine this into the `foldLeft` above so it only does a single pass?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198493930
  
    That does not seem that bad.  I'd say we should go ahead with your PR.  If we want to optimize for small data, we can add a local implementation at some point.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198502197
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by NathanHowell <gi...@git.apache.org>.

Github user NathanHowell commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-169393380
  
    @sethah looks good to me. :+1: 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-197609041
  
    **[Test build #2645 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2645/consoleFull)** for PR 10231 at commit [`c34075b`](https://github.com/apache/spark/commit/c34075be4d5110644a68309be6664901be9974d7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163423636
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-197596033
  
    Would you have time to test this on a small dataset?  The original PR confirmed it's faster for a larger dataset, but I'm curious if it affects timing (adversely) on small data.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198502178
  
    **[Test build #53552 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53552/consoleFull)** for PR 10231 at commit [`a847bc9`](https://github.com/apache/spark/commit/a847bc9d9b47f647a9326bec91672b878ceee640).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163998877
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47583/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198519915
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163419168
  
    @NathanHowell would you be able to review this?
    
    cc @jkbradley 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by NathanHowell <gi...@git.apache.org>.

Github user NathanHowell commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163422959
  
    Yeah I can take a look tonight or tomorrow
    On Dec 9, 2015 14:25, "Seth Hendrickson" <no...@github.com> wrote:
    
    > @NathanHowell <https://github.com/NathanHowell> would you be able to
    > review this?
    >
    > cc @jkbradley <https://github.com/jkbradley>
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/10231#issuecomment-163419168>.
    >



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10231#discussion_r56441090
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -842,60 +842,59 @@ private[ml] object RandomForest extends Logging {
             1.0
           }
           logDebug("fraction of data used for calculating quantiles = " + fraction)
    -      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt()).collect()
    +      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt())
         } else {
    -      new Array[LabeledPoint](0)
    +      input.sparkContext.emptyRDD[LabeledPoint]
         }
     
    -    val splits = new Array[Array[Split]](numFeatures)
    -
    -    // Find all splits.
    -    // Iterate over all features.
    -    var featureIndex = 0
    -    while (featureIndex < numFeatures) {
    -      if (metadata.isContinuous(featureIndex)) {
    -        val featureSamples = sampledInput.map(_.features(featureIndex))
    -        val featureSplits = findSplitsForContinuousFeature(featureSamples, metadata, featureIndex)
    +    findSplitsBinsBySorting(sampledInput, metadata, continuousFeatures)
    +  }
     
    -        val numSplits = featureSplits.length
    -        logDebug(s"featureIndex = $featureIndex, numSplits = $numSplits")
    -        splits(featureIndex) = new Array[Split](numSplits)
    +  private def findSplitsBinsBySorting(
    +      input: RDD[LabeledPoint],
    +      metadata: DecisionTreeMetadata,
    +      continuousFeatures: IndexedSeq[Int]): Array[Array[Split]] = {
    +
    +    val continuousSplits = {
    +      // reduce the parallelism for split computations when there are less
    +      // continuous features than input partitions. this prevents tasks from
    +      // being spun up that will definitely do no work.
    +      val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
    +
    +      input
    +        .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
    +        .groupByKey(numPartitions)
    +        .map { case (idx, samples) =>
    +          val thresholds = findSplitsForContinuousFeature(samples.toArray, metadata, idx)
    +          val splits: Array[Split] = thresholds.map(thresh => new ContinuousSplit(idx, thresh))
    +          logDebug(s"featureIndex = $idx, numSplits = ${splits.length}")
    +          (idx, splits)
    +        }.collectAsMap()
    +    }
     
    -        var splitIndex = 0
    -        while (splitIndex < numSplits) {
    -          val threshold = featureSplits(splitIndex)
    -          splits(featureIndex)(splitIndex) = new ContinuousSplit(featureIndex, threshold)
    -          splitIndex += 1
    -        }
    -      } else {
    -        // Categorical feature
    -        if (metadata.isUnordered(featureIndex)) {
    -          val numSplits = metadata.numSplits(featureIndex)
    -          val featureArity = metadata.featureArity(featureIndex)
    -          // TODO: Use an implicit representation mapping each category to a subset of indices.
    -          //       I.e., track indices such that we can calculate the set of bins for which
    -          //       feature value x splits to the left.
    -          // Unordered features
    -          // 2^(maxFeatureValue - 1) - 1 combinations
    -          splits(featureIndex) = new Array[Split](numSplits)
    -          var splitIndex = 0
    -          while (splitIndex < numSplits) {
    -            val categories: List[Double] =
    -              extractMultiClassCategories(splitIndex + 1, featureArity)
    -            splits(featureIndex)(splitIndex) =
    -              new CategoricalSplit(featureIndex, categories.toArray, featureArity)
    -            splitIndex += 1
    -          }
    -        } else {
    -          // Ordered features
    -          //   Bins correspond to feature values, so we do not need to compute splits or bins
    -          //   beforehand.  Splits are constructed as needed during training.
    -          splits(featureIndex) = new Array[Split](0)
    +    val numFeatures = metadata.numFeatures
    +    val splits = Array.tabulate(numFeatures) {
    +      case i if metadata.isContinuous(i) =>
    +        val split = continuousSplits(i)
    +        metadata.setNumSplits(i, split.length)
    +        split
    +
    +      case i if metadata.isCategorical(i) && metadata.isUnordered(i) =>
    +        // Unordered features
    +        // 2^(maxFeatureValue - 1) - 1 combinations
    +        val featureArity = metadata.featureArity(i)
    +        Array.tabulate[Split](metadata.numSplits(i)) { splitIndex =>
    +          val categories = extractMultiClassCategories(splitIndex + 1, featureArity)
    +          new CategoricalSplit(i, categories.toArray, featureArity)
             }
    -      }
    -      featureIndex += 1
    +
    +      case i if metadata.isCategorical(i) =>
    +        // Ordered features
    +        //   Bins correspond to feature values, so we do not need to compute splits or bins
    +        //   beforehand.  Splits are constructed as needed during training.
    +        Array.empty[Split]
         }
    -    splits
    +    splits.toArray
    --- End diff --
    
    This is already an Array.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198630414
  
    **[Test build #53595 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53595/consoleFull)** for PR 10231 at commit [`8f5077f`](https://github.com/apache/spark/commit/8f5077ff0f1df7906eb0a6043572065e88ad3bf7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198630445
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163437528
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47453/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-197595825
  
    **[Test build #2645 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2645/consoleFull)** for PR 10231 at commit [`c34075b`](https://github.com/apache/spark/commit/c34075be4d5110644a68309be6664901be9974d7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10231#discussion_r56739542
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -842,60 +842,59 @@ private[ml] object RandomForest extends Logging {
             1.0
           }
           logDebug("fraction of data used for calculating quantiles = " + fraction)
    -      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt()).collect()
    +      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt())
         } else {
    -      new Array[LabeledPoint](0)
    +      input.sparkContext.emptyRDD[LabeledPoint]
         }
     
    -    val splits = new Array[Array[Split]](numFeatures)
    -
    -    // Find all splits.
    -    // Iterate over all features.
    -    var featureIndex = 0
    -    while (featureIndex < numFeatures) {
    -      if (metadata.isContinuous(featureIndex)) {
    -        val featureSamples = sampledInput.map(_.features(featureIndex))
    -        val featureSplits = findSplitsForContinuousFeature(featureSamples, metadata, featureIndex)
    +    findSplitsBinsBySorting(sampledInput, metadata, continuousFeatures)
    +  }
     
    -        val numSplits = featureSplits.length
    -        logDebug(s"featureIndex = $featureIndex, numSplits = $numSplits")
    -        splits(featureIndex) = new Array[Split](numSplits)
    +  private def findSplitsBinsBySorting(
    +      input: RDD[LabeledPoint],
    +      metadata: DecisionTreeMetadata,
    +      continuousFeatures: IndexedSeq[Int]): Array[Array[Split]] = {
    +
    +    val continuousSplits = {
    +      // reduce the parallelism for split computations when there are less
    +      // continuous features than input partitions. this prevents tasks from
    +      // being spun up that will definitely do no work.
    +      val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
    +
    +      input
    +        .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
    +        .groupByKey(numPartitions)
    +        .map { case (idx, samples) =>
    +          val thresholds = findSplitsForContinuousFeature(samples.toArray, metadata, idx)
    --- End diff --
    
    Right, I'm wondering if that avoids a copy


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198506882
  
    **[Test build #53555 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53555/consoleFull)** for PR 10231 at commit [`d8a4c77`](https://github.com/apache/spark/commit/d8a4c776b668ceb999a8af53c5e71e09a2ba9de1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10231#discussion_r56743243
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -956,7 +956,7 @@ private[ml] object RandomForest extends Logging {
             valueCounts.map(_._1)
           } else {
             // stride between splits
    -        val stride: Double = featureSamples.length.toDouble / (numSplits + 1)
    +        val stride: Double = featureSamples.size.toDouble / (numSplits + 1)
    --- End diff --
    
    Thanks for the suggestion! The latest commit should take care of it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163430383
  
    This JIRA was actually created as a blocker JIRA for [SPARK-12183](https://issues.apache.org/jira/browse/SPARK-12183) which is for removing the MLlib code entirely and wrapping to spark.ml. So, the code duplication should be very short-lived.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198622874
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53591/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-197624835
  
    Could you also make this change: [https://github.com/apache/spark/pull/8246/files#diff-8ad842a043888473bb2b527e818de04bR645]
    
    Done with pass.  I added a few minor comments which weren't in the spark.mllib PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163423642
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47451/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10231


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198504535
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10231#discussion_r56742175
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -842,60 +842,59 @@ private[ml] object RandomForest extends Logging {
             1.0
           }
           logDebug("fraction of data used for calculating quantiles = " + fraction)
    -      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt()).collect()
    +      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt())
         } else {
    -      new Array[LabeledPoint](0)
    +      input.sparkContext.emptyRDD[LabeledPoint]
         }
     
    -    val splits = new Array[Array[Split]](numFeatures)
    -
    -    // Find all splits.
    -    // Iterate over all features.
    -    var featureIndex = 0
    -    while (featureIndex < numFeatures) {
    -      if (metadata.isContinuous(featureIndex)) {
    -        val featureSamples = sampledInput.map(_.features(featureIndex))
    -        val featureSplits = findSplitsForContinuousFeature(featureSamples, metadata, featureIndex)
    +    findSplitsBinsBySorting(sampledInput, metadata, continuousFeatures)
    +  }
     
    -        val numSplits = featureSplits.length
    -        logDebug(s"featureIndex = $featureIndex, numSplits = $numSplits")
    -        splits(featureIndex) = new Array[Split](numSplits)
    +  private def findSplitsBinsBySorting(
    +      input: RDD[LabeledPoint],
    +      metadata: DecisionTreeMetadata,
    +      continuousFeatures: IndexedSeq[Int]): Array[Array[Split]] = {
    +
    +    val continuousSplits = {
    +      // reduce the parallelism for split computations when there are less
    +      // continuous features than input partitions. this prevents tasks from
    +      // being spun up that will definitely do no work.
    +      val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
    +
    +      input
    +        .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
    +        .groupByKey(numPartitions)
    +        .map { case (idx, samples) =>
    +          val thresholds = findSplitsForContinuousFeature(samples.toArray, metadata, idx)
    --- End diff --
    
    Yes, there doesn't seem to be a need to cast this to array. I updated both the ML and MLlib implementation to reflect this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198501179
  
    **[Test build #53552 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53552/consoleFull)** for PR 10231 at commit [`a847bc9`](https://github.com/apache/spark/commit/a847bc9d9b47f647a9326bec91672b878ceee640).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198519918
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53555/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163998699
  
    **[Test build #47583 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47583/consoleFull)** for PR 10231 at commit [`c34075b`](https://github.com/apache/spark/commit/c34075be4d5110644a68309be6664901be9974d7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198504516
  
    **[Test build #53553 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53553/consoleFull)** for PR 10231 at commit [`af3559a`](https://github.com/apache/spark/commit/af3559a89dd05c41050a68b4f79b6ed0ec5a230e).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-169395890
  
    @NathanHowell Thank you for reviewing!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163987356
  
    **[Test build #47583 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47583/consoleFull)** for PR 10231 at commit [`c34075b`](https://github.com/apache/spark/commit/c34075be4d5110644a68309be6664901be9974d7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198503860
  
    **[Test build #53553 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53553/consoleFull)** for PR 10231 at commit [`af3559a`](https://github.com/apache/spark/commit/af3559a89dd05c41050a68b4f79b6ed0ec5a230e).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163423629
  
    **[Test build #47451 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47451/consoleFull)** for PR 10231 at commit [`8f06b34`](https://github.com/apache/spark/commit/8f06b3415553b78bf3d73259750439e9319d5fdb).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10231#discussion_r56441084
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -842,60 +842,59 @@ private[ml] object RandomForest extends Logging {
             1.0
           }
           logDebug("fraction of data used for calculating quantiles = " + fraction)
    -      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt()).collect()
    +      input.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt())
         } else {
    -      new Array[LabeledPoint](0)
    +      input.sparkContext.emptyRDD[LabeledPoint]
         }
     
    -    val splits = new Array[Array[Split]](numFeatures)
    -
    -    // Find all splits.
    -    // Iterate over all features.
    -    var featureIndex = 0
    -    while (featureIndex < numFeatures) {
    -      if (metadata.isContinuous(featureIndex)) {
    -        val featureSamples = sampledInput.map(_.features(featureIndex))
    -        val featureSplits = findSplitsForContinuousFeature(featureSamples, metadata, featureIndex)
    +    findSplitsBinsBySorting(sampledInput, metadata, continuousFeatures)
    +  }
     
    -        val numSplits = featureSplits.length
    -        logDebug(s"featureIndex = $featureIndex, numSplits = $numSplits")
    -        splits(featureIndex) = new Array[Split](numSplits)
    +  private def findSplitsBinsBySorting(
    +      input: RDD[LabeledPoint],
    +      metadata: DecisionTreeMetadata,
    +      continuousFeatures: IndexedSeq[Int]): Array[Array[Split]] = {
    +
    +    val continuousSplits = {
    +      // reduce the parallelism for split computations when there are less
    +      // continuous features than input partitions. this prevents tasks from
    +      // being spun up that will definitely do no work.
    +      val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
    +
    +      input
    +        .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
    +        .groupByKey(numPartitions)
    +        .map { case (idx, samples) =>
    +          val thresholds = findSplitsForContinuousFeature(samples.toArray, metadata, idx)
    --- End diff --
    
    Does this make an unnecessary copy of samples?  We could probably keep it as an Iterator.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198622658
  
    **[Test build #53591 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53591/consoleFull)** for PR 10231 at commit [`c9bec20`](https://github.com/apache/spark/commit/c9bec2050e273c205456d50028c13494407b50ab).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163427753
  
    At first glance this seems to share a lot of code with the original implementation in MLLib (they both even work with RDDs of LabeledPoints) - maybe we could move much of this to a common util class or similar?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by holdenk <gi...@git.apache.org>.

Github user holdenk commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163433358
  
    Ah great - if were killing the old code soon then no worries on the temporary duplication.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198626508
  
    **[Test build #53595 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53595/consoleFull)** for PR 10231 at commit [`8f5077f`](https://github.com/apache/spark/commit/8f5077ff0f1df7906eb0a6043572065e88ad3bf7).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-197597688
  
    I can set something up. Do you have a specific dataset size in mind or even a specific dataset?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198630446
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53595/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163437525
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-163423162
  
    **[Test build #47451 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47451/consoleFull)** for PR 10231 at commit [`8f06b34`](https://github.com/apache/spark/commit/8f06b3415553b78bf3d73259750439e9319d5fdb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-168845446
  
    @NathanHowell do you think you'll have any time to take a look at this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-12182][ML] Distributed binning for tree...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10231#issuecomment-198504543
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53553/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org