You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by lucio-yz <gi...@git.apache.org> on 2018/02/01 11:18:54 UTC

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

GitHub user lucio-yz opened a pull request:

    https://github.com/apache/spark/pull/20472

    [SPARK-22751][ML]Improve ML RandomForest shuffle performance

    ## What changes were proposed in this pull request?
    
    As I mentioned in [SPARK-22751](https://issues.apache.org/jira/browse/SPARK-22751?jql=project%20%3D%20SPARK%20AND%20component%20%3D%20ML%20AND%20text%20~%20randomforest), there is a shuffle performance problem in ML Randomforest when train a RF in high dimensional data. 
    
    The reason is that, in org.apache.spark.tree.impl.RandomForest, the function findSplitsBySorting will actually flatmap a sparse vector into a dense vector, then in groupByKey there will be a huge shuffle write size.
    
    To avoid this, we can add a filter after flatmap, to filter out zero value. And in function findSplitsForContinuousFeature, we can infer the number of zero value by pass a parameter numInput to function findSplitsForContinuousFeature. numInput is the number of samples.
    
    In addition, if a feature only contains zero value, continuousSplits will not has the key of feature id. So I add a check when using continuousSplits.
    
    ## How was this patch tested?
    Ran model locally using spark-submit.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lucio-yz/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20472.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20472
    
----
commit 50cb173dd34dc353c243b97f2686a8c545a03909
Author: lucio <57...@...>
Date:   2018-02-01T09:47:52Z

    fix mllib randomforest shuffle issue

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #87823 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87823/testReport)** for PR 20472 at commit [`d634bbd`](https://github.com/apache/spark/commit/d634bbda06af43532240ba498cda461e374e4037).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87834/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r165341639
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -917,11 +916,15 @@ private[spark] object RandomForest extends Logging {
           // being spun up that will definitely do no work.
           val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
     
    +      val numInput = input.count()
    +      val bcNumInput = input.sparkContext.broadcast(numInput)
    +
           input
             .flatMap(point => continuousFeatures.map(idx => (idx, point.features(idx))))
    --- End diff --
    
    instead of adding the filter method there, here you can avoid the generation of the record itself if `point.features(idx)` is 0.0


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #87864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87864/testReport)** for PR 20472 at commit [`656abef`](https://github.com/apache/spark/commit/656abef989717ab2c66e4e3aa6f9b1f76a0f41a8).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87864/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #87865 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87865/testReport)** for PR 20472 at commit [`fea3aad`](https://github.com/apache/spark/commit/fea3aad46d1094cf67c2770edab1158b3dece225).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r166250001
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1001,11 +996,19 @@ private[spark] object RandomForest extends Logging {
         } else {
           val numSplits = metadata.numSplits(featureIndex)
     
    -      // get count for each distinct value
    -      val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    +      // get count for each distinct value except zero value
    +      val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
             case ((m, cnt), x) =>
               (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
           }
    +
    +      // Calculate the number of samples for finding splits
    +      val requiredSamples: Long = (samplesFractionForFindSplits(metadata) * metadata.numExamples.toDouble).toLong
    --- End diff --
    
    I think here it could be:
    ```
    val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #87834 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87834/testReport)** for PR 20472 at commit [`827061c`](https://github.com/apache/spark/commit/827061ce8c87f483a196a1f5355136978b97ee46).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r169386551
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
         } else {
           val numSplits = metadata.numSplits(featureIndex)
     
    -      // get count for each distinct value
    -      val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    +      // get count for each distinct value except zero value
    +      val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
             case ((m, cnt), x) =>
               (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
           }
    +
    +      // Calculate the number of samples for finding splits
    +      val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
    +
    +      // add zero value count and get complete statistics
    +      val valueCountMap: Map[Double, Int] = partValueCountMap + (0.0 -> (numSamples - partNumSamples))
    --- End diff --
    
    There can be negative values right?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r165671106
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1001,11 +1002,22 @@ private[spark] object RandomForest extends Logging {
         } else {
           val numSplits = metadata.numSplits(featureIndex)
     
    -      // get count for each distinct value
    -      val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    +      // get count for each distinct value except zero value
    +      val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
             case ((m, cnt), x) =>
               (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
           }
    +
    +      // Calculate the number of samples for finding splits
    +      var requiredSamples: Long = math.max(metadata.maxBins * metadata.maxBins, 10000)
    --- End diff --
    
    This logic is somewhat copied from another method. Can we create new method? Or pass the result through the methods?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r169391525
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
         } else {
           val numSplits = metadata.numSplits(featureIndex)
     
    -      // get count for each distinct value
    -      val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    +      // get count for each distinct value except zero value
    +      val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
             case ((m, cnt), x) =>
               (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
           }
    +
    +      // Calculate the number of samples for finding splits
    +      val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
    --- End diff --
    
    The main problem I see with this is that the sampling we do for split finding is _approximate_. Just as an example: say you have 1000 samples, and you take 20% for split finding. Your actual sampled RDD has 220 samples in it, and 210 of those are non-zero. So, `partNumSamples = 210`, `numSamples = 200` and you wind up with `numSamples - partNumSamples = -10` zero values. This is not something you expect to happen often (since we care about the highly sparse case), but something that we need to consider. We could just require the subtraction to be non-negative (and live with a bit of approximation), or you could call `count` on the sampled RDD but I don't think it's worth it. Thoughts?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Merged to master


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by lucio-yz <gi...@git.apache.org>.

Github user lucio-yz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r171182634
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -931,7 +925,8 @@ private[spark] object RandomForest extends Logging {
         val numFeatures = metadata.numFeatures
         val splits: Array[Array[Split]] = Array.tabulate(numFeatures) {
           case i if metadata.isContinuous(i) =>
    -        val split = continuousSplits(i)
    +        // some features may only contains zero, so continuousSplits will not have a record
    +        val split = if (continuousSplits.contains(i)) continuousSplits(i) else Array.empty[Split]
    --- End diff --
    
    have fixed it


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #87839 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87839/testReport)** for PR 20472 at commit [`b5a4741`](https://github.com/apache/spark/commit/b5a47411867c9bc532e9b7e680cc131b232937d0).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r166773606
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
         } else {
           val numSplits = metadata.numSplits(featureIndex)
     
    -      // get count for each distinct value
    -      val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    +      // get count for each distinct value except zero value
    +      val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    --- End diff --
    
    This bit of (existing) code seems obtuse to me. What about...
    
    ```
    val partNumSamples = featureSamples.size
    val partValueCountMap = scala.collection.mutable.Map[Double,Int]()
    featureSamples.foreach { x =>
      partValueCountMap(x) = partValueCountMap.getOrElse(x, 0) + 1
    }
    ```
      


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r165341133
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -917,11 +916,15 @@ private[spark] object RandomForest extends Logging {
           // being spun up that will definitely do no work.
           val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
     
    +      val numInput = input.count()
    +      val bcNumInput = input.sparkContext.broadcast(numInput)
    --- End diff --
    
    this is not needed, you can use directly `numInput`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by lucio-yz <gi...@git.apache.org>.

Github user lucio-yz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r165595281
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1002,10 +1008,14 @@ private[spark] object RandomForest extends Logging {
           val numSplits = metadata.numSplits(featureIndex)
     
           // get count for each distinct value
    -      val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    +      var (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    --- End diff --
    
    problem has been solved


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by lucio-yz <gi...@git.apache.org>.

Github user lucio-yz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r171182291
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -931,7 +925,8 @@ private[spark] object RandomForest extends Logging {
         val numFeatures = metadata.numFeatures
         val splits: Array[Array[Split]] = Array.tabulate(numFeatures) {
           case i if metadata.isContinuous(i) =>
    -        val split = continuousSplits(i)
    +        // some features may only contains zero, so continuousSplits will not have a record
    --- End diff --
    
    got it


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by lucio-yz <gi...@git.apache.org>.

Github user lucio-yz commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    @srowen Any other problems?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #87865 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87865/testReport)** for PR 20472 at commit [`fea3aad`](https://github.com/apache/spark/commit/fea3aad46d1094cf67c2770edab1158b3dece225).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by lucio-yz <gi...@git.apache.org>.

Github user lucio-yz commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    previous problems have been solved


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #87820 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87820/testReport)** for PR 20472 at commit [`51900da`](https://github.com/apache/spark/commit/51900da3266a9025ace567e3cbd5bf2b26051651).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by lucio-yz <gi...@git.apache.org>.

Github user lucio-yz commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    I tested on 2 datasets:
    
    1.  _rcv1.binary_, which has 47,236 dimensions. Before improvement, the shuffle write size in _findSplitsBySorting_ is 1GB. After improvement, the shuffle size is 7.7MB.
    
    2. _news20.binary_, which has 1,355,191 dimensions. Before improvement, the shuffle write size in _findSplitsBySorting_ is 51 GB. After improvement, the shuffle size is 24.1 MB.
    
    ps: I tested on a cluster which has 10 nodes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #4101 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4101/testReport)** for PR 20472 at commit [`1716acc`](https://github.com/apache/spark/commit/1716accb70b86123f1bbab00ff7af962199c9cd9).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by sethah <gi...@git.apache.org>.

Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    @srowen Can you trigger the tests?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87824/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    @srowen now the PR is good IMHO, do you have other comments? Or do you think we can trigger a build?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r166771033
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
         } else {
           val numSplits = metadata.numSplits(featureIndex)
     
    -      // get count for each distinct value
    -      val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    +      // get count for each distinct value except zero value
    +      val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
             case ((m, cnt), x) =>
               (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
           }
    +
    +      // Calculate the number of samples for finding splits
    +      val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
    --- End diff --
    
    Just a style nit but you can drop the types on these two new vars. I think we'd consider that implicit.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20472


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by lucio-yz <gi...@git.apache.org>.

Github user lucio-yz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r171184500
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
         } else {
           val numSplits = metadata.numSplits(featureIndex)
     
    -      // get count for each distinct value
    -      val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    +      // get count for each distinct value except zero value
    +      val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    --- End diff --
    
    have fixed it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r166249547
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1147,4 +1150,20 @@ private[spark] object RandomForest extends Logging {
           3 * totalBins
         }
       }
    +
    +  /**
    +    * Calculate the subsample fraction for finding splits
    --- End diff --
    
    bad indentation, this should fail scalastyle


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Jenkins, add to whitelist


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #4101 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4101/testReport)** for PR 20472 at commit [`1716acc`](https://github.com/apache/spark/commit/1716accb70b86123f1bbab00ff7af962199c9cd9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r169844579
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
         } else {
           val numSplits = metadata.numSplits(featureIndex)
     
    -      // get count for each distinct value
    -      val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    +      // get count for each distinct value except zero value
    +      val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
             case ((m, cnt), x) =>
               (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
           }
    +
    +      // Calculate the number of samples for finding splits
    +      val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
    --- End diff --
    
    Inaccurate zero counts can cause the change of splits, isn't it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r166771387
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
         } else {
           val numSplits = metadata.numSplits(featureIndex)
     
    -      // get count for each distinct value
    -      val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    +      // get count for each distinct value except zero value
    +      val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
             case ((m, cnt), x) =>
               (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
           }
    +
    +      // Calculate the number of samples for finding splits
    +      val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
    +
    +      // add zero value count and get complete statistics
    +      val valueCountMap: Map[Double, Int] = partValueCountMap + (0.0 -> (numSamples - partNumSamples))
    --- End diff --
    
    This also probably doesn't matter but won't the new (0.0, ...) element always come first? you could append it below after sorting the rest rather than add it earlier to the Map.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #87839 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87839/testReport)** for PR 20472 at commit [`b5a4741`](https://github.com/apache/spark/commit/b5a47411867c9bc532e9b7e680cc131b232937d0).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87823/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #87557 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87557/testReport)** for PR 20472 at commit [`1716acc`](https://github.com/apache/spark/commit/1716accb70b86123f1bbab00ff7af962199c9cd9).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r165378186
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -931,7 +934,8 @@ private[spark] object RandomForest extends Logging {
         val numFeatures = metadata.numFeatures
         val splits: Array[Array[Split]] = Array.tabulate(numFeatures) {
           case i if metadata.isContinuous(i) =>
    -        val split = continuousSplits(i)
    +        // some features may only contains zero, so continuousSplits will not have a record
    +        val split = if ( continuousSplits.contains(i) ) continuousSplits(i) else Array.empty[Split]
    --- End diff --
    
    Remove space around parens


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r166770380
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -931,7 +925,8 @@ private[spark] object RandomForest extends Logging {
         val numFeatures = metadata.numFeatures
         val splits: Array[Array[Split]] = Array.tabulate(numFeatures) {
           case i if metadata.isContinuous(i) =>
    -        val split = continuousSplits(i)
    +        // some features may only contains zero, so continuousSplits will not have a record
    +        val split = if (continuousSplits.contains(i)) continuousSplits(i) else Array.empty[Split]
    --- End diff --
    
    Just `continuousSplits.getOrElse(i, Array.empty[Split])`? Not that it matters, but also avoids searching the map twice. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by kiszk <gi...@git.apache.org>.

Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r167393891
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -931,7 +925,8 @@ private[spark] object RandomForest extends Logging {
         val numFeatures = metadata.numFeatures
         val splits: Array[Array[Split]] = Array.tabulate(numFeatures) {
           case i if metadata.isContinuous(i) =>
    -        val split = continuousSplits(i)
    +        // some features may only contains zero, so continuousSplits will not have a record
    --- End diff --
    
    nit: 'only contains zero` -> `contain only zero`?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #87864 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87864/testReport)** for PR 20472 at commit [`656abef`](https://github.com/apache/spark/commit/656abef989717ab2c66e4e3aa6f9b1f76a0f41a8).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by lucio-yz <gi...@git.apache.org>.

Github user lucio-yz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r165872418
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1001,11 +1002,22 @@ private[spark] object RandomForest extends Logging {
         } else {
           val numSplits = metadata.numSplits(featureIndex)
     
    -      // get count for each distinct value
    -      val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    +      // get count for each distinct value except zero value
    +      val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
             case ((m, cnt), x) =>
               (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
           }
    +
    +      // Calculate the number of samples for finding splits
    +      var requiredSamples: Long = math.max(metadata.maxBins * metadata.maxBins, 10000)
    --- End diff --
    
    I defined a private method _samplesFractionForFindSplits_ in RandomForest.scala


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87820/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87839/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r165378109
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1002,10 +1008,14 @@ private[spark] object RandomForest extends Logging {
           val numSplits = metadata.numSplits(featureIndex)
     
           // get count for each distinct value
    -      val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    +      var (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    --- End diff --
    
    I'd probably avoid a `var` by using a new variable for `numSamples` here.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #87834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87834/testReport)** for PR 20472 at commit [`827061c`](https://github.com/apache/spark/commit/827061ce8c87f483a196a1f5355136978b97ee46).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r165340988
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -21,7 +21,6 @@ import java.io.IOException
     
     import scala.collection.mutable
     import scala.util.Random
    -
    --- End diff --
    
    this empty line must be kept


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by lucio-yz <gi...@git.apache.org>.

Github user lucio-yz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r171160692
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1001,11 +996,18 @@ private[spark] object RandomForest extends Logging {
         } else {
           val numSplits = metadata.numSplits(featureIndex)
     
    -      // get count for each distinct value
    -      val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    +      // get count for each distinct value except zero value
    +      val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
             case ((m, cnt), x) =>
               (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
           }
    +
    +      // Calculate the number of samples for finding splits
    +      val numSamples: Int = (samplesFractionForFindSplits(metadata) * metadata.numExamples).toInt
    --- End diff --
    
    I have seen the note of function _sample_, and _sample_ does not guarantee to provide exactly the fraction of the count of the given RDD. It seems that requiring _numSamples - partNumSamples_ to be non-negative is a more efficient choice than trigger a _count_. The degree of approximation depends upon the degree approximation of _sample_. And it's sure that the splits will be inaccurate.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #87824 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87824/testReport)** for PR 20472 at commit [`bd7d3b2`](https://github.com/apache/spark/commit/bd7d3b264c88412aa1001e30b0301d54dad3f159).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #87824 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87824/testReport)** for PR 20472 at commit [`bd7d3b2`](https://github.com/apache/spark/commit/bd7d3b264c88412aa1001e30b0301d54dad3f159).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #87823 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87823/testReport)** for PR 20472 at commit [`d634bbd`](https://github.com/apache/spark/commit/d634bbda06af43532240ba498cda461e374e4037).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87557/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #87557 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87557/testReport)** for PR 20472 at commit [`1716acc`](https://github.com/apache/spark/commit/1716accb70b86123f1bbab00ff7af962199c9cd9).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87865/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by lucio-yz <gi...@git.apache.org>.

Github user lucio-yz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r166501185
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1001,11 +996,19 @@ private[spark] object RandomForest extends Logging {
         } else {
           val numSplits = metadata.numSplits(featureIndex)
     
    -      // get count for each distinct value
    -      val (valueCountMap, numSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
    +      // get count for each distinct value except zero value
    +      val (partValueCountMap, partNumSamples) = featureSamples.foldLeft((Map.empty[Double, Int], 0)) {
             case ((m, cnt), x) =>
               (m + ((x, m.getOrElse(x, 0) + 1)), cnt + 1)
           }
    +
    +      // Calculate the number of samples for finding splits
    +      val requiredSamples: Long = (samplesFractionForFindSplits(metadata) * metadata.numExamples.toDouble).toLong
    --- End diff --
    
    Thanks for your comments, and im sorry for my carelessness. I have fixed them all.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle perform...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20472
  
    **[Test build #87820 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87820/testReport)** for PR 20472 at commit [`51900da`](https://github.com/apache/spark/commit/51900da3266a9025ace567e3cbd5bf2b26051651).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #20472: [SPARK-22751][ML]Improve ML RandomForest shuffle ...

Posted by mgaido91 <gi...@git.apache.org>.

Github user mgaido91 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20472#discussion_r165343004
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -917,11 +916,15 @@ private[spark] object RandomForest extends Logging {
           // being spun up that will definitely do no work.
           val numPartitions = math.min(continuousFeatures.length, input.partitions.length)
     
    +      val numInput = input.count()
    --- End diff --
    
    we can get this from the `metadata.numExamples * fraction` operation in the calling method in order to avoid another job to perform the count


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org