Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/02/24 10:25:56 UTC

[GitHub] [spark] zhengruifeng opened a new pull request #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

zhengruifeng opened a new pull request #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682
 
 
   ### What changes were proposed in this pull request?
   1. Avoid `Iterator.grouped(size: Int)`, which needs to maintain an `ArrayBuffer` of `size` elements.
   2. Keep the number of partitions in the curve computation.
   
   ### Why are the changes needed?
   1. `BinaryClassificationMetrics` tends to fail with an OOM when `grouping=count/numBins` is too large, because `Iterator.grouped(size: Int)` needs to maintain an `ArrayBuffer` of `size` items. However, `BinaryClassificationMetrics` does not need to maintain such a big array.
   2. Make the sizes of partitions more even.
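
   The streaming alternative to `Iterator.grouped` can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `aggregateChunks` and its parameters are hypothetical names, and a plain Scala iterator stands in for a partition iterator. Each chunk is folded into a single running aggregate, so no buffer of `chunkSize` elements is kept and the chunk size can be a `Long`.

   ```scala
   // Hypothetical helper (not from the PR): fold every run of `chunkSize`
   // consecutive elements into one aggregate without buffering the run.
   def aggregateChunks[A, B](iter: Iterator[A], chunkSize: Long)(zero: => B)(op: (B, A) => B): Iterator[B] = {
     require(chunkSize >= 1, "chunkSize must be positive")
     new Iterator[B] {
       def hasNext: Boolean = iter.hasNext
       def next(): B = {
         if (!iter.hasNext) throw new NoSuchElementException("next on empty iterator")
         var acc = zero   // fresh aggregate for this chunk
         var seen = 0L    // Long counter, so chunkSize may exceed Int.MaxValue
         while (seen < chunkSize && iter.hasNext) {
           acc = op(acc, iter.next())
           seen += 1
         }
         acc
       }
     }
   }

   // Sum chunks of 4 elements over 0..9; the last chunk holds only 2 elements.
   val sums = aggregateChunks(Iterator.range(0, 10), 4L)(0L)((acc, x) => acc + x).toList
   ```

   In Spark, a helper of this shape would presumably be applied to each partition through `mapPartitions`; that wiring is an assumption and is not shown here.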
   
   
   ### Does this PR introduce any user-facing change?
   No
   
   
   ### How was this patch tested?
   Existing test suites.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682#issuecomment-590256823
 
 
   **[Test build #118863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118863/testReport)** for PR 27682 at commit [`06bce05`](https://github.com/apache/spark/commit/06bce0533e2cc9c621086b2c5b5e768a091fb494).



[GitHub] [spark] AmplabJenkins commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682#issuecomment-590254281
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682#issuecomment-590282796
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118863/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682#issuecomment-590254281
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682#issuecomment-590282796
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118863/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682#issuecomment-590254289
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23612/
   Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682#issuecomment-590282428
 
 
   **[Test build #118863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118863/testReport)** for PR 27682 at commit [`06bce05`](https://github.com/apache/spark/commit/06bce0533e2cc9c621086b2c5b5e768a091fb494).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682#issuecomment-590282786
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] zhengruifeng closed pull request #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

Posted by GitBox <gi...@apache.org>.
zhengruifeng closed pull request #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682
 
 
   



[GitHub] [spark] zhengruifeng commented on a change in pull request #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682#discussion_r383190996
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ##########
 @@ -182,28 +197,40 @@ class BinaryClassificationMetrics @Since("3.0.0") (
         val countsSize = counts.count()
         // Group the iterator into chunks of about countsSize / numBins points,
         // so that the resulting number of bins is about numBins
-        var grouping = countsSize / numBins
+        val grouping = countsSize / numBins
         if (grouping < 2) {
           // numBins was more than half of the size; no real point in down-sampling to bins
           logInfo(s"Curve is too small ($countsSize) for $numBins bins to be useful")
           counts
         } else {
-          if (grouping >= Int.MaxValue) {
 
 Review comment:
   `Iterator.grouped(size: Int)` does not support `size` larger than `Int.MaxValue`



[GitHub] [spark] AmplabJenkins removed a comment on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682#issuecomment-590282786
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] zhengruifeng commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682#issuecomment-590254922
 
 
   testCode:
   ```scala
   import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
   import scala.util.Random
   
   // 40M (score, label) pairs in 4 partitions, with one seeded RNG per partition
   val scoreAndLabels = sc.range(0, 40000000L, 1, 4).mapPartitionsWithIndex { case (pid, iter) =>
     val rng = new Random(pid)
     iter.map { _ => (rng.nextDouble, rng.nextInt(2).toDouble) }
   }
   
   scoreAndLabels.count
   
   val metrics = new BinaryClassificationMetrics(scoreAndLabels, 1)
   
   // Time the AUC computation, in milliseconds
   val start = System.currentTimeMillis
   val auc = metrics.areaUnderROC
   val end = System.currentTimeMillis
   end - start
   
   ```
   
   result:
   
   
   |Test| This PR (--driver-memory=1G) | This PR (--driver-memory=32G) | Master (--driver-memory=1G) | Master (--driver-memory=32G) |
   |------|----------|------------|----------|------------|
   |Duration (ms)|343091|173030|OOM|183258|
   
   
   



[GitHub] [spark] zhengruifeng commented on a change in pull request #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682#discussion_r383190996
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ##########
 @@ -182,28 +197,40 @@ class BinaryClassificationMetrics @Since("3.0.0") (
         val countsSize = counts.count()
         // Group the iterator into chunks of about countsSize / numBins points,
         // so that the resulting number of bins is about numBins
-        var grouping = countsSize / numBins
+        val grouping = countsSize / numBins
         if (grouping < 2) {
           // numBins was more than half of the size; no real point in down-sampling to bins
           logInfo(s"Curve is too small ($countsSize) for $numBins bins to be useful")
           counts
         } else {
-          if (grouping >= Int.MaxValue) {
 
 Review comment:
   `Iterator.grouped(size: Int)` does not support `grouping` larger than `Int.MaxValue`
   After this change, `BinaryClassificationMetrics` can deal with `grouping` larger than `Int.MaxValue`
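
   The `Int` limit can be illustrated with plain Scala collections. This is a small sketch with illustrative values only, not code from `BinaryClassificationMetrics`: because `grouped` takes an `Int`, a `Long` `grouping` has to be narrowed before the call, and a value just above `Int.MaxValue` wraps around to a negative number.

   ```scala
   // Illustrative values only: a grouping one past the largest Int.
   val grouping: Long = Int.MaxValue.toLong + 1L

   // Narrowing a Long to Int keeps the low 32 bits, so this wraps to Int.MinValue.
   val narrowed: Int = grouping.toInt

   // For comparison, grouped with a small Int size materializes each group as a Seq.
   val groupSizes = Iterator.range(0, 10).grouped(4).map(_.size).toList
   ```

   Counting elements with a `Long` instead, as the description of this PR suggests, avoids both the narrowing and the per-group buffer.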



[GitHub] [spark] zhengruifeng commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682#issuecomment-592415739
 
 
   Merged to master



[GitHub] [spark] AmplabJenkins commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682#issuecomment-590254289
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23612/
   Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization 
URL: https://github.com/apache/spark/pull/27682#issuecomment-590256823
 
 
   **[Test build #118863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118863/testReport)** for PR 27682 at commit [`06bce05`](https://github.com/apache/spark/commit/06bce0533e2cc9c621086b2c5b5e768a091fb494).
