Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/02/24 10:25:56 UTC
[GitHub] [spark] zhengruifeng opened a new pull request #27682:
[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
zhengruifeng opened a new pull request #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682
### What changes were proposed in this pull request?
1. Avoid `Iterator.grouped(size: Int)`, which needs to maintain an array buffer of `size` elements.
2. Keep the number of partitions in curve computation.
### Why are the changes needed?
1. `BinaryClassificationMetrics` tends to fail with an OOM when `grouping = count / numBins` is too large, because `Iterator.grouped(size: Int)` needs to maintain an array buffer of `size` items. However, `BinaryClassificationMetrics` does not need to maintain such a big array.
2. Make the sizes of partitions more even.
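
The first point can be illustrated with a minimal sketch in plain Scala, with no Spark dependency. It is not the code from this PR: `downsampleStreaming` and `downsampleGrouped` are hypothetical helpers, and bare `(score, count)` pairs stand in for the per-score label counters. The sketch assumes the input is already sorted by score, and it only contrasts a chunk-buffering approach with a streaming one that keeps a running sum.

```scala
// Hypothetical baseline: `grouped` buffers every chunk of `grouping` elements
// in memory before handing it over, and it only accepts an Int size.
def downsampleGrouped(
    iter: Iterator[(Double, Long)],
    grouping: Int): Iterator[(Double, Long)] =
  iter.grouped(grouping).map { chunk =>
    (chunk.head._1, chunk.map(_._2).sum)
  }

// Hypothetical streaming variant: keeps only the first score and a running
// sum per group, so memory use does not grow with `grouping`, and the group
// size can be a Long.
def downsampleStreaming(
    iter: Iterator[(Double, Long)],
    grouping: Long): Iterator[(Double, Long)] = {
  require(grouping >= 1, "grouping must be positive")
  new Iterator[(Double, Long)] {
    override def hasNext: Boolean = iter.hasNext

    override def next(): (Double, Long) = {
      // The first element of the group supplies the representative score.
      val (firstScore, firstCount) = iter.next()
      var sum = firstCount
      var seen = 1L
      // Fold the rest of the group into the running sum, one element at a time.
      while (seen < grouping && iter.hasNext) {
        sum += iter.next()._2
        seen += 1
      }
      (firstScore, sum)
    }
  }
}
```

Both helpers should produce the same groups for any group size that fits in an `Int`; the streaming one simply never holds more than one element of a group at a time.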
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing test suites.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #27682:
[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682#issuecomment-590256823
**[Test build #118863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118863/testReport)** for PR 27682 at commit [`06bce05`](https://github.com/apache/spark/commit/06bce0533e2cc9c621086b2c5b5e768a091fb494).
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27682:
[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682#issuecomment-590254281
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27682:
[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682#issuecomment-590282796
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118863/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27682:
[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682#issuecomment-590254281
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27682:
[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682#issuecomment-590282796
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118863/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27682:
[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682#issuecomment-590254289
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23612/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27682: [SPARK-30938][ML][MLLIB]
BinaryClassificationMetrics optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682#issuecomment-590282428
**[Test build #118863 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118863/testReport)** for PR 27682 at commit [`06bce05`](https://github.com/apache/spark/commit/06bce0533e2cc9c621086b2c5b5e768a091fb494).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27682:
[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682#issuecomment-590282786
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] zhengruifeng closed pull request #27682:
[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
Posted by GitBox <gi...@apache.org>.
zhengruifeng closed pull request #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682
----------------------------------------------------------------
[GitHub] [spark] zhengruifeng commented on a change in pull request #27682:
[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682#discussion_r383190996
##########
File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
##########
@@ -182,28 +197,40 @@ class BinaryClassificationMetrics @Since("3.0.0") (
val countsSize = counts.count()
// Group the iterator into chunks of about countsSize / numBins points,
// so that the resulting number of bins is about numBins
- var grouping = countsSize / numBins
+ val grouping = countsSize / numBins
if (grouping < 2) {
// numBins was more than half of the size; no real point in down-sampling to bins
logInfo(s"Curve is too small ($countsSize) for $numBins bins to be useful")
counts
} else {
- if (grouping >= Int.MaxValue) {
Review comment:
`Iterator.grouped(size: Int)` does not support `size` larger than `Int.MaxValue`
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins removed a comment on issue #27682:
[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682#issuecomment-590282786
Merged build finished. Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] zhengruifeng commented on issue #27682:
[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682#issuecomment-590254922
Test code:
```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import scala.util.Random

// 40M random (score, label) pairs in 4 partitions, seeded per partition.
val scoreAndLabels = sc.range(0, 40000000L, 1, 4).mapPartitionsWithIndex { case (pid, iter) =>
  val rng = new Random(pid)
  iter.map { _ => (rng.nextDouble, rng.nextInt(2).toDouble) }
}
scoreAndLabels.count
// numBins = 1 makes grouping = count / numBins large.
val metrics = new BinaryClassificationMetrics(scoreAndLabels, 1)
// Wall-clock time of the AUC computation, in milliseconds.
val start = System.currentTimeMillis; val auc = metrics.areaUnderROC; val end = System.currentTimeMillis; end - start
```
Result (durations in milliseconds):
| Test | This PR (--driver-memory=1G) | This PR (--driver-memory=32G) | Master (--driver-memory=1G) | Master (--driver-memory=32G) |
|------|----------|------------|----------|------------|
| Duration (ms) | 343091 | 173030 | OOM | 183258 |
----------------------------------------------------------------
[GitHub] [spark] zhengruifeng commented on a change in pull request #27682:
[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on a change in pull request #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682#discussion_r383190996
##########
File path: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
##########
@@ -182,28 +197,40 @@ class BinaryClassificationMetrics @Since("3.0.0") (
val countsSize = counts.count()
// Group the iterator into chunks of about countsSize / numBins points,
// so that the resulting number of bins is about numBins
- var grouping = countsSize / numBins
+ val grouping = countsSize / numBins
if (grouping < 2) {
// numBins was more than half of the size; no real point in down-sampling to bins
logInfo(s"Curve is too small ($countsSize) for $numBins bins to be useful")
counts
} else {
- if (grouping >= Int.MaxValue) {
Review comment:
`Iterator.grouped(size: Int)` does not support a `grouping` larger than `Int.MaxValue`.
After this change, `BinaryClassificationMetrics` can handle a `grouping` larger than `Int.MaxValue`.
----------------------------------------------------------------
[GitHub] [spark] zhengruifeng commented on issue #27682:
[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
Posted by GitBox <gi...@apache.org>.
zhengruifeng commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682#issuecomment-592415739
Merged to master
----------------------------------------------------------------
[GitHub] [spark] AmplabJenkins commented on issue #27682:
[SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682#issuecomment-590254289
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23612/
Test PASSed.
----------------------------------------------------------------
[GitHub] [spark] SparkQA commented on issue #27682: [SPARK-30938][ML][MLLIB]
BinaryClassificationMetrics optimization
Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27682: [SPARK-30938][ML][MLLIB] BinaryClassificationMetrics optimization
URL: https://github.com/apache/spark/pull/27682#issuecomment-590256823
**[Test build #118863 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118863/testReport)** for PR 27682 at commit [`06bce05`](https://github.com/apache/spark/commit/06bce0533e2cc9c621086b2c5b5e768a091fb494).