Posted to issues@spark.apache.org by "Aman Omer (Jira)" <ji...@apache.org> on 2019/11/09 16:53:00 UTC
[jira] [Commented] (SPARK-29816) Missing persist in
mllib.evaluation.BinaryClassificationMetrics.recallByThreshold()
[ https://issues.apache.org/jira/browse/SPARK-29816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970865#comment-16970865 ]
Aman Omer commented on SPARK-29816:
-----------------------------------
Thanks [~spark_cachecheck] for reporting. I will raise a PR for this.
> Missing persist in mllib.evaluation.BinaryClassificationMetrics.recallByThreshold()
> -----------------------------------------------------------------------------------
>
> Key: SPARK-29816
> URL: https://issues.apache.org/jira/browse/SPARK-29816
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.4.3
> Reporter: Dong Wang
> Priority: Major
>
> The RDD returned by scoreAndLabels.combineByKey is evaluated twice, once by the job that sortByKey triggers and once by count(), so it needs to be persisted.
> {code:scala}
> val counts = scoreAndLabels.combineByKey(
>   createCombiner = (label: Double) => new BinaryLabelCounter(0L, 0L) += label,
>   mergeValue = (c: BinaryLabelCounter, label: Double) => c += label,
>   mergeCombiners = (c1: BinaryLabelCounter, c2: BinaryLabelCounter) => c1 += c2
> ).sortByKey(ascending = false) // first use
> val binnedCounts =
>   // Only down-sample if bins is > 0
>   if (numBins == 0) {
>     // Use original directly
>     counts
>   } else {
>     val countsSize = counts.count() // second use
> {code}
> This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.
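
For illustration only, one way the reuse described in the issue could be addressed is sketched below. This is not the patch that was eventually proposed for SPARK-29816. It assumes a local SparkContext, a tiny made-up data set in place of scoreAndLabels, and a (positives, negatives) tuple as a simplified stand-in for BinaryLabelCounter; the object name PersistSketch and the storage level are also assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch: persist the combineByKey result before it is reused.
object PersistSketch {
  def run(): Long = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-sketch")
    val sc = new SparkContext(conf)
    try {
      // Made-up (score, label) pairs standing in for scoreAndLabels.
      val scoreAndLabels = sc.parallelize(
        Seq((0.9, 1.0), (0.9, 0.0), (0.4, 0.0), (0.1, 1.0)))

      // Simplified stand-in for BinaryLabelCounter: (numPositives, numNegatives).
      val combined = scoreAndLabels.combineByKey(
        (label: Double) => if (label > 0.5) (1L, 0L) else (0L, 1L),
        (c: (Long, Long), label: Double) =>
          if (label > 0.5) (c._1 + 1L, c._2) else (c._1, c._2 + 1L),
        (c1: (Long, Long), c2: (Long, Long)) => (c1._1 + c2._1, c1._2 + c2._2)
      ).persist(StorageLevel.MEMORY_AND_DISK) // cached before the first use

      val counts = combined.sortByKey(ascending = false) // first use
      val countsSize = counts.count()                    // second use

      combined.unpersist() // release the cached blocks once both uses are done
      countsSize
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(s"distinct scores: ${run()}")
}
```

With the persist call in place, the second job can read the combined partitions from the cache instead of re-running the reduce-side aggregation. Whether MEMORY_AND_DISK is the right storage level, and where unpersist() belongs in the real method, would depend on how the result is used downstream.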
--
This message was sent by Atlassian Jira
(v8.3.4#803005)