Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/04 02:46:41 UTC

[GitHub] [spark] shishaochen edited a comment on issue #24470: [SPARK-27577][MLlib] Correct thresholds downsampled in BinaryClassificationMetrics

URL: https://github.com/apache/spark/pull/24470#issuecomment-489287434
 
 
   @srowen Yes, both are approximations, but choosing the last element in each chunk as the threshold introduces less error.
   The essential problem is that the so-called "downsampling" is not real sampling. The [code behind it](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L196) calculates precision, recall, etc. from the statistics (such as TP, FP, TN, FN) of all elements, not only the sampled ones.
   ```scala
   counts.mapPartitions(_.grouped(grouping.toInt).map { pairs =>
     // The score of the combined point will be just the first one's score
     val firstScore = pairs.head._1
     // The point will contain all counts in this chunk
     val agg = new BinaryLabelCounter()
     pairs.foreach(pair => agg += pair._2)
     (firstScore, agg)
   })
   ```
   As you can see, the counters (`BinaryLabelCounter`) of all elements in a chunk are merged into one, instead of returning only the first element's counter.
   Thus, by the definition of `threshold`, the score of the last element in each chunk (which is the minimal one, since elements are sorted by descending score) is the right threshold to use at inference time.
   In online systems, we need to choose the right threshold to predict whether an instance is positive (`score>=threshold`) or negative (`score<threshold`).
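
   To make the mismatch concrete, here is a minimal sketch in plain Scala (no Spark). The `Counts` case class is a hypothetical stand-in for `BinaryLabelCounter`, and the scores and labels are made-up illustrative data, not taken from any real run. It merges each chunk's counters the same way as the snippet above, then compares the cumulative counts with what the rule `score>=threshold` actually selects for the first and the last score of a chunk.

   ```scala
   // Hypothetical stand-in for Spark's BinaryLabelCounter:
   // number of positive and negative labels seen at one score.
   final case class Counts(pos: Long, neg: Long) {
     def +(other: Counts): Counts = Counts(pos + other.pos, neg + other.neg)
   }

   object ChunkThresholdSketch {
     // Illustrative data (an assumption): (score, counts), sorted by descending score.
     val counts: Seq[(Double, Counts)] = Seq(
       0.9 -> Counts(1, 0),
       0.8 -> Counts(1, 0),
       0.7 -> Counts(0, 1),
       0.6 -> Counts(1, 0),
       0.5 -> Counts(0, 1),
       0.4 -> Counts(0, 1)
     )

     // Merge the counters of each chunk, keeping both candidate thresholds:
     // (first score, last score, merged counts of the chunk).
     def chunked(grouping: Int): List[(Double, Double, Counts)] =
       counts.grouped(grouping).map { pairs =>
         val agg = pairs.map(_._2).reduce(_ + _)
         (pairs.head._1, pairs.last._1, agg)
       }.toList

     // Running totals over the chunks: the counts that precision,
     // recall, etc. would be computed from at each combined point.
     def cumulative(grouping: Int): List[(Double, Double, Counts)] = {
       val chunks = chunked(grouping)
       val running = chunks.map(_._3).scanLeft(Counts(0, 0))(_ + _).tail
       chunks.zip(running).map { case ((first, last, _), total) => (first, last, total) }
     }

     // Counts of the instances actually predicted positive by score >= threshold.
     def predictedPositive(threshold: Double): Counts =
       counts.filter(_._1 >= threshold).map(_._2).foldLeft(Counts(0, 0))(_ + _)

     def main(args: Array[String]): Unit =
       cumulative(3).foreach { case (first, last, total) =>
         println(s"chunk total=$total, at first score $first: ${predictedPositive(first)}, " +
           s"at last score $last: ${predictedPositive(last)}")
       }
   }
   ```

   With a chunk size of 3, the merged counts of the first chunk cover the scores 0.9, 0.8 and 0.7. Only the threshold 0.7 (the last score) selects exactly those instances; the threshold 0.9 (the first score) selects just one of them, so metrics reported at 0.9 would describe a different decision rule than the one applied online.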
