Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/04 13:15:58 UTC

[GitHub] [spark] srowen edited a comment on issue #24470: [SPARK-27577][MLlib] Correct thresholds downsampled in BinaryClassificationMetrics

URL: https://github.com/apache/spark/pull/24470#issuecomment-489326110
 
 
   I still don't see the argument that the first or the last score is better. They are simply the two endpoints of the range of scores within the bin, and as the number of bins increases, that range shrinks. If you are worried about this difference, you need more bins. Your argument cuts both ways: a slightly higher threshold than desired can cause as many problems as a slightly lower one.
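   As a hypothetical illustration (scores made up, not taken from this PR), if one bin holds the descending scores below, then "first" and "last" are just the two endpoints of its range, and either one is off by at most the width of the bin:
   
       // hypothetical bin of scores, already sorted in descending order
       val bin = Seq(0.93, 0.91, 0.88, 0.87)
       bin.head  // 0.93 -- slightly higher than most scores in the bin
       bin.last  // 0.87 -- slightly lower than most scores in the bin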
   
   What might be better here is to compute the score of a bin as a weighted average of its elements. That would be OK, though you'd have to change many tests. I think the current implementation is designed to match scikit-learn (?)
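   A minimal sketch of that idea, assuming each bin is available as (score, count) pairs (the helper and its input shape are hypothetical, not the actual MLlib code):
   
       // representative threshold for a bin: the count-weighted mean of its scores
       def binThreshold(bin: Seq[(Double, Long)]): Double = {
         val total = bin.map(_._2).sum.toDouble
         bin.map { case (score, count) => score * count }.sum / total
       }
   
       // e.g. a bin with score 0.91 seen 3 times and 0.87 seen once
       binThreshold(Seq((0.91, 3L), (0.87, 1L)))  // 0.90, between the endpoints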
