Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/08/22 09:23:00 UTC

[jira] [Commented] (SPARK-21806) BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading

    [ https://issues.apache.org/jira/browse/SPARK-21806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136554#comment-16136554 ] 

Sean Owen commented on SPARK-21806:
-----------------------------------

I agree. Recall can only be 0 if the top-ranked examples are actually negative, and in that case precision is *0*; the only exception is when the classifier has labeled nothing at all as positive, in which case precision is 0/0, i.e. undefined.

More typically, the top-ranked examples are positive, in which case precision is indeed 1 for small values of recall, but it is again undefined when recall = 0, because that would mean nothing at all is labeled positive.
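
To make the 0/0 case concrete, here's a tiny threshold sweep over a ranked list (plain Scala, not the MLlib implementation; the data and names are made up for illustration):

{code:scala}
// Ranked (score, label) pairs, highest score first; label 1.0 = positive.
val ranked = Seq((0.9, 1.0), (0.8, 0.0), (0.7, 1.0), (0.6, 0.0))
val totalPositives = ranked.count(_._2 == 1.0)

// Treat the top k examples as "predicted positive" for k = 0..n.
for (k <- 0 to ranked.length) {
  val tp = ranked.take(k).count(_._2 == 1.0)
  val recall = tp.toDouble / totalPositives
  val precision =
    if (k == 0) Double.NaN            // 0/0: nothing labeled positive
    else tp.toDouble / k
  println(s"k=$k recall=$recall precision=$precision")
}
{code}

At k = 0, recall is 0 and precision is 0/0; the first well-defined point, at k = 1, is (recall, precision) = (0.5, 1.0) here, not (0.0, 1.0).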

I see why it's intuitive to include (0,1) on the curve but agree it can't actually occur!

The issue with excluding it is that it may leave no point at all with recall = 0, which means the area under the curve is computed over a slightly smaller range of recall values, [min(recall), 1] instead of [0, 1].

What about defining precision at recall = 0, if it doesn't exist, to be the precision at the minimum recall value? Or, 0 or 1, whichever is closer?
I'd love to be consistent with another implementation here. Am I reading it right that scikit-learn will just accept that there's no point at recall = 0?
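
Roughly what I mean, sketched against a plain in-memory sequence of (recall, precision) points rather than the actual RDD code in BinaryClassificationMetrics (the names here are illustrative only):

{code:scala}
// curve: (recall, precision) points in order of increasing recall,
// as produced by the threshold sweep, with no artificial first point.
def anchorAtZeroRecall(curve: Seq[(Double, Double)]): Seq[(Double, Double)] =
  curve match {
    case (minRecall, minPrecision) +: _ if minRecall > 0.0 =>
      // No point exists at recall = 0: reuse the precision observed at the
      // smallest recall instead of pinning it to 1.0.
      (0.0, minPrecision) +: curve
    case _ =>
      curve
  }
{code}

That would keep the area computed over [0, 1] without claiming a precision of 1.0 that the classifier never actually achieved.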

CC [~mengxr] -- this is code from a really long time ago. I hesitate to change the behavior but this could be construed as a clear fix.

> BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading
> ----------------------------------------------------------------------
>
>                 Key: SPARK-21806
>                 URL: https://issues.apache.org/jira/browse/SPARK-21806
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.2.0
>            Reporter: Marc Kaminski
>            Priority: Minor
>
> I would like to refer to a [discussion in scikit-learn|https://github.com/scikit-learn/scikit-learn/issues/4223], as this behavior is probably based on the scikit-learn implementation. 
> Summary: 
> Currently, the y-axis intercept of the precision-recall curve is set to (0.0, 1.0). This behavior is not ideal in certain edge cases (see the example below) and can also have an impact on cross-validation when the optimization metric is set to "areaUnderPR". 
> Please consider [blucena's post|https://github.com/scikit-learn/scikit-learn/issues/4223#issuecomment-215273613] for possible alternatives. 
> Edge case example: 
> Consider a bad classifier that assigns a high probability to all samples. A possible output might look like this: 
> ||Real label || Score ||
> |1.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 0.95 |
> |0.0 | 0.95 |
> |1.0 | 1.0 |
> This results in the following PR points (the first line is the point set by default): 
> ||Threshold || Recall ||Precision ||
> |1.0 | 0.0 | 1.0 | 
> |0.95| 1.0 | 0.2 |
> |0.0| 1.0 | 0.16 |
> The auPRC would be around 0.6. Classifiers with a more differentiated probability assignment will then wrongly appear to perform worse with respect to this auPRC.
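> A rough way to reproduce this with the current API (run in spark-shell, so {{sc}} is in scope; the rows mirror the table above):
> {code:scala}
> import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
>
> // (score, label) pairs: 2 positives, 10 negatives, all scored very high.
> val scoreAndLabels = sc.parallelize(
>   Seq((1.0, 1.0)) ++ Seq.fill(8)((1.0, 0.0)) ++ Seq.fill(2)((0.95, 0.0)) ++ Seq((1.0, 1.0))
> )
>
> val metrics = new BinaryClassificationMetrics(scoreAndLabels)
> metrics.pr().collect().foreach(println)   // first point is the default (0.0, 1.0)
> println(metrics.areaUnderPR())            // ~0.6, inflated by that first point
> {code}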


