Posted to issues@spark.apache.org by "Shaochen Shi (JIRA)" <ji...@apache.org> on 2019/04/26 09:09:00 UTC

[jira] [Created] (SPARK-27577) Wrong thresholds selected by BinaryClassificationMetrics when downsampling

Shaochen Shi created SPARK-27577:
------------------------------------

             Summary: Wrong thresholds selected by BinaryClassificationMetrics when downsampling
                 Key: SPARK-27577
                 URL: https://issues.apache.org/jira/browse/SPARK-27577
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.4.2, 2.4.1, 2.4.0, 2.3.3, 2.3.2, 2.3.1, 2.3.0, 2.2.3, 2.2.2, 2.2.1, 2.2.0, 2.1.3, 2.1.2, 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0, 1.6.3, 1.6.2, 1.6.1, 1.6.0, 1.5.2, 1.5.1, 1.5.0, 1.4.1, 1.4.0, 1.3.1, 1.3.0
            Reporter: Shaochen Shi


In binary classification metrics, a threshold means that any instance with a score >= threshold is considered positive.

However, in the existing implementation:
 # If `numBins` is set when creating a `BinaryClassificationMetrics` object, all records (sorted by score in descending order) are grouped into chunks.
 # In each chunk, the statistics of its records are accumulated (in `BinaryLabelCounter`), and the first record's score is selected as the threshold.
 # The resulting sampled records form a new, smaller data set that is used to calculate the binary metrics.
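
The three steps above can be sketched in plain Python, without Spark. This is a simplified model for illustration only, not Spark's code: the function name `downsample_first_score` and the `chunk_size` parameter (standing in for whatever chunk size follows from `numBins`) are hypothetical, and the sample records are made up.

```python
from typing import List, Tuple

Record = Tuple[float, int]  # (score, label); label is 1 for positive, 0 for negative


def downsample_first_score(records: List[Record], chunk_size: int) -> List[Tuple[float, int, int]]:
    """Return one (threshold, positives, negatives) entry per chunk.

    Hypothetical helper: the threshold is the score of the chunk's first
    (highest-scoring) record, as described in step 2.
    """
    # Step 1: sort by score in descending order and split into chunks.
    ordered = sorted(records, key=lambda r: r[0], reverse=True)
    sampled = []
    for start in range(0, len(ordered), chunk_size):
        chunk = ordered[start:start + chunk_size]
        # Step 2: accumulate the label counts of every record in the chunk.
        positives = sum(label for _, label in chunk)
        negatives = len(chunk) - positives
        sampled.append((chunk[0][0], positives, negatives))
    # Step 3: the sampled entries form the smaller data set.
    return sampled


records = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 0), (0.5, 1), (0.4, 0)]
sampled = downsample_first_score(records, chunk_size=3)
print(sampled)
```

In this toy data set, the entry for threshold 0.9 carries the counts of the records scoring 0.8 and 0.7 as well, although neither of them has a score >= 0.9.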

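The bug is introduced in the second step. The threshold of a chunk is the score of its first (highest-scoring) record, but the statistics attached to it cover the whole chunk, including records whose scores are below that threshold. The threshold is therefore paired with wrong values, such as a larger `true positive` count and a smaller `false negative` count, when calculating `recallByThreshold`, `precisionByThreshold`, etc.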

Thus, the bug fix is straightforward: use the score of the last record in each chunk as the threshold when the statistics are merged.
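
As a rough check of this reasoning, the sketch below compares both threshold choices against exact true-positive counts on a small made-up data set. It is plain Python rather than Spark code; `exact_true_positives`, `sampled_true_positives`, `chunk_size` and `use_last_score` are hypothetical names, and the scores are assumed to be distinct.

```python
from typing import List, Tuple

Record = Tuple[float, int]  # (score, label); label is 1 for positive, 0 for negative


def exact_true_positives(records: List[Record], threshold: float) -> int:
    """Count positives with score >= threshold, i.e. the definition of a threshold."""
    return sum(1 for score, label in records if score >= threshold and label == 1)


def sampled_true_positives(records: List[Record], chunk_size: int,
                           use_last_score: bool) -> List[Tuple[float, int]]:
    """Hypothetical model of the downsampling: (threshold, cumulative TP) per chunk."""
    ordered = sorted(records, key=lambda r: r[0], reverse=True)
    result = []
    cumulative_tp = 0
    for start in range(0, len(ordered), chunk_size):
        chunk = ordered[start:start + chunk_size]
        # Counts are merged over the whole chunk in both variants.
        cumulative_tp += sum(label for _, label in chunk)
        # Only the choice of the representative score differs.
        threshold = chunk[-1][0] if use_last_score else chunk[0][0]
        result.append((threshold, cumulative_tp))
    return result


records = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 0), (0.5, 1), (0.4, 0)]
first = sampled_true_positives(records, chunk_size=3, use_last_score=False)
last = sampled_true_positives(records, chunk_size=3, use_last_score=True)

for name, sampled in (("first score", first), ("last score", last)):
    for threshold, tp in sampled:
        print(name, threshold, "sampled TP:", tp,
              "exact TP:", exact_true_positives(records, threshold))
```

With the first record's score as the threshold, the sampled true-positive counts exceed the exact ones in this example; with the last record's score, they agree, because every record counted in the chunk then has a score >= the threshold.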



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org