Posted to issues@spark.apache.org by "sam (JIRA)" <ji...@apache.org> on 2015/06/15 13:44:00 UTC

[jira] [Created] (SPARK-8375) BinaryClassificationMetrics in ML Lib has odd API

sam created SPARK-8375:
--------------------------

             Summary: BinaryClassificationMetrics in ML Lib has odd API
                 Key: SPARK-8375
                 URL: https://issues.apache.org/jira/browse/SPARK-8375
             Project: Spark
          Issue Type: Bug
          Components: MLlib
            Reporter: sam


According to https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

The constructor takes an `RDD[(Double, Double)]` of (score, label) pairs, which does not make sense; it should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`.
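For illustration, here is roughly how the current API is used (a minimal sketch, assuming a live SparkContext `sc`):

```
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Labels are conceptually categorical, yet they must be encoded as
// 0.0 / 1.0 Doubles to satisfy the RDD[(Double, Double)] constructor:
val scoreAndLabels = sc.parallelize(Seq(
  (0.9, 1.0), (0.7, 1.0), (0.2, 0.0), (0.1, 0.0)  // (score, label)
))
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
```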

In scikit-learn I believe the number of unique scores is used to determine the thresholds, and hence the points on the ROC curve.  I assume this is what BinaryClassificationMetrics is doing too, since the docs make no mention of buckets.  In a Big Data context this does not make sense, as the number of unique scores may be huge.
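To see the scale problem concretely (reusing `scoreAndLabels` from the sketch above):

```
// With continuous scores nearly every record has a unique score, so a
// per-unique-score ROC would have roughly one point per record:
val numThresholds = scoreAndLabels.map(_._1).distinct().count()
```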

Rather, the user should be able to specify either the number of buckets or the number of data points in each bucket, e.g. `def roc(numPtsPerBucket: Int)`.
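A sketch of what such a bucketed method might look like (purely hypothetical, not actual MLlib code; assumes labels are 0.0/1.0 and the bucket count is small enough to collect to the driver):

```
import org.apache.spark.rdd.RDD

// Hypothetical: one ROC point per bucket of ~numPtsPerBucket points,
// rather than one per unique score.
def roc(scoreAndLabels: RDD[(Double, Double)],
        numPtsPerBucket: Int): Seq[(Double, Double)] = {
  val perBucket = scoreAndLabels
    .sortBy(-_._1)                                  // descending by score
    .zipWithIndex()
    .map { case ((_, label), i) =>
      (i / numPtsPerBucket, (label, 1.0 - label))   // (positive, negative)
    }
    .reduceByKey { case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2) }
    .sortByKey()
    .values
    .collect()                                      // one small entry per bucket
  val totalPos = perBucket.map(_._1).sum
  val totalNeg = perBucket.map(_._2).sum
  // Cumulative counts down the score ordering give one (FPR, TPR) per bucket.
  perBucket
    .scanLeft((0.0, 0.0)) { case ((tp, fp), (p, n)) => (tp + p, fp + n) }
    .drop(1)
    .map { case (tp, fp) => (fp / totalNeg, tp / totalPos) }
    .toSeq
}
```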

Finally, it would then be good if either the ROC output type were changed or another method were added that returns confusion matrices, so that the exact integer counts can be obtained.  E.g.

```
case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
  // plus methods for each of the measures in the table at
  // https://en.wikipedia.org/wiki/Receiver_operating_characteristic, e.g.:
  def tpr: Double = tp.toDouble / (tp + fn)  // true positive rate (recall)
  def fpr: Double = fp.toDouble / (fp + tn)  // false positive rate
}

...

def confusions(numPtsPerBucket: Int): RDD[Confusion]
```
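
Hypothetical usage of such a `confusions` method (assuming the sketch above; this method does not exist in MLlib today):

```
// Exact integer counts per threshold bucket; derived measures like
// precision no longer need to be reverse-engineered from rates:
val cms: RDD[Confusion] = metrics.confusions(numPtsPerBucket = 1000)
val precisions = cms.map(c => c.tp.toDouble / (c.tp + c.fp))
```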
