Posted to issues@spark.apache.org by "sam (JIRA)" <ji...@apache.org> on 2015/06/15 13:45:02 UTC

[jira] [Updated] (SPARK-8375) BinaryClassificationMetrics in ML Lib has odd API

     [ https://issues.apache.org/jira/browse/SPARK-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sam updated SPARK-8375:
-----------------------
    Description: 
According to https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

The constructor takes `RDD[(Double, Double)]`, which does not make sense; it should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`.

In scikit-learn, I believe the number of unique scores determines the number of thresholds and hence the ROC curve.  I assume this is what BinaryClassificationMetrics does too, since its documentation makes no mention of buckets.  In a Big Data context this does not make sense, as the number of unique scores may be huge.

Rather, the user should be able to specify either the number of buckets or the number of data points in each bucket, e.g. `def roc(numPtsPerBucket: Int)`.

Finally, it would be good if either the ROC output type were changed or another method were added that returns confusion matrices, so that the hard integer counts can be obtained.  E.g.

```
case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
  // methods for the derived measures (TPR, FPR, precision, etc.) in the table at https://en.wikipedia.org/wiki/Receiver_operating_characteristic
}

...
def confusions(numPtsPerBucket: Int): RDD[Confusion]
```
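
For concreteness, here is a minimal sketch of how such a `confusions` method might work over an `RDD[(Double, Int)]` of (score, 0/1 label) pairs. The bucketing strategy here (sort by descending score, fixed-size buckets, one cumulative confusion matrix per bucket boundary) is only my assumption of an implementation, not an existing Spark API, and it widens the counts to `Long` since `tp + fn` can overflow `Int` at this scale:

```
import org.apache.spark.rdd.RDD

// Long counts rather than Int: on big data tp + fn can exceed Int range.
case class Confusion(tp: Long, fp: Long, fn: Long, tn: Long) {
  def tpr: Double = tp.toDouble / (tp + fn) // recall / sensitivity
  def fpr: Double = fp.toDouble / (fp + tn) // fall-out
}

// Hypothetical implementation: labels are assumed to be 0 or 1, and each
// bucket boundary is treated as a candidate threshold.
def confusions(scoreAndLabel: RDD[(Double, Int)], numPtsPerBucket: Int): RDD[Confusion] = {
  val totalPos = scoreAndLabel.filter(_._2 == 1).count()
  val totalNeg = scoreAndLabel.count() - totalPos

  // Rank points by descending score, group into fixed-size buckets, and
  // count the positives and negatives that fall in each bucket.
  val perBucket = scoreAndLabel
    .sortBy(_._1, ascending = false)
    .zipWithIndex()
    .map { case ((_, label), idx) =>
      (idx / numPtsPerBucket, (label.toLong, 1L - label))
    }
    .reduceByKey { case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2) }

  // Cumulative counts: at threshold k, everything in buckets 0..k is
  // predicted positive. collect() is safe precisely because bucketing
  // keeps the number of buckets small, unlike the number of unique scores.
  val cumulative = perBucket.collect().sortBy(_._1)
    .scanLeft((0L, 0L)) { case ((cp, cn), (_, (p, n))) => (cp + p, cn + n) }
    .tail

  scoreAndLabel.sparkContext.parallelize(
    cumulative.map { case (cumPos, cumNeg) =>
      Confusion(tp = cumPos, fp = cumNeg, fn = totalPos - cumPos, tn = totalNeg - cumNeg)
    })
}
```

With e.g. `confusions(scoresAndLabels, numPtsPerBucket = 1000)`, a billion points produce only a million confusion matrices regardless of how many distinct scores there are, and ROC points then fall out directly as `(c.fpr, c.tpr)`.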




> BinaryClassificationMetrics in ML Lib has odd API
> -------------------------------------------------
>
>                 Key: SPARK-8375
>                 URL: https://issues.apache.org/jira/browse/SPARK-8375
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>            Reporter: sam
>


