Posted to user@spark.apache.org by Sam <sa...@gmail.com> on 2015/07/28 14:44:20 UTC

Re: *Metrics API is odd in MLLib

Hi Xiangrui & Spark People,

I recently got round to writing an evaluation framework for Spark that I
was hoping to PR into MLLib and this would solve some of the aforementioned
issues.  I have put the code on github in a separate repo for now as I
would like to get some sandboxed feedback.  The repo, complete with detailed
documentation, can be found here: https://github.com/samthebest/sceval.

Many thanks,

Sam



On Thu, Jun 18, 2015 at 11:00 AM, Sam <sa...@gmail.com> wrote:

> Firstly apologies for the header of my email containing some junk, I
> believe it's due to a copy and paste error on a smart phone.
>
> Thanks for your response.  I will indeed make the PR you suggest, though
> glancing at the code I realize it's not just a case of making these public
> since the types are also private. Then, there is certain functionality I
> will be exposing, which then ought to be tested, e.g. every bin except
> potentially the last will have an equal number of data points in it*.  I'll
> get round to it at some point.
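>
> As an aside, the bin-size property above can be sketched like so (purely
> illustrative; `binSizes` is a made-up helper, not Spark code):
>
> ```
> // Illustrative only: models splitting n points into numBins bins, where
> // every bin except possibly the last holds ceil(n / numBins) points.
> def binSizes(n: Int, numBins: Int): Seq[Int] = {
>   val binSize = (n + numBins - 1) / numBins  // ceiling division
>   (1 to n).grouped(binSize).map(_.size).toSeq
> }
> ```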
>
> As for BinaryClassificationMetrics using Double for labels, thanks for the
> explanation.  If I were to make a PR to encapsulate the underlying
> implementation (that uses LabeledPoint) and change the type to Boolean,
> what would be the impact on versioning (since I'd be changing the public API)?
> An alternative would be to create a new wrapper class, say
> BinaryClassificationMeasures, and deprecate the old with the intention of
> migrating all the code into the new class.
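>
> To make the wrapper idea concrete, a rough sketch (the class name and its
> shape are hypothetical, and a plain Seq stands in for the RDD):
>
> ```
> // Hypothetical sketch only: BinaryClassificationMeasures is a proposed
> // name, not an existing Spark class; Seq stands in for RDD so the idea
> // can be shown without a SparkContext.
> class BinaryClassificationMeasures(scoreAndLabels: Seq[(Double, Boolean)]) {
>   // Delegate to the existing Double-labelled representation by mapping
>   // Boolean labels to 1.0 / 0.0, as the current API expects.
>   val asDoubleLabels: Seq[(Double, Double)] =
>     scoreAndLabels.map { case (score, label) => (score, if (label) 1.0 else 0.0) }
> }
> ```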
>
> * Maybe some other part of the code base tests this, since this assumption
> must hold in order to average across folds in x-validation?
>
> On Thu, Jun 18, 2015 at 1:02 AM, Xiangrui Meng <me...@gmail.com> wrote:
>
>> LabeledPoint was used for both classification and regression, where label
>> type is Double for simplicity. So in BinaryClassificationMetrics, we still
>> use Double for labels. We compute the confusion matrix at each threshold
>> internally, but this is not exposed to users (
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L127).
>> Feel free to submit a PR to make it public. -Xiangrui
>>
>> On Mon, Jun 15, 2015 at 7:13 AM, Sam <sa...@gmail.com> wrote:
>>
>>>
>>> According to
>>> https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
>>>
>>> The constructor takes `RDD[(Double, Double)]`, meaning labels are
>>> Doubles; this seems odd, shouldn't it be Boolean?  Similarly for
>>> MultilabelMetrics (i.e. it should be RDD[(Array[Double], Array[Boolean])]),
>>> and for MulticlassMetrics shouldn't the type of both be generic?
>>>
>>> Additionally it would be good if either the ROC output type was changed
>>> or another method was added that returned confusion matrices, so that the
>>> hard integer values can be obtained before the divisions. E.g.
>>>
>>> ```
>>> case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
>>>   // methods for each of the measures in the table at
>>>   // https://en.wikipedia.org/wiki/Receiver_operating_characteristic
>>> }
>>> ...
>>> def confusions(): RDD[Confusion]
>>> ```
>>>
>>
>>
>