Posted to reviews@spark.apache.org by schmit <gi...@git.apache.org> on 2014/03/17 03:55:35 UTC

[GitHub] spark pull request: ROC area under the curve for binary classifica...

GitHub user schmit opened a pull request:

    https://github.com/apache/spark/pull/160

    ROC area under the curve for binary classification

    Implementation of the receiver operating characteristic (ROC) area under the curve (AUC) goodness-of-fit measure for binary classification.
    
    JIRA:
    https://spark-project.atlassian.net/browse/MLLIB-23
    
    References:
    http://web.engr.oregonstate.edu/~tgd/classes/534/slides/part13.pdf, slide 11.
    http://cs.ru.nl/~tomh/onderwijs/dm/dm_files/roc_auc.pdf (@srowen)
    
    Incubator-spark pull request (old):
    https://github.com/apache/incubator-spark/pull/550
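
    For readers of the thread, a minimal local-collection sketch of the rank-sum (Mann-Whitney U) way of computing AUC that the patch uses; the names here are illustrative only, not this PR's API:

    ```
    // AUC from the rank-sum statistic: sort by score, sum the 1-based ranks of the
    // positives, then normalize by nPos * nNeg.
    def aucByRankSum(scoreAndLabel: Seq[(Double, Double)]): Double = {
      val nObs = scoreAndLabel.size.toDouble
      val nPos = scoreAndLabel.count(_._2 == 1.0).toDouble
      // rank observations by ascending score (rank 1 = lowest score)
      val ranked = scoreAndLabel.sortBy(_._1).zipWithIndex
      val sumPosRanks = ranked.collect { case ((_, label), idx) if label == 1.0 => idx + 1.0 }.sum
      if (nPos > 0 && nObs > nPos) (sumPosRanks - nPos * (nPos + 1) / 2) / (nPos * (nObs - nPos))
      else 0.0
    }

    // Perfectly separated scores give AUC = 1.0:
    // aucByRankSum(Seq((0.1, 0.0), (0.2, 0.0), (0.8, 1.0), (0.9, 1.0)))
    ```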

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/schmit/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/160.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #160
    
----
commit 9b0e05147f5baceb3c63d5d5c5ad3a4dacc6a3d8
Author: schmit <sc...@stanford.edu>
Date:   2014-03-14T21:49:39Z

    Copy from incubator-spark
    
    Still have to remove the dataset tests from the unit tests

commit 22e56f229c088c1c405944ebfcd3ffac1a41c518
Author: schmit <sc...@stanford.edu>
Date:   2014-03-17T02:39:40Z

    Remove the data tests

commit 6aa93f84c5edbdfadae8554e100a3d466de08209
Author: schmit <sc...@stanford.edu>
Date:   2014-03-17T02:42:43Z

    Merge remote-tracking branch 'upstream/master'

commit ffae83b6602df83f3422f66e7de419eb5dd83d75
Author: schmit <sc...@stanford.edu>
Date:   2014-03-17T02:46:48Z

    remove explicit margin LR predictScore

commit ba7de4d2daffa084988e4f2a7eb7d8a37f5f015c
Author: schmit <sc...@stanford.edu>
Date:   2014-03-17T02:49:03Z

    removed comment

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/160#issuecomment-38881874
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by schmit <gi...@git.apache.org>.
Github user schmit commented on a diff in the pull request:

    https://github.com/apache/spark/pull/160#discussion_r10732053
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/BinaryClassificationModel.scala ---
    @@ -0,0 +1,68 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.classification
    +
    +
    +import org.apache.spark.SparkContext
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.regression._
    +import org.apache.spark.rdd.RDD
    +
    +trait BinaryClassificationModel extends ClassificationModel {
    +  /**
    +   * Return true labels and prediction scores in an RDD
    +   *
    +   * @param input RDD with labelled points to use for the evaluation
    +   * @return RDD[(Double, Double)] Contains a pair of (label, probability)
    +   *         where probability is the probability the model assigns to
    +   *         the label being 1.
    +   */ 
    +  def scoreForEval(input: RDD[LabeledPoint]) : RDD[(Double, Double)] = {
    +    val predictionAndLabel = input.map { point =>
    +        val scores = score(point.features)
    +        (scores, point.label)
    +    }
    +    predictionAndLabel
    +  }
    +
    +  /**
    +   * Evaluate the performance of the model using the score assigned by the model
    +   * to observations and the true label.
    +   * Returns the Receiver operating characteristic area under the curve.
    +   * Note that we consider the prediction of a label to be 0 if the score is less than 0,
    +   * and we predict label 1 if the score is larger than 0.
    +   *
    +   * @param predictionAndLabel RDD with (score by model, true label)
    +   * @return Double Area under curve of ROC
    +   */ 
    +  def areaUnderROC(predictionAndLabel: RDD[(Double, Double)]) : Double = {
    +    val nObs = predictionAndLabel.count
    +    val nPos = predictionAndLabel.filter(x => x._2 == 1.0).count
    --- End diff --
    
    I didn't know that, but upon trying, it does not seem to work in this case.



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/160#issuecomment-38319702
  
    It's all functionally the same, sure. I'm launching a crusade against using doubles for labels (https://spark-project.atlassian.net/browse/MLLIB-29), so this would change again to `== "true"` or something anyway if that happened.



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/160#issuecomment-37783918
  
    If I may wade in with a comment: I am not clear that a `BinaryClassificationModel` is needed. What it adds, the score method, seems to just return 0/1 depending on the predicted class. The result is already 0/1, or a probability, in which case this adds little. The ROC calculation feels like it does not require a superclass like this just for its own sake.
    
    (Separately: I think there is a design problem here with `ClassificationModel` outputting `Double`. Classifiers output a class, which is an enumerated, opaque value, or a distribution of probabilities over classes, which is a mapping of opaque values to numbers. It so happens that you can map opaque values to 0, 1, 2. And it so happens these are not only ints but real values. And it so happens that you can map the distribution over 2 classes into a single number in [0,1]. And it so happens that these can all be represented by a `Double`, just like the output of a `RegressionModel`. But it's severe overloading that will cause tears later. This is a separate issue, though. A design change along these lines *might* necessitate a special subclass for binary classifiers, but I'm not sure.)
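
    To make the separation being argued for concrete, here is a small sketch of what "opaque label" versus "distribution over labels" could look like; this is only an illustration, not an MLlib API:

    ```
    sealed trait Label
    case object Positive extends Label
    case object Negative extends Label

    trait TypedClassifier[F] {
      def predict(features: F): Label                             // the most likely class
      def predictProbabilities(features: F): Map[Label, Double]   // distribution over classes
    }
    ```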



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by schmit <gi...@git.apache.org>.
Github user schmit commented on the pull request:

    https://github.com/apache/spark/pull/160#issuecomment-38003806
  
    On your more general remarks @srowen:
    
    I think those are valid concerns; here is my reasoning for doing it this way:
    The predict function returns the label, but I need the predicted "score" or predicted probability (in case of LR) of the test samples in order to sort them.
    
    Also, more generally, this seems like a useful function to have. I do not want to change the predict function, since that is probably what is most used and wanted, and it would be annoying to have to convert the score into a label by hand, and only in the binary classification setting.
    
    However, this score function only makes sense in the binary classification setting, and so does ROC AUC. Later I hope to add the PR AUC as well, and that can be added to the same class, but first things first.
    
    The alternative is to define this function for both LR and SVM separately, but I don't like that either.
    
    However, I do agree it is not the cleanest code, so your suggestions are very welcome.



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by schmit <gi...@git.apache.org>.
Github user schmit commented on the pull request:

    https://github.com/apache/spark/pull/160#issuecomment-38316626
  
    Sorry, I am blind. Anyway, do you agree that > 0.5 makes the most sense?



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/160#issuecomment-38009810
  
    Yeah, actually this is a good partial step toward distinguishing three different things: giving an opaque score to an example/label, assigning a probability to an example/label, and choosing the most likely label. The score/predict distinction is not specific to binary classifiers. I suppose I'm saying: I may propose a PR that would change this further; would that be OK? I think `BinaryClassificationModel` might stay useful as at least a marker trait. Yes, AUC is useful.



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/160#discussion_r10645689
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/BinaryClassificationModel.scala ---
    [...]
    +  def areaUnderROC(predictionAndLabel: RDD[(Double, Double)]) : Double = {
    +    val nObs = predictionAndLabel.count
    +    val nPos = predictionAndLabel.filter(x => x._2 == 1.0).count
    --- End diff --
    
    count can take a predicate directly, no? or am I thinking of just plain Scala
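
    Quick check on that: the predicate-taking `count` is on plain Scala collections; `RDD.count()` takes no arguments, so on an RDD the `filter(...).count` form is the one that works. Using `predictionAndLabel` from the diff:

    ```
    // Plain Scala collections accept a predicate:
    Seq(0.0, 1.0, 1.0).count(_ == 1.0)               // 2

    // Spark RDDs have no count(predicate), so filter first:
    predictionAndLabel.filter(_._2 == 1.0).count()
    ```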



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/160#discussion_r10734041
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/BinaryClassificationModel.scala ---
    [...]
    +  def areaUnderROC(predictionAndLabel: RDD[(Double, Double)]) : Double = {
    +    val nObs = predictionAndLabel.count
    +    val nPos = predictionAndLabel.filter(x => x._2 == 1.0).count
    +    // sort according to the predicted score and add indices
    +    val sortedPredictionsWithIndex = predictionAndLabel.sortByKey(true).zipWithIndex
    +    // sum of the positive ranks
    +    val sumPosRanks = sortedPredictionsWithIndex.filter(x => (x._1)._2 > 0).map(x => x._2 + 1).sum
    +    // if there are no positive or no negative labels, the area under the curve is not defined.
    +    // Return 0 in that case.
    +    if ((nPos > 0) && (nObs > nPos)) {
    +          (sumPosRanks - nPos * (nPos + 1) / 2) / (nPos * (nObs - nPos))
    --- End diff --
    
    No I mean that this expression will be incorrect well before the `Long` values themselves overflow:
    
    ```
    scala> val nPos = 4000000000L
    nPos: Long = 4000000000
    scala> nPos * (nPos + 1) / 2
    res1: Long = -1223372034854775808
    ```
    (I should have said that the problem occurs a little past 3B, not 2B.) Sure, the 64-bit integer overflows eventually, but I'm not worried about quintillions of elements just yet; having a few billion elements, though, is quite possible.
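
    For comparison, the same expression with the count converted to `Double` first (as the revised diff does with `.count.toDouble`) stays in range:

    ```
    scala> val nPos = 4000000000L.toDouble
    nPos: Double = 4.0E9
    scala> nPos * (nPos + 1) / 2
    res2: Double = 8.000000002E18
    ```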



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by schmit <gi...@git.apache.org>.
Github user schmit commented on the pull request:

    https://github.com/apache/spark/pull/160#issuecomment-38319124
  
    They are indeed either 0.0 or 1.0, as they are the true labels. However, maybe > 0.5 is a bit safer, since these are implemented as doubles? That's my argument for using > 0.5.



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/160#discussion_r10734415
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/BinaryClassificationModel.scala ---
    [...]
    +  def areaUnderROC(predictionAndLabel: RDD[(Double, Double)]) : Double = {
    +    val nObs = predictionAndLabel.count.toDouble
    +    val nPos = predictionAndLabel.filter(x => x._2 == 1.0).count.toDouble
    +    // sort according to the predicted score and add indices
    +    val sortedPredictionsWithIndex = predictionAndLabel.sortByKey(true).zipWithIndex
    +    // sum of the positive ranks
    +    val sumPosRanks = sortedPredictionsWithIndex.filter(x => (x._1)._2 > 0).map(x => x._2 + 1).sum
    --- End diff --
    
    Unless I'm misreading, in this line, positive is defined as "> 0" but it's defined as "== 1" above. The latter might be the safer definition.



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/160#issuecomment-37783040
  
    Can one of the admins verify this patch?



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by schmit <gi...@git.apache.org>.
Github user schmit commented on a diff in the pull request:

    https://github.com/apache/spark/pull/160#discussion_r10731686
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/BinaryClassificationModel.scala ---
    [...]
    +  def areaUnderROC(predictionAndLabel: RDD[(Double, Double)]) : Double = {
    +    val nObs = predictionAndLabel.count
    +    val nPos = predictionAndLabel.filter(x => x._2 == 1.0).count
    +    // sort according to the predicted score and add indices
    +    val sortedPredictionsWithIndex = predictionAndLabel.sortByKey(true).zipWithIndex
    +    // sum of the positive ranks
    +    val sumPosRanks = sortedPredictionsWithIndex.filter(x => (x._1)._2 > 0).map(x => x._2 + 1).sum
    +    // if there are no positive or no negative labels, the area under the curve is not defined.
    +    // Return 0 in that case.
    +    if ((nPos > 0) && (nObs > nPos)) {
    +          (sumPosRanks - nPos * (nPos + 1) / 2) / (nPos * (nObs - nPos))
    --- End diff --
    
    If this is true, then that's more a general problem with count than with this implementation, isn't it? Shouldn't the implementation of count for RDD take this into account?



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by schmit <gi...@git.apache.org>.
Github user schmit commented on a diff in the pull request:

    https://github.com/apache/spark/pull/160#discussion_r10734227
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/BinaryClassificationModel.scala ---
    [...]
    +  def areaUnderROC(predictionAndLabel: RDD[(Double, Double)]) : Double = {
    +    val nObs = predictionAndLabel.count
    +    val nPos = predictionAndLabel.filter(x => x._2 == 1.0).count
    +    // sort according to the predicted score and add indices
    +    val sortedPredictionsWithIndex = predictionAndLabel.sortByKey(true).zipWithIndex
    +    // sum of the positive ranks
    +    val sumPosRanks = sortedPredictionsWithIndex.filter(x => (x._1)._2 > 0).map(x => x._2 + 1).sum
    +    // if there are no positive or no negative labels, the area under the curve is not defined.
    +    // Return 0 in that case.
    +    if ((nPos > 0) && (nObs > nPos)) {
    +          (sumPosRanks - nPos * (nPos + 1) / 2) / (nPos * (nObs - nPos))
    --- End diff --
    
    I see, thanks! Fixed it.



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by schmit <gi...@git.apache.org>.
Github user schmit closed the pull request at:

    https://github.com/apache/spark/pull/160



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/160#discussion_r10645721
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/BinaryClassificationModel.scala ---
    [...]
    +  def areaUnderROC(predictionAndLabel: RDD[(Double, Double)]) : Double = {
    +    val nObs = predictionAndLabel.count
    +    val nPos = predictionAndLabel.filter(x => x._2 == 1.0).count
    +    // sort according to the predicted score and add indices
    +    val sortedPredictionsWithIndex = predictionAndLabel.sortByKey(true).zipWithIndex
    +    // sum of the positive ranks
    +    val sumPosRanks = sortedPredictionsWithIndex.filter(x => (x._1)._2 > 0).map(x => x._2 + 1).sum
    +    // if there are no positive or no negative labels, the area under the curve is not defined.
    +    // Return 0 in that case.
    +    if ((nPos > 0) && (nObs > nPos)) {
    +          (sumPosRanks - nPos * (nPos + 1) / 2) / (nPos * (nObs - nPos))
    --- End diff --
    
    I think int overflow will be an issue here, even though `nPos` and `nObs` are `Long`. For massive RDDs of more than about 2 billion elements this could fail. Better to make the values `Double` to be safe since they're used that way.



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/160#issuecomment-38318849
  
    These are supposedly all 0.0 or 1.0 since they're really labels, not numeric values. Given that, I reverse myself and suppose that "== 1" is maybe best. You could say that "> 0.5" is more flexible in case someone passes in things that aren't quite 0/1 labels but probabilities instead. That's a decent argument, but I wonder if it's just inviting abuse of the fact that labels happen to be floating-point values now.
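
    A small illustration of the trade-off (plain Scala, made-up values): for true 0.0/1.0 labels the two tests agree, since 0.0 and 1.0 are exactly representable doubles; they only diverge if non-label values sneak in:

    ```
    val labels = Seq(0.0, 1.0, 1.0)
    labels.count(_ == 1.0)   // 2
    labels.count(_ > 0.5)    // 2

    val probs = Seq(0.3, 0.7, 0.9)   // probabilities, not labels
    probs.count(_ == 1.0)    // 0
    probs.count(_ > 0.5)     // 2
    ```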



[GitHub] spark pull request: ROC area under the curve for binary classifica...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/160#discussion_r10734471
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/BinaryClassificationModel.scala ---
    [...]
    +  def areaUnderROC(predictionAndLabel: RDD[(Double, Double)]) : Double = {
    +    val nObs = predictionAndLabel.count.toDouble
    +    val nPos = predictionAndLabel.filter(x => x._2 == 1.0).count.toDouble
    +    // sort according to the predicted score and add indices
    +    val sortedPredictionsWithIndex = predictionAndLabel.sortByKey(true).zipWithIndex
    --- End diff --
    
    Final comment, I promise. There's an interesting question of what to do about ties in the keys. If all of my predictions are 0/1, for example, then the ranking of 0/1 true labels within those predictions ends up being pretty arbitrary, and so does the AUC. I don't think this formula deals with that case, and you could argue it's a degenerate case or argue it's actually somewhat common. To make it at least deterministic, maybe a secondary sort by true label? I don't know whether to be optimistic or pessimistic, and sort the positives first or last. Tough one.
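
    One possible sketch of that secondary sort, not part of this patch; which tie policy to pick is still the open question:

    ```
    // Sort by (score, label) so ties in score are broken deterministically by label.
    // With ascending order, positives land after negatives at a tied score, which is
    // the optimistic choice; swap the secondary key for the pessimistic one.
    val sortedPredictionsWithIndex = predictionAndLabel
      .map { case (score, label) => ((score, label), ()) }
      .sortByKey(ascending = true)   // tuple ordering: score first, then label
      .keys
      .zipWithIndex
    ```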

