Posted to reviews@spark.apache.org by mengxr <gi...@git.apache.org> on 2014/04/09 04:26:52 UTC

[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

GitHub user mengxr opened a pull request:

    https://github.com/apache/spark/pull/364

    [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve and BinaryClassificationEvaluator

    This PR implements a generic version of `AreaUnderCurve` using the `RDD.sliding` implementation from https://github.com/apache/spark/pull/136 . It also contains refactoring of https://github.com/apache/spark/pull/160 for binary classification evaluation.
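    As a rough illustration of the sliding-window idea described above, here is a minimal sketch on a local Scala collection rather than an RDD. It assumes the curve is a sequence of (x, y) points already sorted by x, and it sums the trapezoid under each pair of consecutive points. The names `AreaUnderCurveSketch`, `trapezoid`, and `of` are made up for this example and are not claimed to match the code in the PR.

```scala
// Sketch only: a local-collection analogue of computing the area under a
// curve from consecutive point pairs. Names here are hypothetical.
object AreaUnderCurveSketch {

  /** Area of the trapezoid between two consecutive (x, y) points. */
  def trapezoid(p: (Double, Double), q: (Double, Double)): Double =
    (q._1 - p._1) * (p._2 + q._2) / 2.0

  /**
   * Sums trapezoids over windows of two consecutive points.
   * Assumes the points are sorted by x. Windows with fewer than two
   * points (empty or single-point input) contribute nothing.
   */
  def of(curve: Seq[(Double, Double)]): Double =
    curve.sliding(2).collect { case Seq(p, q) => trapezoid(p, q) }.sum
}
```

    For the points (0, 0), (0, 1), (1, 1), `AreaUnderCurveSketch.of` returns 1.0, the area under a perfect ROC curve. A distributed version needs windows that span partition boundaries, which is presumably what `RDD.sliding` is for.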

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark auc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/364.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #364
    
----
commit d2a600d5c0ab8a068cb23bdd422645d8b1a39f0b
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-03-13T08:47:45Z

    add sliding to rdd

commit 5ee6001471b1897400fef1e35b5e10fbfb47395f
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-03-13T18:49:04Z

    add TODO

commit 65461b21b012c8688d2747a039a721fb859bf9d3
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-03-13T20:14:22Z

    Merge branch 'sliding' into auc

commit c1c6c2228a446ed42bf4382d4703309865f6dc54
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-03-13T20:47:11Z

    add AreaUnderCurve

commit 284d991cf8c79a1ef7db79a9caa35a238e02338a
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-03-15T17:12:41Z

    change SlidedRDD to SlidingRDD

commit 9916202e0c6bc9d183bc35f3f16302bb7fbbb644
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-03-15T17:46:35Z

    change RDD.sliding return type to RDD[Seq[T]]

commit db6cb30da9ef7ce5ca473f32e709aedb2eeabc34
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-03-15T17:59:13Z

    remove unnecessary toSeq

commit cab9a52349a7ffcefeae7660836a6ea1b77d910f
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-03-15T18:06:32Z

    use last for the last element

commit a9b250a22e61192fd7c90b936b5eb798d1a5039e
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-03-22T00:52:44Z

    move sliding to mllib

commit a92086513c976479b1b68255967a72bd4af8f5c2
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-03-31T21:26:44Z

    Merge branch 'sliding' into auc

commit 221ebced1b36b0b625ce1bc19316f310a7e9f44c
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-03-31T22:03:08Z

    add a new test to sliding

commit aa7e278d589fb342dd505c23b35a789eb1f7ed55
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-03-31T22:30:25Z

    add initial version of binary classification evaluator

commit dda82d5253f448b3e3f37ba712d420fe942efd26
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-04-08T22:51:51Z

    add confusion matrix

commit 8f78958cf366ae2bdecbf987bfa6f23d29c36c71
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-04-08T23:29:53Z

    add PredictionAndResponse

commit 3d71525d05ef3b5619c9af8d436ec585d648c1c9
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-04-09T01:12:39Z

    move binary evalution classes to evaluation.binary

commit ca31da590e25a8b18e347534a07b5e8392e1036e
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-04-09T01:13:20Z

    remove PredictionAndResponse

commit 9dc35182725c8dca5293cee7ab7dccca9a258c06
Author: Xiangrui Meng <me...@databricks.com>
Date:   2014-04-09T02:16:52Z

    add tests for BinaryClassificationEvaluator

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/364#discussion_r11470460
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/binary/BinaryConfusionMatrix.scala ---
    @@ -0,0 +1,41 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation.binary
    +
    +/**
    + * Trait for a binary confusion matrix.
    + */
    +private[evaluation] trait BinaryConfusionMatrix {
    +  /** number of true positives */
    +  def tp: Long
    --- End diff --
    
    Are these really standard abbreviations or do you think it would be better to spell these out? For stuff like "roc" I can see it being pretty standard.



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40053686
  
    Jenkins, test this please. (sorry I rebooted jenkins).



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40059906
  
    Merged build finished. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40149815
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40062918
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/364#discussion_r11470216
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/binary/BinaryClassificationMetrics.scala ---
    @@ -0,0 +1,204 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation.binary
    +
    +import org.apache.spark.rdd.{UnionRDD, RDD}
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.evaluation.AreaUnderCurve
    +import org.apache.spark.Logging
    +
    +/**
    + * Implementation of [[org.apache.spark.mllib.evaluation.binary.BinaryConfusionMatrix]].
    + *
    + * @param count label counter for labels with scores greater than or equal to the current score
    + * @param totalCount label counter for all labels
    + */
    +private case class BinaryConfusionMatrixImpl(
    +    count: LabelCounter,
    +    totalCount: LabelCounter) extends BinaryConfusionMatrix with Serializable {
    +
    +  /** number of true positives */
    +  override def tp: Long = count.numPositives
    +
    +  /** number of false positives */
    +  override def fp: Long = count.numNegatives
    +
    +  /** number of false negatives */
    +  override def fn: Long = totalCount.numPositives - count.numPositives
    +
    +  /** number of true negatives */
    +  override def tn: Long = totalCount.numNegatives - count.numNegatives
    +
    +  /** number of positives */
    +  override def p: Long = totalCount.numPositives
    +
    +  /** number of negatives */
    +  override def n: Long = totalCount.numNegatives
    +}
    +
    +/**
    + * Evaluator for binary classification.
    + *
    + * @param scoreAndLabels an RDD of (score, label) pairs.
    + */
    +class BinaryClassificationMetrics(scoreAndLabels: RDD[(Double, Double)])
    +    extends Serializable with Logging {
    +
    +  private lazy val (
    +      cumCounts: RDD[(Double, LabelCounter)],
    --- End diff --
    
    Probably want to call this cumulativeCounts to make it a bit clearer
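    To make the role of the cumulative counts concrete, here is a small local-collection sketch of how a confusion matrix per score threshold can be derived from them, following the `@param` docs in the diff above ("labels with scores greater than or equal to the current score"). `Counts` is a hypothetical stand-in for the PR's `LabelCounter`, the other names are illustrative, and treating a label greater than 0.5 as positive is an assumption of this sketch, not something the thread states.

```scala
// Sketch only: confusion matrices from cumulative label counts, on a
// local Seq instead of an RDD. All names are hypothetical.
object ConfusionMatrixSketch {

  /** Hypothetical stand-in for the PR's LabelCounter. */
  final case class Counts(numPositives: Long, numNegatives: Long)

  /** Mirrors the arithmetic of BinaryConfusionMatrixImpl shown in the diff. */
  final case class Confusion(count: Counts, total: Counts) {
    def numTruePositives: Long = count.numPositives
    def numFalsePositives: Long = count.numNegatives
    def numFalseNegatives: Long = total.numPositives - count.numPositives
    def numTrueNegatives: Long = total.numNegatives - count.numNegatives
  }

  /**
   * One confusion matrix per distinct score, highest score first.
   * Assumption: a label greater than 0.5 counts as positive.
   */
  def byThreshold(scoreAndLabels: Seq[(Double, Double)]): Seq[(Double, Confusion)] = {
    // Count positives and negatives for each distinct score, in descending score order.
    val perScore = scoreAndLabels.groupBy(_._1).toSeq.sortBy(-_._1).map {
      case (score, pairs) =>
        val pos = pairs.count(_._2 > 0.5).toLong
        (score, Counts(pos, pairs.size - pos))
    }
    val total = Counts(
      perScore.map(_._2.numPositives).sum,
      perScore.map(_._2.numNegatives).sum)
    // Running sums: counts of labels with score >= the current score.
    val cumulative = perScore.scanLeft(Counts(0L, 0L)) { case (acc, (_, c)) =>
      Counts(acc.numPositives + c.numPositives, acc.numNegatives + c.numNegatives)
    }.tail
    perScore.map(_._1).zip(cumulative).map { case (s, c) => (s, Confusion(c, total)) }
  }
}
```

    Everything scored at or above the threshold is predicted positive, so the cumulative counter gives the true and false positives directly, and the remaining two cells follow by subtracting from the totals.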



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-39925665
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/364#discussion_r11516544
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/binary/BinaryClassificationMetrics.scala ---
    @@ -0,0 +1,204 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation.binary
    +
    +import org.apache.spark.rdd.{UnionRDD, RDD}
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.evaluation.AreaUnderCurve
    +import org.apache.spark.Logging
    +
    +/**
    + * Implementation of [[org.apache.spark.mllib.evaluation.binary.BinaryConfusionMatrix]].
    + *
    + * @param count label counter for labels with scores greater than or equal to the current score
    + * @param totalCount label counter for all labels
    + */
    +private case class BinaryConfusionMatrixImpl(
    +    count: LabelCounter,
    +    totalCount: LabelCounter) extends BinaryConfusionMatrix with Serializable {
    +
    +  /** number of true positives */
    +  override def numTruePositives: Long = count.numPositives
    --- End diff --
    
    `truePositives` is shorter, but it does not convey the exact meaning (a count). Similarly, I prefer `numCols` over `cols` for a matrix.



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40059907
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13994/



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/364



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40149797
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-39931004
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13929/



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-39923102
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13921/



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40168463
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40152553
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14020/



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-39931002
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40053172
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40168452
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-39922970
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40060058
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] [WIP] Add AreaUnder...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-39927111
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13924/



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40149474
  
    Jenkins, test this please



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40152551
  
    Merged build finished. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-39925657
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40169825
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-39928692
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40053180
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40242267
  
    Thanks Xiangrui! Merged into both master and branch-1.0.



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] [WIP] Add AreaUnder...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-39927110
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/364#discussion_r11509387
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/binary/BinaryClassificationMetrics.scala ---
    @@ -0,0 +1,204 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation.binary
    +
    +import org.apache.spark.rdd.{UnionRDD, RDD}
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.evaluation.AreaUnderCurve
    +import org.apache.spark.Logging
    +
    +/**
    + * Implementation of [[org.apache.spark.mllib.evaluation.binary.BinaryConfusionMatrix]].
    + *
    + * @param count label counter for labels with scores greater than or equal to the current score
    + * @param totalCount label counter for all labels
    + */
    +private case class BinaryConfusionMatrixImpl(
    +    count: LabelCounter,
    +    totalCount: LabelCounter) extends BinaryConfusionMatrix with Serializable {
    +
    +  /** number of true positives */
    +  override def numTruePositives: Long = count.numPositives
    --- End diff --
    
    Just a minor question: do you want to call these `numTruePositives` or just `truePositives`? Anyway, I'm happy to merge it as is; I just felt `truePositives` would be shorter.



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-39923101
  
    Merged build finished. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/364#discussion_r11470221
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/binary/BinaryClassificationMetrics.scala ---
    @@ -0,0 +1,204 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation.binary
    +
    +import org.apache.spark.rdd.{UnionRDD, RDD}
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.evaluation.AreaUnderCurve
    +import org.apache.spark.Logging
    +
    +/**
    + * Implementation of [[org.apache.spark.mllib.evaluation.binary.BinaryConfusionMatrix]].
    + *
    + * @param count label counter for labels with scores greater than or equal to the current score
    + * @param totalCount label counter for all labels
    + */
    +private case class BinaryConfusionMatrixImpl(
    +    count: LabelCounter,
    +    totalCount: LabelCounter) extends BinaryConfusionMatrix with Serializable {
    +
    +  /** number of true positives */
    +  override def tp: Long = count.numPositives
    +
    +  /** number of false positives */
    +  override def fp: Long = count.numNegatives
    +
    +  /** number of false negatives */
    +  override def fn: Long = totalCount.numPositives - count.numPositives
    +
    +  /** number of true negatives */
    +  override def tn: Long = totalCount.numNegatives - count.numNegatives
    +
    +  /** number of positives */
    +  override def p: Long = totalCount.numPositives
    +
    +  /** number of negatives */
    +  override def n: Long = totalCount.numNegatives
    +}
    +
    +/**
    + * Evaluator for binary classification.
    + *
    + * @param scoreAndLabels an RDD of (score, label) pairs.
    + */
    +class BinaryClassificationMetrics(scoreAndLabels: RDD[(Double, Double)])
    +    extends Serializable with Logging {
    +
    +  private lazy val (
    +      cumCounts: RDD[(Double, LabelCounter)],
    +      confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
    +    // Create a bin for each distinct score value, count positives and negatives within each bin,
    +    // and then sort by score values in descending order.
    +    val counts = scoreAndLabels.combineByKey(
    +      createCombiner = (label: Double) => new LabelCounter(0L, 0L) += label,
    +      mergeValue = (c: LabelCounter, label: Double) => c += label,
    +      mergeCombiners = (c1: LabelCounter, c2: LabelCounter) => c1 += c2
    +    ).sortByKey(ascending = false)
    +    val agg = counts.values.mapPartitions({ iter =>
    +      val agg = new LabelCounter()
    +      iter.foreach(agg += _)
    +      Iterator(agg)
    +    }, preservesPartitioning = true).collect()
    +    val partitionwiseCumCounts =
    +      agg.scanLeft(new LabelCounter())((agg: LabelCounter, c: LabelCounter) => agg.clone() += c)
    --- End diff --
    
    Actually there was one ... but scalastyle doesn't allow `def +(` because it asks for a space after `+`.
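
For context, the quoted `scanLeft` line calls `agg.clone() += c` because `+=` mutates its receiver; without the copy, every element of the scan result would alias the same counter. A minimal sketch of that behavior, assuming a hypothetical `Counter` class in place of the PR's `LabelCounter` (whose definition is not shown in this thread) and a hypothetical `copy()` method in place of `clone()`:

```scala
// Hypothetical stand-in for LabelCounter; field and method names are assumptions.
class Counter(var numPositives: Long = 0L, var numNegatives: Long = 0L) {
  // In-place merge: mutates this counter and returns it.
  def +=(other: Counter): Counter = {
    numPositives += other.numPositives
    numNegatives += other.numNegatives
    this
  }

  // Independent copy, so a running total can be snapshotted before mutation.
  def copy(): Counter = new Counter(numPositives, numNegatives)
}

object ScanDemo {
  // One aggregated counter per partition, as collected on the driver.
  val perPartition: Seq[Counter] =
    Seq(new Counter(2L, 1L), new Counter(0L, 3L), new Counter(1L, 1L))

  // Copy the accumulator before each merge so earlier totals stay untouched.
  val cumulative: Seq[Counter] =
    perPartition.scanLeft(new Counter())((acc, c) => acc.copy() += c)
}
```

Here `ScanDemo.cumulative` holds four distinct counters: the empty start value followed by one running total per partition.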



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40053905
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40053912
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40059961
  
    Jenkins, test this please.



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-39928696
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40060068
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40053227
  
    Merged build finished. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/364#discussion_r11470192
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/binary/BinaryClassificationMetrics.scala ---
    @@ -0,0 +1,204 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation.binary
    +
    +import org.apache.spark.rdd.{UnionRDD, RDD}
    +import org.apache.spark.SparkContext._
    +import org.apache.spark.mllib.evaluation.AreaUnderCurve
    +import org.apache.spark.Logging
    +
    +/**
    + * Implementation of [[org.apache.spark.mllib.evaluation.binary.BinaryConfusionMatrix]].
    + *
    + * @param count label counter for labels with scores greater than or equal to the current score
    + * @param totalCount label counter for all labels
    + */
    +private case class BinaryConfusionMatrixImpl(
    +    count: LabelCounter,
    +    totalCount: LabelCounter) extends BinaryConfusionMatrix with Serializable {
    +
    +  /** number of true positives */
    +  override def tp: Long = count.numPositives
    +
    +  /** number of false positives */
    +  override def fp: Long = count.numNegatives
    +
    +  /** number of false negatives */
    +  override def fn: Long = totalCount.numPositives - count.numPositives
    +
    +  /** number of true negatives */
    +  override def tn: Long = totalCount.numNegatives - count.numNegatives
    +
    +  /** number of positives */
    +  override def p: Long = totalCount.numPositives
    +
    +  /** number of negatives */
    +  override def n: Long = totalCount.numNegatives
    +}
    +
    +/**
    + * Evaluator for binary classification.
    + *
    + * @param scoreAndLabels an RDD of (score, label) pairs.
    + */
    +class BinaryClassificationMetrics(scoreAndLabels: RDD[(Double, Double)])
    +    extends Serializable with Logging {
    +
    +  private lazy val (
    +      cumCounts: RDD[(Double, LabelCounter)],
    +      confusions: RDD[(Double, BinaryConfusionMatrix)]) = {
    +    // Create a bin for each distinct score value, count positives and negatives within each bin,
    +    // and then sort by score values in descending order.
    +    val counts = scoreAndLabels.combineByKey(
    +      createCombiner = (label: Double) => new LabelCounter(0L, 0L) += label,
    +      mergeValue = (c: LabelCounter, label: Double) => c += label,
    +      mergeCombiners = (c1: LabelCounter, c2: LabelCounter) => c1 += c2
    +    ).sortByKey(ascending = false)
    +    val agg = counts.values.mapPartitions({ iter =>
    +      val agg = new LabelCounter()
    +      iter.foreach(agg += _)
    +      Iterator(agg)
    +    }, preservesPartitioning = true).collect()
    +    val partitionwiseCumCounts =
    +      agg.scanLeft(new LabelCounter())((agg: LabelCounter, c: LabelCounter) => agg.clone() += c)
    --- End diff --
    
    This would probably be clearer if LabelCounter had a `+` method that always returned a new object, so you could do `agg.scanLeft(new LabelCounter)(_ + _)`. But maybe it's okay as is.
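
The suggestion can be sketched as follows. This is only an illustration under stated assumptions: `Counts` is a hypothetical immutable counter, not the PR's `LabelCounter`, and its field names are borrowed from the quoted diff.

```scala
// Hypothetical immutable counter used only for illustration.
final case class Counts(numPositives: Long = 0L, numNegatives: Long = 0L) {
  // Returns a new object; neither operand is modified.
  def +(other: Counts): Counts =
    Counts(numPositives + other.numPositives, numNegatives + other.numNegatives)
}

object PlusDemo {
  // One aggregated counter per partition, as collected on the driver.
  val perPartition: Seq[Counts] = Seq(Counts(2L, 1L), Counts(0L, 3L), Counts(1L, 1L))

  // No explicit copy is needed because + never mutates its operands.
  val cumulative: Seq[Counts] = perPartition.scanLeft(Counts())(_ + _)
}
```

The trade-off is one extra allocation per merge, which should be negligible for a driver-side array with one element per partition.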



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40168409
  
    The test failure was due to nondeterministic behavior in RDDSuite, which is fixed in https://github.com/apache/spark/pull/387 .



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/364#discussion_r11470702
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/binary/BinaryConfusionMatrix.scala ---
    @@ -0,0 +1,41 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.mllib.evaluation.binary
    +
    +/**
    + * Trait for a binary confusion matrix.
    + */
    +private[evaluation] trait BinaryConfusionMatrix {
    +  /** number of true positives */
    +  def tp: Long
    --- End diff --
    
    Good question. I see many places use TP/FP/FN/TN, but I always need to translate the acronyms back to their full names in order to understand them. Will switch to full names.
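
A rough sketch of what the full-name version might look like, assuming the `num`-prefixed names from the other review comment. The trait and class names here are hypothetical, a tuple of (positives, negatives) replaces `LabelCounter`, and the final API may differ.

```scala
// Hypothetical trait using full names instead of tp/fp/fn/tn.
trait ConfusionMatrix {
  def numTruePositives: Long
  def numFalsePositives: Long
  def numFalseNegatives: Long
  def numTrueNegatives: Long
  def numPositives: Long = numTruePositives + numFalseNegatives
  def numNegatives: Long = numFalsePositives + numTrueNegatives
}

// count: labels with score >= the current threshold; total: all labels.
// Each pair is (positives, negatives).
final case class ThresholdConfusion(count: (Long, Long), total: (Long, Long))
    extends ConfusionMatrix {
  def numTruePositives: Long = count._1
  def numFalsePositives: Long = count._2
  def numFalseNegatives: Long = total._1 - count._1
  def numTrueNegatives: Long = total._2 - count._2
}

object ConfusionDemo {
  val m = ThresholdConfusion(count = (3L, 1L), total = (5L, 5L))
  // Metrics read naturally from the full names.
  val precision: Double =
    m.numTruePositives.toDouble / (m.numTruePositives + m.numFalsePositives)
  val recall: Double = m.numTruePositives.toDouble / m.numPositives
}
```

With full names, a formula such as precision can be checked against its definition without decoding acronyms.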



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40169829
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14047/



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40168414
  
    Jenkins, retest this please.



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-39922959
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/364#issuecomment-40062921
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13997/

