You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by dorx <gi...@git.apache.org> on 2014/08/01 05:14:39 UTC

[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

GitHub user dorx opened a pull request:

    https://github.com/apache/spark/pull/1710

    [SPARK-2782][mllib] Bug fix for getRanks in SpearmanCorrelation

    getRanks computes the wrong rank when numPartition >= size in the input RDDs before this patch. added units to address this bug.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dorx/spark correlationBug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1710.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1710
    
----
commit 043ff8327bd3520319e3615edc83b4bb435b301d
Author: Doris Xin <do...@gmail.com>
Date:   2014-08-01T03:03:53Z

    bug fix for spearman corner case
    
    where numPartition >= size in the input RDDs

commit 31db920b667e30d3043469f183b03aabcdaf25d6
Author: Doris Xin <do...@gmail.com>
Date:   2014-08-01T03:11:00Z

    revert unnecessary change

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1710#discussion_r15681334
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala ---
    @@ -89,20 +89,17 @@ private[stat] object SpearmanCorrelation extends Correlation with Logging {
         val ranks: RDD[(Long, Double)] = sorted.mapPartitions { iter =>
           // add an extra element to signify the end of the list so that flatMap can flush the last
           // batch of duplicates
    -      val padded = iter ++
    -        Iterator[((Double, Long), Long)](((Double.NaN, -1L), -1L))
    -      var lastVal = 0.0
    -      var firstRank = 0.0
    -      val idBuffer = new ArrayBuffer[Long]()
    +      val padded = iter ++ Iterator[((Double, Long), Long)](((Double.NaN, -1L), -1L))
    +      val firstEntry = padded.next()
    +      var lastVal = firstEntry._1._1
    +      var firstRank = firstEntry._2.toDouble
    +      val idBuffer = new ArrayBuffer[Long]() += firstEntry._1._2
           padded.flatMap { case ((v, id), rank) =>
    -        if (v  == lastVal && id != Long.MinValue) {
    +        if (v == lastVal && id != Long.MinValue) {
    --- End diff --
    
    Where is `Long.MinValue` used?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1710#issuecomment-50847713
  
    LGTM. Merged into master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1710#discussion_r15681299
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala ---
    @@ -89,20 +89,17 @@ private[stat] object SpearmanCorrelation extends Correlation with Logging {
         val ranks: RDD[(Long, Double)] = sorted.mapPartitions { iter =>
           // add an extra element to signify the end of the list so that flatMap can flush the last
           // batch of duplicates
    -      val padded = iter ++
    -        Iterator[((Double, Long), Long)](((Double.NaN, -1L), -1L))
    -      var lastVal = 0.0
    -      var firstRank = 0.0
    -      val idBuffer = new ArrayBuffer[Long]()
    +      val padded = iter ++ Iterator[((Double, Long), Long)](((Double.NaN, -1L), -1L))
    +      val firstEntry = padded.next()
    +      var lastVal = firstEntry._1._1
    +      var firstRank = firstEntry._2.toDouble
    +      val idBuffer = new ArrayBuffer[Long]() += firstEntry._1._2
    --- End diff --
    
    `ArrayBuffer(firstEntry._1._2)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/1710#issuecomment-50845244
  
    Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1710#issuecomment-50846631
  
    QA tests have started for PR 1710. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17637/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

Posted by dorx <gi...@git.apache.org>.
Github user dorx commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1710#discussion_r15681388
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/SpearmanCorrelation.scala ---
    @@ -89,20 +89,17 @@ private[stat] object SpearmanCorrelation extends Correlation with Logging {
         val ranks: RDD[(Long, Double)] = sorted.mapPartitions { iter =>
           // add an extra element to signify the end of the list so that flatMap can flush the last
           // batch of duplicates
    -      val padded = iter ++
    -        Iterator[((Double, Long), Long)](((Double.NaN, -1L), -1L))
    -      var lastVal = 0.0
    -      var firstRank = 0.0
    -      val idBuffer = new ArrayBuffer[Long]()
    +      val padded = iter ++ Iterator[((Double, Long), Long)](((Double.NaN, -1L), -1L))
    +      val firstEntry = padded.next()
    +      var lastVal = firstEntry._1._1
    +      var firstRank = firstEntry._2.toDouble
    +      val idBuffer = new ArrayBuffer[Long]() += firstEntry._1._2
           padded.flatMap { case ((v, id), rank) =>
    -        if (v  == lastVal && id != Long.MinValue) {
    +        if (v == lastVal && id != Long.MinValue) {
    --- End diff --
    
    oops that's supposed to be `-1L`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1710


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

Posted by dorx <gi...@git.apache.org>.
Github user dorx commented on the pull request:

    https://github.com/apache/spark/pull/1710#issuecomment-50845017
  
    @mengxr I'd really appreciate it if we can get this merged ASAP so I can send out my python correlation PR before the code freeze. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1710#issuecomment-50845388
  
    QA tests have started for PR 1710. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17635/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1710#discussion_r15681269
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala ---
    @@ -55,20 +55,24 @@ object Statistics {
     
       /**
        * Compute the Pearson correlation for the input RDDs.
    -   * Columns with 0 covariance produce NaN entries in the correlation matrix.
    +   * Returns NaN if either vector has 0 variance.
    +   *
    +   * Note: the two input RDDs need to have the same number of partitions.
    --- End diff --
    
    have the *same number of partitions* and the *same number of elements in each partition*


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1710#issuecomment-50847511
  
    QA results for PR 1710:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17635/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-2782][mllib] Bug fix for getRanks in Sp...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1710#issuecomment-50848506
  
    QA results for PR 1710:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17637/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---