Posted to reviews@spark.apache.org by hhbyyh <gi...@git.apache.org> on 2016/03/25 09:04:28 UTC

[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

GitHub user hhbyyh opened a pull request:

    https://github.com/apache/spark/pull/11954

    [SPARK-14154] [MLlib] Simplify the implementation for Kolmogorov–Smirnov test

    ## What changes were proposed in this pull request?
    jira: https://issues.apache.org/jira/browse/SPARK-14154
    
    I read the code for KolmogorovSmirnovTest and found that it can be simplified considerably by following the original definition of the statistic.
    
    Sending this PR for discussion.
    
    ## How was this patch tested?
    Unit tests.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hhbyyh/spark ksoptimize

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11954.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11954
    
----
commit c7dcb97a4addc0739f71813b39551796155c4225
Author: Yuhao Yang <hh...@gmail.com>
Date:   2016-03-25T07:56:08Z

    simplify ks

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11954#issuecomment-202285590
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

Posted by hhbyyh <gi...@git.apache.org>.
Github user hhbyyh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11954#discussion_r57514823
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/KolmogorovSmirnovTest.scala ---
    @@ -64,11 +64,10 @@ private[stat] object KolmogorovSmirnovTest extends Logging {
        */
       def testOneSample(data: RDD[Double], cdf: Double => Double): KolmogorovSmirnovTestResult = {
         val n = data.count().toDouble
    -    val localData = data.sortBy(x => x).mapPartitions { part =>
    -      val partDiffs = oneSampleDifferences(part, n, cdf) // local distances
    -      searchOneSampleCandidates(partDiffs) // candidates: local extrema
    -    }.collect()
    -    val ksStat = searchOneSampleStatistic(localData, n) // result: global extreme
    +    val ksStat = data.sortBy(x => x).zipWithIndex().map { case (v, i) =>
    +      val f = cdf(v)
    +      math.max(f - i.toDouble / n, (i + 1).toDouble / n - f)
    --- End diff --
    
    Thanks, Sean. I didn't notice that `n` is already a `Double`. I will change that.




[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/11954




[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11954#issuecomment-201206277
  
    **[Test build #54161 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54161/consoleFull)** for PR 11954 at commit [`c7dcb97`](https://github.com/apache/spark/commit/c7dcb97a4addc0739f71813b39551796155c4225).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11954#discussion_r57441797
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/KolmogorovSmirnovTest.scala ---
    @@ -64,11 +64,10 @@ private[stat] object KolmogorovSmirnovTest extends Logging {
        */
       def testOneSample(data: RDD[Double], cdf: Double => Double): KolmogorovSmirnovTestResult = {
         val n = data.count().toDouble
    -    val localData = data.sortBy(x => x).mapPartitions { part =>
    -      val partDiffs = oneSampleDifferences(part, n, cdf) // local distances
    -      searchOneSampleCandidates(partDiffs) // candidates: local extrema
    -    }.collect()
    -    val ksStat = searchOneSampleStatistic(localData, n) // result: global extreme
    +    val ksStat = data.sortBy(x => x).zipWithIndex().map { case (v, i) =>
    +      val f = cdf(v)
    +      math.max(f - i.toDouble / n, (i + 1).toDouble / n - f)
    --- End diff --
    
    You don't need `toDouble` if `n` is already a `Double`. It looks like the first element you compute here has an opposite sign to what was there before. Am I missing something or is that change unintentional?
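
    For readers following the two points above, here is a minimal local sketch of the one-sample statistic in plain Scala, without Spark. It is not the code from the PR: the object name `KsSketch`, the method `ksStatistic`, and the uniform CDF are assumptions made for illustration. With a 0-based rank `i` in the sorted sample, the empirical CDF is `i / n` just below the value and `(i + 1) / n` at the value, so both terms are written as non-negative-going gaps, and no `toDouble` is needed because `n` is already a `Double`.

```scala
object KsSketch {
  /** One-sample KS statistic for a local, non-empty sample; `cdf` is the hypothesized CDF. */
  def ksStatistic(sample: Seq[Double], cdf: Double => Double): Double = {
    require(sample.nonEmpty, "sample must be non-empty")
    // n is a Double, so `i / n` below is floating-point division without an explicit toDouble
    val n = sample.size.toDouble
    sample.sorted.zipWithIndex.map { case (v, i) =>
      val f = cdf(v)
      // f - i / n: gap just below v; (i + 1) / n - f: gap at v
      math.max(f - i / n, (i + 1) / n - f)
    }.max
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical example: CDF of the uniform distribution on [0, 1]
    val uniformCdf: Double => Double = x => math.min(1.0, math.max(0.0, x))
    println(ksStatistic(Seq(0.1, 0.4, 0.7), uniformCdf))
  }
}
```

    In the Spark version the index from `zipWithIndex()` on an RDD is a `Long` rather than an `Int`, but dividing it by a `Double` widens it in the same way.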




[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11954#issuecomment-201193608
  
    **[Test build #54161 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54161/consoleFull)** for PR 11954 at commit [`c7dcb97`](https://github.com/apache/spark/commit/c7dcb97a4addc0739f71813b39551796155c4225).




[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11954#issuecomment-202285595
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54297/
    Test PASSed.




[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11954#issuecomment-202013872
  
    **[Test build #54284 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54284/consoleFull)** for PR 11954 at commit [`fa94568`](https://github.com/apache/spark/commit/fa94568671cff1c82dc51314517e0f9177a7920a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/11954#issuecomment-202981052
  
    Merged to master




[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11954#issuecomment-202013902
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11954#issuecomment-202285200
  
    **[Test build #54297 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54297/consoleFull)** for PR 11954 at commit [`fa94568`](https://github.com/apache/spark/commit/fa94568671cff1c82dc51314517e0f9177a7920a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11954#issuecomment-201206793
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11954#issuecomment-202005705
  
    **[Test build #54284 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54284/consoleFull)** for PR 11954 at commit [`fa94568`](https://github.com/apache/spark/commit/fa94568671cff1c82dc51314517e0f9177a7920a).




[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11954#issuecomment-201206798
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54161/
    Test PASSed.




[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11954#issuecomment-202013903
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54284/
    Test PASSed.




[GitHub] spark pull request: [SPARK-14154] [MLlib] Simplify the implementat...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11954#issuecomment-202269179
  
    **[Test build #54297 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54297/consoleFull)** for PR 11954 at commit [`fa94568`](https://github.com/apache/spark/commit/fa94568671cff1c82dc51314517e0f9177a7920a).

