
Posted to reviews@spark.apache.org by srowen <gi...@git.apache.org> on 2016/05/01 08:48:03 UTC

[GitHub] spark pull request: [SPARK-15043] [MLLIB] Fix and re-enable flaky ...

GitHub user srowen opened a pull request:

    https://github.com/apache/spark/pull/12821

    [SPARK-15043] [MLLIB] Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr

    ## What changes were proposed in this pull request?
    
    Following https://github.com/apache/spark/pull/12779, this test became flaky. The mean used in the covariance calculation is now computed with the standard and slightly more accurate `MultivariateOnlineSummarizer`. However, I think that because it uses `treeAggregate` internally, the order of summation can differ between runs, which gives very slightly different results from run to run.
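    
    As a minimal illustration of why summation order matters (a plain-Java sketch, not Spark code; the class and method names are hypothetical), floating-point addition is not associative, so grouping the same three values differently can change the last bits of the sum:
    
    ```java
    public class SummationOrder {
        // Adds the three values with the grouping (a + b) + c.
        static double leftFirst(double a, double b, double c) {
            return (a + b) + c;
        }

        // Adds the same three values with the grouping a + (b + c).
        static double rightFirst(double a, double b, double c) {
            return a + (b + c);
        }

        public static void main(String[] args) {
            double x = leftFirst(0.1, 0.2, 0.3);
            double y = rightFirst(0.1, 0.2, 0.3);
            System.out.println(x == y);          // false
            System.out.println(Math.abs(x - y)); // a tiny but nonzero difference
        }
    }
    ```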
    
    The immediate fix for the test, which asserts equality, is to use 1 partition. We can find out if that appears to be robust.
    
    More generally, it's an interesting question whether we want `MultivariateOnlineSummarizer` to be deterministic. I'm not sure if another aggregation method would provide more guarantees of this, in theory or practice.
    
    
    ## How was this patch tested?
    
    Existing Java stats suite.
    
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-15043

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12821.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12821
    
----
commit f755850a0bbcae48118c783a7ca643649deef327
Author: Sean Owen <so...@cloudera.com>
Date:   2016-05-01T08:43:13Z

    Use 1 partition for simple Java stats test

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-15043] [MLLIB] Fix and re-enable flaky ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12821#issuecomment-216030597
  
    **[Test build #57475 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57475/consoleFull)** for PR 12821 at commit [`f755850`](https://github.com/apache/spark/commit/f755850a0bbcae48118c783a7ca643649deef327).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-15043] [MLLIB] Fix and re-enable flaky ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12821#issuecomment-216030627
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57475/
    Test PASSed.



[GitHub] spark pull request: [SPARK-15043] [MLLIB] Fix and re-enable flaky ...

Posted by mengxr <gi...@git.apache.org>.

Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/12821#issuecomment-216119750
  
    @srowen If we only use 1 partition, the test doesn't touch the `treeAggregate` code path, so the coverage is insufficient. I think we should use multiple partitions and test equality with a small tolerance.
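    
    A tolerance-based comparison along these lines could look like the following plain-Java sketch. The helper name `approxEquals` and the tolerance `1e-9` are assumptions chosen for illustration, not values from the actual test; in a JUnit 4 test, `Assert.assertEquals(expected, actual, delta)` performs the same kind of check.
    
    ```java
    public class ApproxEquals {
        // Returns true when the two values differ by at most the absolute tolerance.
        static boolean approxEquals(double expected, double actual, double tolerance) {
            return Math.abs(expected - actual) <= tolerance;
        }

        public static void main(String[] args) {
            // The same three values summed with two different groupings.
            double a = (0.1 + 0.2) + 0.3;
            double b = 0.1 + (0.2 + 0.3);
            System.out.println(a == b);                   // false
            System.out.println(approxEquals(a, b, 1e-9)); // true
        }
    }
    ```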



[GitHub] spark pull request: [SPARK-15043] [MLLIB] Fix and re-enable flaky ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen closed the pull request at:

    https://github.com/apache/spark/pull/12821



[GitHub] spark pull request: [SPARK-15043] [MLLIB] Fix and re-enable flaky ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/12821#issuecomment-216026251
  
    **[Test build #57475 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57475/consoleFull)** for PR 12821 at commit [`f755850`](https://github.com/apache/spark/commit/f755850a0bbcae48118c783a7ca643649deef327).



[GitHub] spark pull request: [SPARK-15043] [MLLIB] Fix and re-enable flaky ...

Posted by JoshRosen <gi...@git.apache.org>.

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/12821#issuecomment-216058335
  
    /cc @mengxr for review, since @jkbradley mentioned that you might have fixed the flakiness via a separate patch.



[GitHub] spark pull request: [SPARK-15043] [MLLIB] Fix and re-enable flaky ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/12821#issuecomment-216225226
  
    Sounds good. The choice is either to make the test so simple that it is deterministic, or to tolerate tiny variation.



[GitHub] spark pull request: [SPARK-15043] [MLLIB] Fix and re-enable flaky ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/12821#issuecomment-216045431
  
    FWIW you can consistently reproduce the differing results with:
    
    ```
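    import org.apache.spark.mllib.stat.Statistics
    // Run in spark-shell, where `sc` is the provided SparkContext.
    val x = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
    val y = sc.parallelize(Seq(1.1, 2.2, 3.1, 4.3))
    // More than one distinct value means the result differs between runs.
    (0 to 10).map(_ => Statistics.corr(x, y)).distinct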
    ```
    
    With 1 partition the result is always the same. The result is the same if I use `aggregate` instead of `treeAggregate`.
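    
    The effect can also be explored without Spark. The sketch below is a plain-Java stand-in under stated assumptions: it computes a two-pass Pearson correlation over the same data while visiting the samples in every possible order, and collects the distinct results. The class and method names are hypothetical, and the index ordering only imitates a changing aggregation order; it is not how Spark computes the statistic. A fixed order always gives the same value, while different orders may differ in the last bits.
    
    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeSet;

    public class CorrOrderDemo {
        // Pearson correlation, visiting the paired samples in the given index order.
        static double corr(double[] x, double[] y, int[] order) {
            int n = order.length;
            double sumX = 0.0, sumY = 0.0;
            for (int i : order) {
                sumX += x[i];
                sumY += y[i];
            }
            double meanX = sumX / n, meanY = sumY / n;
            double sxy = 0.0, sxx = 0.0, syy = 0.0;
            for (int i : order) {
                double dx = x[i] - meanX, dy = y[i] - meanY;
                sxy += dx * dy;
                sxx += dx * dx;
                syy += dy * dy;
            }
            return sxy / Math.sqrt(sxx * syy);
        }

        // Appends every permutation of idx[k..] to out (swap-based recursion).
        static void permute(int[] idx, int k, List<int[]> out) {
            if (k == idx.length) {
                out.add(idx.clone());
                return;
            }
            for (int i = k; i < idx.length; i++) {
                int t = idx[k]; idx[k] = idx[i]; idx[i] = t;
                permute(idx, k + 1, out);
                t = idx[k]; idx[k] = idx[i]; idx[i] = t;
            }
        }

        // Returns the sorted set of distinct correlations over all visiting orders.
        static TreeSet<Double> distinctResults(double[] x, double[] y) {
            int[] idx = new int[x.length];
            for (int i = 0; i < idx.length; i++) idx[i] = i;
            List<int[]> orders = new ArrayList<>();
            permute(idx, 0, orders);
            TreeSet<Double> results = new TreeSet<>();
            for (int[] order : orders) results.add(corr(x, y, order));
            return results;
        }

        public static void main(String[] args) {
            double[] x = {1.0, 2.0, 3.0, 4.0};
            double[] y = {1.1, 2.2, 3.1, 4.3};
            // Prints the distinct values; any spread between them is rounding noise.
            System.out.println(distinctResults(x, y));
        }
    }
    ```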



[GitHub] spark pull request: [SPARK-15043] [MLLIB] Fix and re-enable flaky ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12821#issuecomment-216030626
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-15043] [MLLIB] Fix and re-enable flaky ...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/12821#issuecomment-216225537
  
    Oh, this was already addressed in https://github.com/apache/spark/commit/19a6d192d53ce6dffe998ce110adab1f2efcb23e



[GitHub] spark pull request: [SPARK-15043] [MLLIB] Fix and re-enable flaky ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12821#issuecomment-216066630
  
    This patch changed the test to use approx equality: https://github.com/apache/spark/commit/19a6d192d53ce6dffe998ce110adab1f2efcb23e



[GitHub] spark pull request: [SPARK-15043] [MLLIB] Fix and re-enable flaky ...

Posted by jkbradley <gi...@git.apache.org>.

Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/12821#issuecomment-216066867
  
    It would be great to have it be deterministic, but it sounds hard or impossible to ensure beyond numerical precision in general.

