Posted to reviews@spark.apache.org by WeichenXu123 <gi...@git.apache.org> on 2017/08/23 10:58:05 UTC

[GitHub] spark pull request #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateO...

GitHub user WeichenXu123 opened a pull request:

    https://github.com/apache/spark/pull/19029

    [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSummarizer.variance generate negative result

    ## What changes were proposed in this pull request?
    
    Because of numerical error, MultivariateOnlineSummarizer.variance can produce a negative variance.
    **This is a serious bug because many algorithms in MLlib use a stddev computed as sqrt(variance);**
    **a negative variance yields NaN and can crash the whole algorithm.**
    The bug can be reproduced with the following code:
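    ```
        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

        // Four single-sample summarizers holding the same value with different weights
        val summarizer1 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.7)
        val summarizer2 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.4)
        val summarizer3 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.5)
        val summarizer4 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.4)
    
        val summarizer = summarizer1
          .merge(summarizer2)
          .merge(summarizer3)
          .merge(summarizer4)
    
        // The true variance is 0; before this fix the result could be slightly negative
        println(summarizer.variance(0))
    ```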
    This PR fixes the bug in `mllib.stat.MultivariateOnlineSummarizer.variance` and `ml.stat.SummarizerBuffer.variance` (the latter is newly added and has similar logic).
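    
    As a minimal sketch of the failure mode described above (not the code in this PR): the square root of a slightly negative variance is NaN, and clamping at zero first avoids that. The object name, `safeStd`, and the `-1.0e-17` stand-in value are hypothetical and chosen only for illustration.
    ```scala
    object NegativeVarianceDemo {
      // Hypothetical helper: clamp the variance at zero before taking the square root.
      def safeStd(variance: Double): Double = math.sqrt(math.max(variance, 0.0))

      def main(args: Array[String]): Unit = {
        // Stand-in for a variance that rounding error pushed just below zero.
        val tinyNegative = -1.0e-17
        println(math.sqrt(tinyNegative)) // NaN: sqrt of a negative Double
        println(safeStd(tinyNegative))   // 0.0: the clamp removes the negative sign
      }
    }
    ```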
    
    ## How was this patch tested?
    
    Test cases added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark fix_summarizer_var_bug

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19029.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19029
    
----
commit 9c92730bc3588596b348932ea285b12c5a4a77ce
Author: WeichenXu <we...@outlook.com>
Date:   2017-08-23T10:52:56Z

    init pr

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    **[Test build #81167 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81167/testReport)** for PR 19029 at commit [`21e7ff7`](https://github.com/apache/spark/commit/21e7ff7ea65da1c03b32445405d2bd55346db096).


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    **[Test build #81171 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81171/testReport)** for PR 19029 at commit [`c40eba3`](https://github.com/apache/spark/commit/c40eba38d82893d5604aa66ec9037df706da712d).


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    **[Test build #81118 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81118/testReport)** for PR 19029 at commit [`c24292c`](https://github.com/apache/spark/commit/c24292ccad700d39892a576390cff2559c4f3b9a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateO...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19029


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81032/
    Test PASSed.


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81167/
    Test PASSed.


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    **[Test build #81032 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81032/testReport)** for PR 19029 at commit [`9c92730`](https://github.com/apache/spark/commit/9c92730bc3588596b348932ea285b12c5a4a77ce).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    **[Test build #81127 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81127/testReport)** for PR 19029 at commit [`9a47579`](https://github.com/apache/spark/commit/9a47579194f885815b9d298435b7b56a9649da2c).


[GitHub] spark pull request #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateO...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19029#discussion_r135216154
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
    @@ -440,7 +440,7 @@ private[ml] object WeightedLeastSquares {
         /**
          * Weighted population standard deviation of labels.
          */
    -    def bStd: Double = math.sqrt(bbSum / wSum - bBar * bBar)
    +    def bStd: Double = math.sqrt(math.max(bbSum / wSum - bBar * bBar, 0.0))
    --- End diff --
    
    Please add a comment here and below to clarify that we are guarding against negative values caused by numerical error.


[GitHub] spark pull request #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateO...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19029#discussion_r135411403
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
    @@ -439,8 +439,9 @@ private[ml] object WeightedLeastSquares {
     
         /**
          * Weighted population standard deviation of labels.
    +     * We prevent variance from negative value caused by numerical error.
    --- End diff --
    
    I'm not so against this, but this is really an implementation detail and not relevant to the caller. It's a value that is by definition nonnegative.


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81129/
    Test PASSed.


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    **[Test build #81129 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81129/testReport)** for PR 19029 at commit [`56c0d41`](https://github.com/apache/spark/commit/56c0d41f1517a49a935464933a8021008d8a32f7).


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    **[Test build #81127 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81127/testReport)** for PR 19029 at commit [`9a47579`](https://github.com/apache/spark/commit/9a47579194f885815b9d298435b7b56a9649da2c).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81171/
    Test PASSed.


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    Merged to master


[GitHub] spark pull request #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateO...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19029#discussion_r134720152
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
    @@ -438,6 +438,10 @@ private[ml] object SummaryBuilderImpl extends Logging {
             while (i < len) {
               realVariance(i) = (currM2n(i) + deltaMean(i) * deltaMean(i) * weightSum(i) *
                 (totalWeightSum - weightSum(i)) / totalWeightSum) / denominator
    +          // Because of numerical error, it is possible to get negative real variance
    +          if (realVariance(i) < 0.0) {
    --- End diff --
    
    Just use `math.max(0.0, ...)` in the line above? No need to assign it twice.
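    
    A sketch of what that suggestion could look like as a standalone function, using the variable names from the diff above. The wrapper object, the function signature, and passing `denominator` in as a parameter are assumptions made to keep the example self-contained; this is not the actual `Summarizer.scala` code.
    ```scala
    object ClampedVarianceSketch {
      // Computes the per-feature variance and clamps each entry at zero in one assignment.
      def variance(
          currM2n: Array[Double],
          deltaMean: Array[Double],
          weightSum: Array[Double],
          totalWeightSum: Double,
          denominator: Double): Array[Double] = {
        val len = currM2n.length
        val realVariance = Array.ofDim[Double](len)
        var i = 0
        while (i < len) {
          // Same expression as in the diff, wrapped in math.max so a value that
          // rounding error pushed below zero is stored as 0.0.
          realVariance(i) = math.max((currM2n(i) + deltaMean(i) * deltaMean(i) * weightSum(i) *
            (totalWeightSum - weightSum(i)) / totalWeightSum) / denominator, 0.0)
          i += 1
        }
        realVariance
      }
    }
    ```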


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    **[Test build #81167 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81167/testReport)** for PR 19029 at commit [`21e7ff7`](https://github.com/apache/spark/commit/21e7ff7ea65da1c03b32445405d2bd55346db096).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    **[Test build #81171 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81171/testReport)** for PR 19029 at commit [`c40eba3`](https://github.com/apache/spark/commit/c40eba38d82893d5604aa66ec9037df706da712d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    **[Test build #81032 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81032/testReport)** for PR 19029 at commit [`9c92730`](https://github.com/apache/spark/commit/9c92730bc3588596b348932ea285b12c5a4a77ce).


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    **[Test build #81129 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81129/testReport)** for PR 19029 at commit [`56c0d41`](https://github.com/apache/spark/commit/56c0d41f1517a49a935464933a8021008d8a32f7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateO...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19029#discussion_r135385006
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
    @@ -439,8 +439,9 @@ private[ml] object WeightedLeastSquares {
     
         /**
          * Weighted population standard deviation of labels.
    +     * We prevent variance from negative value caused by numerical error.
          */
    -    def bStd: Double = math.sqrt(bbSum / wSum - bBar * bBar)
    +    def bStd: Double = math.sqrt(math.max(bbSum / wSum - bBar * bBar, 0.0))
    --- End diff --
    
    There are a couple more places where variance is computed in this file -- I think they need this too?


[GitHub] spark pull request #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateO...

Posted by WeichenXu123 <gi...@git.apache.org>.
Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19029#discussion_r135186430
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
    @@ -438,6 +438,10 @@ private[ml] object SummaryBuilderImpl extends Logging {
             while (i < len) {
               realVariance(i) = (currM2n(i) + deltaMean(i) * deltaMean(i) * weightSum(i) *
                 (totalWeightSum - weightSum(i)) / totalWeightSum) / denominator
    +          // Because of numerical error, it is possible to get negative real variance
    +          if (realVariance(i) < 0.0) {
    --- End diff --
    
    Hmm, `WeightedLeastSquares` uses another way to compute variance, `Var(X) = E(X^2) - E(X)^2`, but it seems it can also have this problem.
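    
    For illustration only, the `E(X^2) - E(X)^2` form mentioned above could be sketched in plain Scala as follows. The names `bbSum`, `wSum` and `bBar` follow the `bStd` diff in this thread; the object, the accumulation loop, `bSum` and `weightedVariance` are assumptions, not the actual `WeightedLeastSquares` code.
    ```scala
    object MomentVarianceSketch {
      // Weighted population variance via E(X^2) - E(X)^2, clamped at zero.
      def weightedVariance(values: Array[Double], weights: Array[Double]): Double = {
        require(values.length == weights.length, "values and weights must align")
        var wSum = 0.0  // sum of weights
        var bSum = 0.0  // weighted sum of values
        var bbSum = 0.0 // weighted sum of squared values
        var i = 0
        while (i < values.length) {
          wSum += weights(i)
          bSum += weights(i) * values(i)
          bbSum += weights(i) * values(i) * values(i)
          i += 1
        }
        require(wSum > 0.0, "total weight must be positive")
        val bBar = bSum / wSum
        // The two terms are nearly equal when the data are (almost) constant, so the
        // subtraction can round to a tiny negative number; clamp it at zero.
        math.max(bbSum / wSum - bBar * bBar, 0.0)
      }
    }
    ```
    With constant input, such as four copies of 3.0 with different weights, the unclamped difference may round to a value just below zero, which is the case the clamp covers.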


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81118/
    Test PASSed.


[GitHub] spark pull request #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateO...

Posted by yanboliang <gi...@git.apache.org>.
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19029#discussion_r134816423
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala ---
    @@ -438,6 +438,10 @@ private[ml] object SummaryBuilderImpl extends Logging {
             while (i < len) {
               realVariance(i) = (currM2n(i) + deltaMean(i) * deltaMean(i) * weightSum(i) *
                 (totalWeightSum - weightSum(i)) / totalWeightSum) / denominator
    +          // Because of numerical error, it is possible to get negative real variance
    +          if (realVariance(i) < 0.0) {
    --- End diff --
    
    The computation of _variance_ may hit this numerical error. It seems `WeightedLeastSquares` also uses the same method to compute _variance_; will it have a similar issue? @WeichenXu123


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    Merged build finished. Test PASSed.


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81127/
    Test PASSed.


[GitHub] spark issue #19029: [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSu...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19029
  
    **[Test build #81118 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81118/testReport)** for PR 19029 at commit [`c24292c`](https://github.com/apache/spark/commit/c24292ccad700d39892a576390cff2559c4f3b9a).

