Posted to issues@spark.apache.org by "Weichen Xu (JIRA)" <ji...@apache.org> on 2017/08/23 10:52:00 UTC

[jira] [Created] (SPARK-21818) MultivariateOnlineSummarizer.variance generate negative result

Weichen Xu created SPARK-21818:
----------------------------------

             Summary: MultivariateOnlineSummarizer.variance generate negative result
                 Key: SPARK-21818
                 URL: https://issues.apache.org/jira/browse/SPARK-21818
             Project: Spark
          Issue Type: Bug
          Components: ML, MLlib
    Affects Versions: 2.2.0
            Reporter: Weichen Xu


Because of numerical error, MultivariateOnlineSummarizer.variance can return a negative variance.
This is a serious bug because many algorithms in MLlib compute the standard deviation as sqrt(variance).
With a negative variance, sqrt returns NaN, which crashes the whole algorithm.

We can reproduce this bug using the following code:
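{code}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

    // Assumption: the weighted add(sample, weight) overload is private[spark] in 2.2.0,
    // so this snippet has to run from code inside the org.apache.spark package
    // (for example, a test suite).

    // Four summarizers, each holding the same value (3.0) with a different weight.
    val summarizer1 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.7)
    val summarizer2 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.4)
    val summarizer3 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.5)
    val summarizer4 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.4)

    // Merge them into one summarizer, as a distributed aggregation would.
    val summarizer = summarizer1
      .merge(summarizer2)
      .merge(summarizer3)
      .merge(summarizer4)

    // All samples are identical, so the variance should be exactly 0.0;
    // a negative value here reproduces the bug.
    println(summarizer.variance(0))
{code}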
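
To illustrate why a slightly negative variance is harmful downstream, here is a minimal, Spark-free sketch. The object VarianceGuard and the helper safeStd are hypothetical names introduced only for this illustration, and the value -1e-17 is a stand-in for a rounding-error result, not actual output from the code above. Clamping at zero is one possible guard, not necessarily the fix adopted for this issue.

```scala
object VarianceGuard {
  // Hypothetical helper: treat a negative variance (assumed to come from
  // floating-point rounding) as zero before taking the square root.
  def safeStd(variance: Double): Double =
    math.sqrt(math.max(variance, 0.0))

  def main(args: Array[String]): Unit = {
    // Stand-in for a variance that is mathematically 0 but came out
    // slightly negative because of rounding; not real Spark output.
    val slightlyNegative = -1e-17

    // The unguarded square root of a negative number is NaN.
    println(math.sqrt(slightlyNegative).isNaN)

    // The clamped version returns a usable standard deviation.
    println(safeStd(slightlyNegative))
  }
}
```

A caller that divides features by the standard deviation would still need to handle a zero result separately, since clamping only removes the NaN.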



