Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/04/11 14:35:25 UTC

[jira] [Created] (SPARK-14533) RowMatrix.computeCovariance inaccurate when values are very large

Sean Owen created SPARK-14533:
---------------------------------

             Summary: RowMatrix.computeCovariance inaccurate when values are very large
                 Key: SPARK-14533
                 URL: https://issues.apache.org/jira/browse/SPARK-14533
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.6.1, 2.0.0
            Reporter: Sean Owen
            Assignee: Sean Owen
            Priority: Minor


The following code computes the Pearson correlation of two independent normal RDDs, which should be approximately 0; instead it produces a value that's often quite different from 0, sometimes outside [-1,1] or even NaN:

{code}
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.mllib.stat.Statistics

// Two independent standard normal RDDs, shifted far from zero
val a = RandomRDDs.normalRDD(sc, 100000, 10).map(_ + 1000000000.0)
val b = RandomRDDs.normalRDD(sc, 100000, 10).map(_ + 1000000000.0)
val p = Statistics.corr(a, b, method = "pearson")
{code}

This is a "known issue" to some degree, given how Cov(X,Y) is calculated in {{RowMatrix.computeCovariance}}, as Cov(X,Y) = E[XY] - E[X]E[Y]. With means around 1e9, E[XY] is on the order of 1e18 while the covariance itself is O(1), far below the ~16 significant digits a double carries, so the subtraction cancels catastrophically. The simpler and more accurate approach is to center the input before computing the Gramian, but centering destroys sparsity, so it would be inefficient for sparse data.
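
For intuition, here is a minimal self-contained sketch of the cancellation (plain Scala, no Spark; the names are illustrative only, not Spark APIs):

{code}
val rng = new scala.util.Random(42)
val shift = 1000000000.0  // ~1e9 mean, as in the reproduction above
val x = Array.fill(100000)(rng.nextGaussian() + shift)
val y = Array.fill(100000)(rng.nextGaussian() + shift)
val n = x.length

// Naive formula: Cov(X,Y) = E[XY] - E[X]E[Y].
// E[XY] is ~1e18 while the true covariance is O(1), so the useful
// digits lie below double precision and the subtraction is noise.
val ex = x.sum / n
val ey = y.sum / n
val exy = x.zip(y).map { case (xi, yi) => xi * yi }.sum / n
val covNaive = exy - ex * ey

// Centered formula: subtract the means first; all terms stay O(1).
val covCentered = x.zip(y).map { case (xi, yi) => (xi - ex) * (yi - ey) }.sum / n
{code}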

However, for dense data -- which includes the code paths that compute correlations -- centering is quite sensible, and would at least improve accuracy in the dense-row case; a sketch follows.
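
A rough sketch of what the dense path could look like. This {{centeredCovariance}} helper is hypothetical, not the actual patch; it goes through the public {{RowMatrix}} API rather than the internals:

{code}
import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Hypothetical helper (not the actual patch): center every row by the
// column means, then compute the Gramian of the centered matrix and
// scale by 1/(n-1) to obtain the sample covariance.
def centeredCovariance(rows: RDD[Vector]): Matrix = {
  val mat = new RowMatrix(rows)
  val means = mat.computeColumnSummaryStatistics().mean.toArray
  val centered = rows.map { v =>
    Vectors.dense(v.toArray.zip(means).map { case (x, m) => x - m })
  }
  val n = mat.numRows()
  val gram = new RowMatrix(centered).computeGramianMatrix()
  // Matrix.toArray is column-major; the Gramian is symmetric anyway.
  Matrices.dense(gram.numRows, gram.numCols, gram.toArray.map(_ / (n - 1)))
}
{code}

Since centering turns sparse rows dense, this only pays off on the dense path, which is why the E[XY] - E[X]E[Y] form would stay for sparse data.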

Also, the mean column values computed in this method can be obtained more simply and accurately from {{computeColumnSummaryStatistics()}}, as shown below.
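
For reference, the means are available in one pass via the summary statistics (assuming {{rows: RDD[Vector]}} as in the sketch above):

{code}
val summary = new RowMatrix(rows).computeColumnSummaryStatistics()
val means: Vector = summary.mean  // per-column means
{code}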


