Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/04/11 14:35:25 UTC
[jira] [Created] (SPARK-14533) RowMatrix.computeCovariance inaccurate when values are very large
Sean Owen created SPARK-14533:
---------------------------------
Summary: RowMatrix.computeCovariance inaccurate when values are very large
Key: SPARK-14533
URL: https://issues.apache.org/jira/browse/SPARK-14533
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 1.6.1, 2.0.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
The following code will produce a Pearson correlation that's quite different from 0, sometimes outside [-1,1] or even NaN:
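{code}
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.mllib.stat.Statistics

// Assumes an existing SparkContext `sc`, e.g. in spark-shell
val a = RandomRDDs.normalRDD(sc, 100000, 10).map(_ + 1000000000.0)
val b = RandomRDDs.normalRDD(sc, 100000, 10).map(_ + 1000000000.0)
val p = Statistics.corr(a, b, method = "pearson")
{code}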
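This is a "known issue" to some degree, given how Cov(X,Y) is calculated in {{RowMatrix.computeCovariance}}: Cov(X,Y) = E[XY] - E[X]E[Y]. When the values are large relative to their spread, the two terms are nearly equal and their difference loses most of its significant digits. The easier and more accurate approach is to center the input before computing the Gramian, but this would be inefficient for sparse data, because centering makes the rows dense.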
However, for dense data -- which includes the code paths that compute correlations -- this approach is quite sensible. This would improve accuracy for the dense row case, at least.
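To make the cancellation concrete, here is a minimal sketch in plain Scala (no Spark) that compares the two formulas on data shaped like the example above. The object and method names are made up for illustration, and it uses the population normalization (divide by n) for brevity; it is not the code in {{RowMatrix}}.
{code}
// Illustrative only: names are hypothetical, not part of Spark.
object CovarianceCancellation {
  // One-pass form: Cov(X,Y) = E[XY] - E[X]E[Y]
  def covNaive(x: Array[Double], y: Array[Double]): Double = {
    val n = x.length.toDouble
    val meanXY = x.zip(y).map { case (a, b) => a * b }.sum / n
    meanXY - (x.sum / n) * (y.sum / n)
  }

  // Two-pass form: subtract the means first, then average the products
  def covCentered(x: Array[Double], y: Array[Double]): Double = {
    val n = x.length.toDouble
    val mx = x.sum / n
    val my = y.sum / n
    x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum / n
  }

  def main(args: Array[String]): Unit = {
    val rng = new scala.util.Random(42)
    val offset = 1e9
    // Two independent standard normal samples shifted by a large constant
    val x = Array.fill(100000)(rng.nextGaussian() + offset)
    val y = Array.fill(100000)(rng.nextGaussian() + offset)
    println(s"E[XY] - E[X]E[Y]: ${covNaive(x, y)}")
    println(s"centered:         ${covCentered(x, y)}")
  }
}
{code}
With an offset of 1e9 the products are around 1e18, where adjacent doubles are 128 apart, so the one-pass result is dominated by rounding error, while the centered result should stay close to the true covariance of roughly zero.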
Also, the column means computed in this method can be computed more simply and accurately from {{computeColumnSummaryStatistics()}}.
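As a rough sketch of what the dense path could look like, the helper below takes the column means from {{computeColumnSummaryStatistics()}}, centers each row, and scales the Gramian of the centered rows. {{centeredCovariance}} is a hypothetical name, not an existing Spark method; the sketch assumes dense rows, more than one row, and the unbiased n - 1 normalization.
{code}
import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of Spark: covariance of dense rows via centering.
def centeredCovariance(rows: RDD[Vector]): Matrix = {
  val mat = new RowMatrix(rows)
  val n = mat.numRows()
  // Column means from the summary statistics
  val mean = mat.computeColumnSummaryStatistics().mean.toArray
  // Build new centered rows; this makes every row dense
  val centered = rows.map { v =>
    val a = v.toArray
    Vectors.dense(Array.tabulate(a.length)(i => a(i) - mean(i)))
  }
  // Gramian of the centered rows is the sum of outer products of deviations
  val gram = new RowMatrix(centered).computeGramianMatrix()
  Matrices.dense(gram.numRows, gram.numCols, gram.toArray.map(_ / (n - 1)))
}
{code}
This costs an extra pass over the data and gives up sparsity, which is why it only looks attractive for the dense case.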