Posted to reviews@spark.apache.org by KyleLi1985 <gi...@git.apache.org> on 2018/11/23 14:54:12 UTC
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
I compared Spark's computeCovariance function in RowMatrix (for DenseVector input) with NumPy's cov function
and found two problems. The results are below:
1) The Spark function computeCovariance in RowMatrix is not accurate.
Input data:
1.0,2.0,3.0,4.0,5.0
2.0,3.0,1.0,2.0,6.0
Numpy function cov result:
[[2.5 1.75]
[ 1.75 3.7 ]]
RowMatrix function computeCovariance result:
2.5 1.75
1.75 3.700000000000001
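For reference, the expected values for this small input can be checked by hand with a two-pass sample covariance (n - 1 denominator), treating each input line as one variable, which is what np.cov does by default. The sketch below is plain Python, not Spark or NumPy code; the helper name sample_cov is made up for illustration. It also measures how far the Spark value 3.700000000000001 is from 3.7 in relative terms.

```python
def sample_cov(xs, ys):
    """Two-pass sample covariance with the n - 1 denominator."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Second pass: accumulate products of the centered values.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


row1 = [1.0, 2.0, 3.0, 4.0, 5.0]
row2 = [2.0, 3.0, 1.0, 2.0, 6.0]

# 2x2 covariance matrix, one variable per input line.
cov = [[sample_cov(a, b) for b in (row1, row2)] for a in (row1, row2)]
print(cov)

# Relative difference between the value reported for Spark and 3.7.
spark_value = 3.700000000000001
rel_diff = abs(spark_value - 3.7) / 3.7
print(rel_diff)
```

The relative difference is on the order of double-precision rounding error, so this first case is a last-digit discrepancy rather than a wrong result.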
2) For some inputs, the result is badly wrong.
The input data was generated as follows:
import numpy as np
data1 = np.random.normal(loc=100000, scale=0.000009, size=10000000)
data2 = np.random.normal(loc=200000, scale=0.000002, size=10000000)
Numpy function cov result:
[[ 8.10536442e-11 -4.35439574e-15]
[ -4.35439574e-15 3.99928264e-12]]
RowMatrix function computeCovariance result:
-0.0027484893798828125 0.001491546630859375
0.001491546630859375 8.087158203125E-4
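A likely explanation for results like these, including the negative variance on the diagonal, is catastrophic cancellation. The sketch below assumes a one-pass formula of the form sum(x*x)/(n-1) - n/(n-1)*mean^2; that this matches what computeCovariance does internally is an assumption, not something shown in this comment. It uses plain Python and a smaller sample (100,000 values instead of 10,000,000) so it runs quickly, and compares the one-pass form with a two-pass form that centers the data first.

```python
import math
import random

random.seed(0)
n = 100_000  # smaller than the 10,000,000 samples used above, for speed
xs = [random.gauss(100000, 0.000009) for _ in range(n)]

mean = math.fsum(xs) / n

# One-pass form (assumed): subtract two numbers that are both about 1e10.
# Doubles near 1e10 are spaced about 2e-6 apart, so a true variance of
# about 8.1e-11 cannot survive the subtraction.
one_pass = sum(x * x for x in xs) / (n - 1) - n / (n - 1) * mean * mean

# Two-pass form: center first, then sum the small squared deviations.
two_pass = math.fsum((x - mean) ** 2 for x in xs) / (n - 1)

print("one-pass:", one_pass)
print("two-pass:", two_pass)
print("expected (scale ** 2):", 0.000009 ** 2)
```

The two-pass value should land close to scale ** 2, while the one-pass value is dominated by rounding error and can come out zero, far too large, or negative.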