You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by KyleLi1985 <gi...@git.apache.org> on 2018/11/23 14:53:48 UTC
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
GitHub user KyleLi1985 opened a pull request:
https://github.com/apache/spark/pull/23126
[SPARK-26158] [MLLIB] fix covariance accuracy problem for DenseVector
## What changes were proposed in this pull request?
Enhance accuracy of the covariance logic in RowMatrix for function computeCovariance
## How was this patch tested?
Unit test
Accuracy test
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/KyleLi1985/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/23126.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #23126
----
commit a8bfdfffbe82a77943adb4bf84ca939d786afc8a
Author: 李亮 <li...@...>
Date: 2018-11-23T14:35:27Z
fix covariance accuracy problem for DenseVector
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r237532703
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -128,6 +128,82 @@ class RowMatrix @Since("1.0.0") (
RowMatrix.triuToFull(n, GU.data)
}
+ private def computeDenseVectorCovariance(mean: Vector, n: Int, m: Long): Matrix = {
+
+ val bc = rows.context.broadcast(mean)
+
+ // Computes n*(n+1)/2, avoiding overflow in the multiplication.
+ // This succeeds when n <= 65535, which is checked above
+ val nt = if (n % 2 == 0) ((n / 2) * (n + 1)) else (n * ((n + 1) / 2))
+
+ val MU = rows.treeAggregate(new BDV[Double](nt))(
+ seqOp = (U, v) => {
+
+ val n = v.size
+ val na = Array.ofDim[Double](n)
+ val means = bc.value
+ if (v.isInstanceOf[DenseVector]) {
+ v.foreachActive{(index, value) =>
--- End diff --
This scalastyle not find my environment . We need consider sparse vector in here. If input some of sparse vector, we need to keep a right result. But, actually the sparse vector override the toArray function. So, update the newest commit.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Compare Spark computeCovariance function in RowMatrix for DenseVector and Numpy's function cov,
Find two problem, below is the result:
1)The Spark function computeCovariance in RowMatrix is not accuracy
input data
1.0,2.0,3.0,4.0,5.0
2.0,3.0,1.0,2.0,6.0
Numpy function cov result:
[[2.5 1.75]
[ 1.75 3.7 ]]
RowMatrix function computeCovariance result:
2.5 1.75
1.75 3.700000000000001
2)For some input case, the result is not good
generate input data by below logic
data1 = np.random.normal(loc=100000, scale=0.000009, size=10000000)
data2 = np.random.normal(loc=200000, scale=0.000002,size=10000000)
Numpy function cov result:
[[ 8.10536442e-11 -4.35439574e-15]
[ -4.35439574e-15 3.99928264e-12]]
RowMatrix function computeCovariance result:
-0.0027484893798828125 0.001491546630859375
0.001491546630859375 8.087158203125E-4
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Ok, I will do it later
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r237130345
--- Diff: mllib/src/test/java/org/apache/spark/ml/feature/JavaPCASuite.java ---
@@ -67,7 +66,7 @@ public void testPCA() {
JavaRDD<Vector> dataRDD = jsc.parallelize(points, 2);
RowMatrix mat = new RowMatrix(dataRDD.map(
- (Vector vector) -> (org.apache.spark.mllib.linalg.Vector) new DenseVector(vector.toArray())
+ (Vector vector) -> org.apache.spark.mllib.linalg.Vectors.fromML(vector)
--- End diff --
I don't think that helps a caller much as they don't know what the implication of this is. I'd leave it for now.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23126
**[Test build #4441 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4441/testReport)** for PR 23126 at commit [`6fca901`](https://github.com/apache/spark/commit/6fca901dfdad230fa7a4e1079a3f48be826276a3).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23126
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23126
**[Test build #4441 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4441/testReport)** for PR 23126 at commit [`6fca901`](https://github.com/apache/spark/commit/6fca901dfdad230fa7a4e1079a3f48be826276a3).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
It would be better, update the commit
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/23126
Merged to master
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r236927721
--- Diff: mllib/src/test/java/org/apache/spark/ml/feature/JavaPCASuite.java ---
@@ -67,7 +66,7 @@ public void testPCA() {
JavaRDD<Vector> dataRDD = jsc.parallelize(points, 2);
RowMatrix mat = new RowMatrix(dataRDD.map(
- (Vector vector) -> (org.apache.spark.mllib.linalg.Vector) new DenseVector(vector.toArray())
+ (Vector vector) -> org.apache.spark.mllib.linalg.Vectors.fromML(vector)
--- End diff --
Sure, as you said, if the **first item** is sparse vector, it will align with computeSparseVectorCovariance logic, and if the **first item** is a dense vector, it will align with computeDenseVectorCovariance logic. The rest is user's choice.
But, maybe we can add some notes into the annotation of function computeCovariance,
give user some notes, like:
/**
* Computes the covariance matrix, treating each row as an observation.
*
* Note:
* When the first row is DenseVector, we use the computeDenseVectorCovariance
* to calculate the covariance matrix, and if the first row is SparseVector, we
* use the computeSparseVectorCovariance to calculate the covariance matrix
*
* @return a local dense matrix of size n x n
*
* @note This cannot be computed on matrices with more than 65535 columns.
*/
@Since("1.0.0")
def computeCovariance(): Matrix = {
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
After add this commit
We get the result for RowMatrix computeCovariance function:
For the input data
1.0,2.0,3.0,4.0,5.0
2.0,3.0,1.0,2.0,6.0
RowMatrix function computeCovariance result:
2.5 1.75
1.75 3.7
For the input data generated by
data1 = np.random.normal(loc=100000, scale=0.000009, size=10000000)
data2 = np.random.normal(loc=200000, scale=0.000002,size=10000000)
RowMatrix function computeCovariance result:
8.109505250896888E-11 -5.003160564607658E-15
-5.003160564607658E-15 4.08276584628234E-12
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23126
**[Test build #4445 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4445/testReport)** for PR 23126 at commit [`17117d7`](https://github.com/apache/spark/commit/17117d70ae0e26b5a99c6ea5a572cfbaf83080ab).
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r237556042
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -128,6 +128,82 @@ class RowMatrix @Since("1.0.0") (
RowMatrix.triuToFull(n, GU.data)
}
+ private def computeDenseVectorCovariance(mean: Vector, n: Int, m: Long): Matrix = {
+
+ val bc = rows.context.broadcast(mean)
+
+ // Computes n*(n+1)/2, avoiding overflow in the multiplication.
+ // This succeeds when n <= 65535, which is checked above
+ val nt = if (n % 2 == 0) ((n / 2) * (n + 1)) else (n * ((n + 1) / 2))
+
+ val MU = rows.treeAggregate(new BDV[Double](nt))(
+ seqOp = (U, v) => {
+
+ val n = v.size
+ val na = Array.ofDim[Double](n)
+ val means = bc.value
+ if (v.isInstanceOf[DenseVector]) {
+ v.foreachActive{(index, value) =>
--- End diff --
Yeah, because it hasn't subtracted the mean from one of the values in the Spark vector. that's the general issue with centering a sparse vector: it becomes dense!
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23126
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/23126
Can one of the admins verify this patch?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r236927771
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/RowMatrixSuite.scala ---
@@ -266,6 +266,16 @@ class RowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
}
}
+ test("dense vector covariance accuracy (SPARK-26158)") {
+ val rdd1 = sc.parallelize(Array(100000.000004, 100000.000012, 99999.9999931, 99999.9999977))
--- End diff --
handy thing
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r236658063
--- Diff: mllib/src/test/java/org/apache/spark/ml/feature/JavaPCASuite.java ---
@@ -67,7 +66,7 @@ public void testPCA() {
JavaRDD<Vector> dataRDD = jsc.parallelize(points, 2);
RowMatrix mat = new RowMatrix(dataRDD.map(
- (Vector vector) -> (org.apache.spark.mllib.linalg.Vector) new DenseVector(vector.toArray())
+ (Vector vector) -> org.apache.spark.mllib.linalg.Vectors.fromML(vector)
--- End diff --
This is an interesting point; what if the data is a mix of sparse and dense? At least this test now covers that case a bit.
The check above only tests the first element, but that's the only reasonable thing to do. If it were, for example, mostly sparse but the first happened to be dense, then this change could be problematic. I think it's OK to assume that most real-world uses are all sparse or dense. It is possible that some data in a sparse case is occasionally encoded as sparse as an optimization if it has a lot of zeroes, but this case works fine here.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r237505113
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -128,6 +128,69 @@ class RowMatrix @Since("1.0.0") (
RowMatrix.triuToFull(n, GU.data)
}
+ private def computeDenseVectorCovariance(mean: Vector, n: Int, m: Long): Matrix = {
+
+ val bc = rows.context.broadcast(mean)
+
+ // Computes n*(n+1)/2, avoiding overflow in the multiplication.
+ // This succeeds when n <= 65535, which is checked above
+ val nt = if (n % 2 == 0) ((n / 2) * (n + 1)) else (n * ((n + 1) / 2))
+
+ val MU = rows.treeAggregate(new BDV[Double](nt))(
+ seqOp = (U, v) => {
+ val dv = new DenseVector(v.toArray.zip(bc.value.toArray)
--- End diff --
I do some more research about your mention, the primitive array is more faster than zip, or zipped and view.zip. Because there is no more temporary collection and extra memory copies.
I do a comparison test, below is the result
![2018-11-29 9 18 11](https://user-images.githubusercontent.com/40689156/49227954-44cc3100-f425-11e8-86af-6a48f1353bf7.png)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r237520303
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -128,6 +128,82 @@ class RowMatrix @Since("1.0.0") (
RowMatrix.triuToFull(n, GU.data)
}
+ private def computeDenseVectorCovariance(mean: Vector, n: Int, m: Long): Matrix = {
+
+ val bc = rows.context.broadcast(mean)
+
+ // Computes n*(n+1)/2, avoiding overflow in the multiplication.
+ // This succeeds when n <= 65535, which is checked above
+ val nt = if (n % 2 == 0) ((n / 2) * (n + 1)) else (n * ((n + 1) / 2))
+
+ val MU = rows.treeAggregate(new BDV[Double](nt))(
+ seqOp = (U, v) => {
+
+ val n = v.size
+ val na = Array.ofDim[Double](n)
+ val means = bc.value
+ if (v.isInstanceOf[DenseVector]) {
+ v.foreachActive{(index, value) =>
--- End diff --
Nit (this might fail scalastyle): space around the braces here. But we don't need foreachActive here I think; they're all 'active' in a dense vector and every element needs the mean subtracted. Even for a sparse vector you have to do this. Do you really need a separate case here? we're assuming the vectors are (mostly) dense in this method.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r237133038
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -128,6 +128,69 @@ class RowMatrix @Since("1.0.0") (
RowMatrix.triuToFull(n, GU.data)
}
+ private def computeDenseVectorCovariance(mean: Vector, n: Int, m: Long): Matrix = {
+
+ val bc = rows.context.broadcast(mean)
+
+ // Computes n*(n+1)/2, avoiding overflow in the multiplication.
+ // This succeeds when n <= 65535, which is checked above
+ val nt = if (n % 2 == 0) ((n / 2) * (n + 1)) else (n * ((n + 1) / 2))
+
+ val MU = rows.treeAggregate(new BDV[Double](nt))(
+ seqOp = (U, v) => {
+ val dv = new DenseVector(v.toArray.zip(bc.value.toArray)
--- End diff --
A few more nits here. Let's broadcast `mean.toArray` to save converting it each time. Given this might be performance-sensitive, what about avoiding zip and computing the new array in a loop? I'm not sure how much difference it makes but I know we do similar things elsewhere.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Um, the unit test in spark indeed cover both case. But there is function closeToZero to handle accuracy problem, so..
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Add test case in RowMatrixSuite for this PR,
The breeze output is
6.711333870761802E-11 -3.833375461575691E-12
-3.833375461575691E-12 2.916662578525011E-12
Before add this commit, the result is:
1.9073486328125E-6 3.814697265625E-6
3.814697265625E-6 -7.62939453125E-6
After add this commit, the result is:
6.711333870761802E-11 -3.833375461575691E-12
-3.833375461575691E-12 2.916662578525011E-12
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r237541497
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -128,6 +128,82 @@ class RowMatrix @Since("1.0.0") (
RowMatrix.triuToFull(n, GU.data)
}
+ private def computeDenseVectorCovariance(mean: Vector, n: Int, m: Long): Matrix = {
+
+ val bc = rows.context.broadcast(mean)
+
+ // Computes n*(n+1)/2, avoiding overflow in the multiplication.
+ // This succeeds when n <= 65535, which is checked above
+ val nt = if (n % 2 == 0) ((n / 2) * (n + 1)) else (n * ((n + 1) / 2))
+
+ val MU = rows.treeAggregate(new BDV[Double](nt))(
+ seqOp = (U, v) => {
+
+ val n = v.size
+ val na = Array.ofDim[Double](n)
+ val means = bc.value
+ if (v.isInstanceOf[DenseVector]) {
+ v.foreachActive{(index, value) =>
--- End diff --
But isn't it incorrect to not subtract the mean from 0 elements in a sparse vector anyway?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/23126
Is there a simple test case you can add to cover that too? that would really prove this change.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Sure, the test cases include sparse and dense case.
Do these case again for new commit
we use data from
http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals
as **dense case**
we use data from
http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems
as **sparse case**
Upload the test result
Dense data case
[sparkdensedataout.txt](https://github.com/apache/spark/files/2615937/sparkdensedataout.txt)
[numpydensedataout.txt](https://github.com/apache/spark/files/2615938/numpydensedataout.txt)
Sparse data case
[sparksparsedataout.txt](https://github.com/apache/spark/files/2615943/sparksparsedataout.txt)
[numpysparsedataout.txt](https://github.com/apache/spark/files/2615944/numpysparsedataout.txt)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r236656961
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/RowMatrixSuite.scala ---
@@ -266,6 +266,16 @@ class RowMatrixSuite extends SparkFunSuite with MLlibTestSparkContext {
}
}
+ test("dense vector covariance accuracy (SPARK-26158)") {
+ val rdd1 = sc.parallelize(Array(100000.000004, 100000.000012, 99999.9999931, 99999.9999977))
--- End diff --
I think you could just parallelize a couple DenseVectors, or (x,y) tuples and transform, rather than this way, but it doesn't matter really.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Plug do some more test on real data after add this commit
we use data from
http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals
and data from
http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems
to do some more accuracy test, the accuracy result is OK
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r237551217
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -128,6 +128,82 @@ class RowMatrix @Since("1.0.0") (
RowMatrix.triuToFull(n, GU.data)
}
+ private def computeDenseVectorCovariance(mean: Vector, n: Int, m: Long): Matrix = {
+
+ val bc = rows.context.broadcast(mean)
+
+ // Computes n*(n+1)/2, avoiding overflow in the multiplication.
+ // This succeeds when n <= 65535, which is checked above
+ val nt = if (n % 2 == 0) ((n / 2) * (n + 1)) else (n * ((n + 1) / 2))
+
+ val MU = rows.treeAggregate(new BDV[Double](nt))(
+ seqOp = (U, v) => {
+
+ val n = v.size
+ val na = Array.ofDim[Double](n)
+ val means = bc.value
+ if (v.isInstanceOf[DenseVector]) {
+ v.foreachActive{(index, value) =>
--- End diff --
Just, I do a quick test, input data
val data = Seq(
Vectors.dense(100000.000004, 199999.999999),
Vectors.dense(100000.000012, 0.0).toSparse,
Vectors.dense(99999.9999931, 200000.000003),
Vectors.dense(99999.9999977, 200000.000001)
)
and
val data = Seq(
Vectors.dense(100000.000004, 199999.999999),
Vectors.dense(100000.000012, 0.0),
Vectors.dense(99999.9999931, 200000.000003),
Vectors.dense(99999.9999977, 200000.000001)
)
for all dense vector case, the breeze and spark realization (substract the mean from 0) get the same result
6.711333870761802E-11 -0.6866668021258175
-0.6866668021258175 1.00000000001E10
but if change the one dense vector to sparse vector and not substract the mean from 0, the spark get the result
6.711333870761802E-11 -0.17166651863796398
-0.17166651863796398 2.500000000024999E9
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...
Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/23126
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/23126
**[Test build #4445 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4445/testReport)** for PR 23126 at commit [`17117d7`](https://github.com/apache/spark/commit/17117d70ae0e26b5a99c6ea5a572cfbaf83080ab).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...
Posted by KyleLi1985 <gi...@git.apache.org>.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
align JavaPCASuite expected data process behavior with PCA function fit
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org