Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2022/10/26 19:47:00 UTC

[jira] [Commented] (SPARK-40920) SVD: matrix U has wrong row order

    [ https://issues.apache.org/jira/browse/SPARK-40920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17624714#comment-17624714 ] 

Sean R. Owen commented on SPARK-40920:
--------------------------------------

So, first, to reproduce the problem more reliably, add a .repartition(10) or similar after .zipWithIndex(). The problem here hinges on which operations preserve row ordering. In the simple case of a single partition, this code snippet works as expected.

However, the distributed matrix representations here generally don't preserve row order in the underlying RDD; they rely on row indices instead. To recover the original ordering, try using .toIndexedRowMatrix() instead of .toRowMatrix(), and then call .sortBy(lambda r: r.index) on svd.U.rows before collecting (svd.V is a local matrix, so it needs no sorting). I believe that will give you the expected result, or at least it gave me the same answers as scipy's SVD.

Now, I think this is confusing. In particular, there is "RowMatrix", which lacks indices and is returned in several places. Without indices, you'd really expect that (for instance) CoordinateMatrix.toRowMatrix returns rows ordered by their coordinates, but it doesn't. I think that's a bug; let me chew on the implications of fixing that while you check whether this is the issue.
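To illustrate why a lost row order shows up in U specifically, here is a NumPy-only sketch with no Spark involved. The permutation `perm` is a hypothetical stand-in for whatever order the rows happen to come back in. Permuting the rows of a matrix permutes the rows of U the same way and leaves the singular values unchanged, so the factors still multiply back to the shuffled matrix, just not to the original one, until the rows are put back by index:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((14, 3))
perm = rng.permutation(14)  # hypothetical order in which the rows come back

# SVD of the row-shuffled matrix.
u, s, vt = np.linalg.svd(a[perm], full_matrices=False)

# The factors reconstruct the shuffled matrix, not the original one.
shuffled_recon = (u * s) @ vt

# With the indices known, each row can be put back in its original position:
# row k of the shuffled result belongs at index perm[k].
restored = np.empty_like(shuffled_recon)
restored[perm] = shuffled_recon

# The singular values do not depend on the row order.
s_original = np.linalg.svd(a, compute_uv=False)
```

Without the indices (the RowMatrix case), there is nothing to play the role of `perm`, so the last step cannot be done.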

> SVD: matrix U has wrong row order
> ---------------------------------
>
>                 Key: SPARK-40920
>                 URL: https://issues.apache.org/jira/browse/SPARK-40920
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 3.3.0
>         Environment: Python 3.10, multi-core machine, no cluster
>            Reporter: Leonard Papenmeier
>            Priority: Major
>         Attachments: image-2022-10-26-13-58-52-998.png, image-2022-10-26-13-59-04-608.png, image-2022-10-26-13-59-13-425.png
>
>
> When performing SVD on a RowMatrix, the matrix U has the wrong row order and the original matrix is not correctly restored with the given matrix. 
>  
> Consider the following code:
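> {code:python}
> import numpy as np
> from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
>
> # ctx is assumed to be an existing SparkContext
> x_np = np.random.random((14, 3)) # the size matters, it works for smaller sizes
> x = ctx.parallelize(x_np).zipWithIndex().map(
>     lambda r: [MatrixEntry(r[1], i, r[0][i]) for i in range(len(r[0]))])
> x = CoordinateMatrix(x.flatMap(lambda x: x))
> x_inv = matrix_inverse(x) {code}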
> with 
> {code:python}
> def matrix_inverse(matrix: CoordinateMatrix) -> DenseMatrix:
>     mtrx = matrix.toRowMatrix()
>     svd = matrix.toRowMatrix().computeSVD(k=mtrx.numCols(), computeU=True, rCond=1e-15)  # do the SVD
>     s_inv = 1 / svd.s
>     mtrx_orig = matrix.toBlockMatrix().blocks.first()[1].toArray()
>     u_dense = mtrx_orig @ (svd.V.toArray() * s_inv[np.newaxis, :])
>     cov_inv = np.matmul(svd.V.toArray(), np.multiply(s_inv[:, np.newaxis], u_dense.T))
>     u_from_spark = np.array(svd.U.rows.map(lambda x: x.toArray()).collect())
>     return DenseMatrix(numRows=cov_inv.shape[0], numCols=cov_inv.shape[1],
>                        values=cov_inv.ravel(order="F"))  # return inverse as dense matrix {code}
> Then, u_dense is the correct U but differs from the U produced by Spark. In particular, the U in Spark does not return the correct pseudoinverse and U@S@V.T does not reproduce the input matrix. 
>  
> With the following input matrix x
> !image-2022-10-26-13-58-52-998.png!
> I get the following u_dense
> !image-2022-10-26-13-59-04-608.png!
> but the following u_from_spark
> !image-2022-10-26-13-59-13-425.png!
>  
> On careful inspection, it seems that the row order is wrong.
>  


