Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2021/09/26 00:53:00 UTC

[jira] [Resolved] (SPARK-35598) Improve Spark-ML PCA analysis

     [ https://issues.apache.org/jira/browse/SPARK-35598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-35598.
----------------------------------
    Resolution: Won't Fix

> Improve Spark-ML PCA analysis
> -----------------------------
>
>                 Key: SPARK-35598
>                 URL: https://issues.apache.org/jira/browse/SPARK-35598
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 3.1.2
>            Reporter: Antonio Zammuto
>            Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.*
> When performing a PCA, the covariance matrix is calculated first.
>  In the method RowMatrix.computeCovariance() there is the following code:
> {code:java}
>     if (rows.first().isInstanceOf[DenseVector]) {
>       computeDenseVectorCovariance(mean, n, m)
>     } else {
>       computeSparseVectorCovariance(mean, n, m)
>     }
> {code}
> As can be seen, there are two different functions to compute the covariance matrix, depending on whether the matrix is sparse or dense. Currently, the decision whether the matrix is sparse or dense is based solely on the type of the first row (DenseVector or SparseVector).
> I propose to calculate the sparsity of the matrix as the ratio between the number of zeros and the total number of elements, and to use that sparsity to switch between the two functions.
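> As an illustrative (hypothetical) example: for the 3x3 diagonal matrix below, 6 of the 9 elements are zero, so the sparsity is 6/9 ≈ 0.67 and the matrix would be routed to the sparse code path under a 50% threshold.
> {code:java}
> import org.apache.spark.mllib.linalg.Vectors
>
> // Illustrative data only: the three rows of a 3x3 diagonal matrix.
> val rows = Seq(
>   Vectors.dense(1.0, 0.0, 0.0),
>   Vectors.dense(0.0, 2.0, 0.0),
>   Vectors.dense(0.0, 0.0, 3.0)
> )
>
> // 6 zeros out of 9 elements => sparsity = 2/3
> val sparsity = rows.map(v => v.size - v.numNonzeros).sum.toDouble /
>   (rows.size * rows.head.size)
> {code}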
> *Q2. What problem is this proposal NOT designed to solve?*
> The two functions give slightly different results. This SPIP does not solve that issue.
> *Q3. How is it done today, and what are the limits of current practice?*
> Basing the decision only on the type of the first row is questionable.
> It would be better to base it on the entire matrix, using a mathematical concept such as sparsity.
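> As a hypothetical illustration (the data is invented for the example): if the first row happens to be a DenseVector while the rest of the matrix is sparse, the current code takes the dense path regardless of the overall sparsity.
> {code:java}
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>
> // Hypothetical data: first row dense, remaining rows sparse.
> // Assumes an existing SparkContext named sc.
> val mat = new RowMatrix(sc.parallelize(Seq(
>   Vectors.dense(1.0, 0.0, 0.0, 0.0),
>   Vectors.sparse(4, Seq((1, 2.0))),
>   Vectors.sparse(4, Seq((2, 3.0)))
> )))
>
> // rows.first().isInstanceOf[DenseVector] is true, so computeCovariance()
> // uses the dense code path even though 9 of 12 elements are zero.
> {code}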
> *Q4. What is new in your approach and why do you think it will be successful?*
> The code is implemented in a more meaningful way: the decision is made on the entire matrix rather than on a single row.
> *Q5. Who cares? If you are successful, what difference will it make?*
> More consistent code for PCA analysis; it will benefit anyone who uses the PCA analysis.
> *Q6. What are the risks?*
> You still need a threshold to decide whether the matrix is sparse or not, and there is no universally accepted value that defines sparsity.
>  Therefore the threshold of 50% that I chose is arbitrary and might not be the best in all cases.
>  Still, I consider this an improvement over the current implementation.
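> To illustrate the arbitrariness (values invented for the example): two matrices with sparsities 0.49 and 0.51 are nearly identical, yet fall on opposite sides of the cut-off and take different code paths.
> {code:java}
> // Sketch only: nearly identical sparsities straddling the 50% threshold.
> val sparsityThreshold = 0.5
> Seq(0.49, 0.51).foreach { s =>
>   val path = if (s < sparsityThreshold) "dense" else "sparse"
>   println(s"sparsity $s -> $path path")
> }
> {code}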
> *Q7. How long will it take?*
> The change is easy to implement and test.
> *Q8. What are the mid-term and final “exams” to check for success?*
> We can check that the sparsity is calculated correctly and that there are no "unexplainable" differences with the current implementation.
>  "Unexplainable" meaning: beyond the expected cases where the new code uses computeDenseVectorCovariance where the old code used computeSparseVectorCovariance, and vice versa.
> The tests in RowMatrixSuite pass.
>  A couple of tests have been added to verify that the sparsity is calculated properly.
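> A sketch of such a test (the suite style, helper names, and data are assumptions; the exact tests added may differ):
> {code:java}
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>
> // Hypothetical test: 4 of the 6 elements are zero => sparsity = 2/3.
> test("calcSparsity on a mixed dense/sparse matrix") {
>   val mat = new RowMatrix(sc.parallelize(Seq(
>     Vectors.dense(1.0, 0.0, 0.0),
>     Vectors.sparse(3, Seq((0, 2.0)))
>   )))
>   assert(math.abs(mat.calcSparsity() - 2.0 / 3.0) < 1e-9)
> }
> {code}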
> *Appendix A. Proposed API Changes. Optional section defining APIs changes, if any. Backward and forward compatibility must be taken into account.*
> Define a function to calculate the sparsity of the RowMatrix:
> {code:java}
>   /** Sparsity of the matrix: the fraction of elements that are zero. */
>   def calcSparsity(): Double = {
>     val numZeros = rows.map(vec => vec.size - vec.numNonzeros).sum()
>     numZeros / (rows.count() * rows.first().size).toDouble
>   }
> {code}
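> For example (illustrative data, assuming calcSparsity is added to RowMatrix as above):
> {code:java}
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.linalg.distributed.RowMatrix
>
> // 3 of the 6 elements are zero => calcSparsity() returns 0.5.
> // Assumes an existing SparkContext named sc.
> val mat = new RowMatrix(sc.parallelize(Seq(
>   Vectors.dense(1.0, 2.0, 0.0),
>   Vectors.sparse(3, Seq((0, 4.0)))
> )))
> println(mat.calcSparsity())  // 0.5
> {code}
> Note that computing the sparsity requires extra actions (a sum and a count) over the rows RDD before the covariance computation itself.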
> Change the switching condition. Before, it was:
> {code:java}
>     if (rows.first().isInstanceOf[DenseVector]) {
>       computeDenseVectorCovariance(mean, n, m)
>     } else {
>       computeSparseVectorCovariance(mean, n, m)
>     }
> {code}
> Proposed new code is:
> {code:java}
>     val sparsityThreshold = 0.5
>     val sparsity = calcSparsity()
>     if (sparsity < sparsityThreshold) {
>       computeDenseVectorCovariance(mean, n, m)
>     } else {
>       computeSparseVectorCovariance(mean, n, m)
>     }
> {code}



