Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2017/05/09 03:37:04 UTC

[jira] [Assigned] (SPARK-7856) Scalable PCA implementation for tall and fat matrices

     [ https://issues.apache.org/jira/browse/SPARK-7856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7856:
-----------------------------------

    Assignee:     (was: Apache Spark)

> Scalable PCA implementation for tall and fat matrices
> -----------------------------------------------------
>
>                 Key: SPARK-7856
>                 URL: https://issues.apache.org/jira/browse/SPARK-7856
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Tarek Elgamal
>
> Currently, the PCA implementation is limited by the need to fit the d^2 entries of the covariance/Gramian matrix in memory (d is the number of columns/dimensions of the matrix). We often need only the largest k principal components. To make PCA truly scalable, I suggest an implementation whose memory usage is proportional to the number of principal components k rather than the full dimensionality d.
> I suggest adopting the solution described in this paper, published in SIGMOD 2015 (http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf). The paper offers an implementation of Probabilistic PCA (PPCA) that has lower memory and time complexity and could potentially scale to tall and fat matrices, rather than only the tall and skinny matrices supported by the current PCA implementation.
> Probabilistic PCA could be added to the set of algorithms supported by MLlib; it does not necessarily replace the existing PCA implementation.
> A PPCA implementation is included in MATLAB's Statistics and Machine Learning Toolbox (http://www.mathworks.com/help/stats/ppca.html)
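
To illustrate why a PPCA-style approach can avoid the d^2 memory cost mentioned in the description, here is a minimal single-machine sketch of the textbook EM iteration for probabilistic PCA, restricted to k = 1 so that it fits in plain Python. It is not the algorithm from the cited paper and not Spark or MLlib code; the function name `ppca_first_component` and its parameters are hypothetical. The point is only that each pass over the rows updates a length-d vector and a few scalars, and never forms the d x d covariance/Gramian matrix.

```python
import random


def ppca_first_component(rows, iters=200, seed=0):
    """EM for probabilistic PCA with a single component (k = 1).

    rows: list of equal-length lists of floats (n rows, d columns).
    Returns (w, sigma2): the length-d loading vector and the noise variance.
    Illustrative sketch only; names and defaults are assumptions.
    """
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])

    # Center the data (done eagerly here for simplicity).
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    total_sq = sum(v * v for r in centered for v in r)

    # Model state: O(d * k) numbers, here k = 1.
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]
    sigma2 = 1.0

    for _ in range(iters):
        m = sum(v * v for v in w) + sigma2  # scalar M = w^T w + sigma^2

        # E-step, fused with accumulation of the M-step statistics.
        sum_yx = [0.0] * d  # sum_i y_i * E[x_i]
        sum_xx = 0.0        # sum_i E[x_i^2]
        for y in centered:
            ex = sum(wj * yj for wj, yj in zip(w, y)) / m
            sum_xx += sigma2 / m + ex * ex
            for j in range(d):
                sum_yx[j] += y[j] * ex

        # M-step: new loading vector and noise variance.
        w = [s / sum_xx for s in sum_yx]
        cross = sum(a * b for a, b in zip(w, sum_yx))
        w_norm2 = sum(v * v for v in w)
        sigma2 = max((total_sq - 2.0 * cross + sum_xx * w_norm2) / (n * d), 0.0)

    return w, sigma2
```

Calling `ppca_first_component(rows)` on a list of numeric rows returns a vector whose direction approximates the first principal component. In a distributed setting, the inner loop over rows would presumably become a map/reduce that sums the per-row contributions to `sum_yx` and `sum_xx`, which is what keeps the per-iteration state small for wide matrices.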



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
