Posted to reviews@spark.apache.org by ghoto <gi...@git.apache.org> on 2017/05/09 03:35:26 UTC

[GitHub] spark pull request #17907: SPARK-7856 Principal components and variance usin...

GitHub user ghoto opened a pull request:

    https://github.com/apache/spark/pull/17907

    SPARK-7856 Principal components and variance using computeSVD()

    ## What changes were proposed in this pull request?
    
    The current implementation of computePrincipalComponentsAndExplainedVariance in RowMatrix usually crashes for tall and fat matrices, because it computes the covariance matrix locally on the driver.
    
    This implementation instead uses RowMatrix.computeSVD, which is already optimized for large matrices, to compute the principal components and explained variance.
    
    If a matrix X with mean µ is centered and decomposed with SVD such that
    
    (X - µ) = USV'
    and
    (X - µ)' = VS'U'
    
    then, since U and V are orthonormal (U'U = I and V'V = I),
    
    cov = (X - µ)'(X - µ) = VS'U'USV' = V(S'S)V'
    and
    cov V = V(S'S)
    
    so the columns of V are the eigenvectors of the covariance matrix and the diagonal of S'S contains its eigenvalues.
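The identity above can be checked numerically. A small NumPy sketch, illustrative only and not part of the patch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Xc = X - X.mean(axis=0)                       # X - µ

# SVD of the centered matrix: (X - µ) = U S V'
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
V = Vt.T

cov = Xc.T @ Xc                               # (X - µ)'(X - µ)

# cov V = V (S'S): columns of V are eigenvectors, s**2 the eigenvalues
assert np.allclose(cov @ V, V @ np.diag(s ** 2))

# explained variance ratios are the normalized eigenvalues
explained = s ** 2 / np.sum(s ** 2)
assert abs(explained.sum() - 1.0) < 1e-12
```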
    
    ## How was this patch tested?
    
    This patch was tested by running the existing RowMatrixSuite, mllib.PCASuite and ml.PCASuite; all tests pass.
    
    This patch also made it possible to run PCA on a 56k x 12k matrix without OutOfMemory errors (not included in the test suite, as it takes a long time to execute and is generated from a private dataset).
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ghoto/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17907.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17907
    
----
commit 956ce87cd151a9b30d181618aad7ef2a7ee859dc
Author: Ignacio Bermudez <ig...@gmail.com>
Date:   2017-05-09T02:25:06Z

    SPARK-7856 Principal components and variance using computeSVD()
    
    The previous computePrincipalComponentsAndExplainedVariance() evaluates
    the covariance matrix in a local Breeze matrix, causing OutOfMemory
    exceptions for tall and fat matrices.
    The SVD of the centered matrix X - mean(X) provides the eigenvectors
    and eigenvalues of the covariance matrix.
    
    X = U S V'  // (1)
    X' = V S'U'
    
    X'X = V S'U'U S V'
    X'X = V S'S V' // U'U = I
    (X'X)V = V (S'S)(V'V)
    (X'X)V = V (S'S) // V'V = I
    A*V = V*lambda // if A=X'X, V is the eigenvector matrix of X'X and can be obtained from (1)

commit a1082b26a19fd3c05b1af4bb59fa56d50e7c6cfa
Author: Ignacio Bermudez <ig...@gmail.com>
Date:   2017-05-09T03:14:30Z

    Added comment, small modification

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #17907: SPARK-7856 Principal components and variance using compu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17907
  
    Yes, it is likely more accurate not to base the PCA on the Gramian. However, the Gramian approach is probably going to be more efficient than the SVD method, even when the SVD operates locally. If this change makes other cases very slow, that could be just as bad. However, getting it to work with more than ~65535 columns is of course a good thing.
    
    How much memory does your driver have? At the size you're computing, the matrix should only take < 1GB of memory even with some overhead. This isn't large enough to run out of memory, assuming you've given your driver more than the default amount of memory.
    
    The real question is scaling past 65535 columns, which isn't possible with the Gramian no matter the heap size. But then the question is what happens in the regime of thousands of columns -- your change may make it a lot slower at sizes where it's pretty reasonable to compute locally.
    
    It could be that a threshold is needed here -- above some size it's probably better to distribute, below it to compute the Gramian locally, but we don't know what that scale is.
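    For reference, the driver-side footprint of the dense Gramian can be estimated directly. A back-of-the-envelope Python sketch; the explanation of the 65535 ceiling via the packed upper-triangular array is my reading, so treat it as an assumption:

```python
def gramian_bytes(n: int) -> int:
    # bytes for a dense n x n float64 covariance/Gramian matrix
    return n * n * 8

# roughly 1.15 GB for the ~12k-column case discussed above
size_gb = gramian_bytes(12_000) / 1e9

def packed_gramian_fits(n: int) -> bool:
    # the packed upper-triangular Gramian holds n*(n+1)/2 doubles,
    # which must fit in a single JVM array (< 2**31 - 1 elements)
    return n * (n + 1) // 2 <= 2**31 - 1

# 65535 columns still fit; 65536 do not
ok = packed_gramian_fits(65535) and not packed_gramian_fits(65536)
```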




[GitHub] spark pull request #17907: SPARK-7856 Principal components and variance usin...

Posted by ghoto <gi...@git.apache.org>.
Github user ghoto commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17907#discussion_r115502692
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
    @@ -384,19 +384,23 @@ class RowMatrix @Since("1.0.0") (
         val n = numCols().toInt
         require(k > 0 && k <= n, s"k = $k out of range (0, n = $n]")
     
    -    val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]]
    -
    -    val brzSvd.SVD(u: BDM[Double], s: BDV[Double], _) = brzSvd(Cov)
    -
    -    val eigenSum = s.data.sum
    -    val explainedVariance = s.data.map(_ / eigenSum)
    -
    -    if (k == n) {
    -      (Matrices.dense(n, k, u.data), Vectors.dense(explainedVariance))
    +    // Check matrix is standarized with mean 0
    +    val mean = computeColumnSummaryStatistics().mean
    +    val stdMat = if (mean.toArray.sum < 1E-9) {
    +      this  // If matrix is already centered in 0, then no need to standarize
         } else {
    -      (Matrices.dense(n, k, Arrays.copyOfRange(u.data, 0, n * k)),
    -        Vectors.dense(Arrays.copyOfRange(explainedVariance, 0, k)))
    +      // X' = X - µ
    +      def subPairs = (vPair: (Double, Double)) => vPair._1 - vPair._2
    +      def subMean = (v: Vector) => Vectors.dense(v.toArray.zip(mean.toArray).map(subPairs))
    --- End diff --
    
    It's not possible to keep sparse vectors sparse if we center them at the origin. However, given your concern about efficiency, I could try applying mllib.feature.StandardScaler with withMean = true and withStd = false to the data.
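    To make the efficiency point concrete, here is a hypothetical NumPy comparison of the per-element zip subtraction from the diff with a single vectorized broadcast (the function names are illustrative, not Spark APIs):

```python
import numpy as np

def center_zip(rows, mean):
    # element-wise zip/map, analogous to subPairs/subMean in the diff
    return [[x - m for x, m in zip(row, mean)] for row in rows]

def center_vec(X, mean):
    # one vectorized broadcast; what mean-centering via StandardScaler
    # amounts to in spirit
    return X - mean

X = np.arange(12.0).reshape(3, 4)
mean = X.mean(axis=0)
assert np.allclose(center_zip(X.tolist(), mean.tolist()), center_vec(X, mean))
```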




[GitHub] spark issue #17907: SPARK-7856 Principal components and variance using compu...

Posted by ghoto <gi...@git.apache.org>.
Github user ghoto commented on the issue:

    https://github.com/apache/spark/pull/17907
  
    My understanding is that RowMatrix computes the SVD locally when the data is small enough, to improve performance, and distributes it otherwise. The suggested implementation therefore does NOT always rely on a distributed SVD; depending on the data, the SVD can still be computed locally.
    
    To my knowledge, the current implementation can't handle PCA on tall and fat matrices, as it quickly runs OutOfMemory while computing the principal components from an eigenvector decomposition of X'X using Breeze SVD.
    
    Moreover, SVD-based PCA is preferred over an eigendecomposition of X'X for numerical reasons (https://math.stackexchange.com/questions/359397/why-svd-on-x-is-preferred-to-eigendecomposition-of-xx-top-in-pca); the SVD-based PCA in this implementation would not crash as easily as computing the eigenvectors of X'X locally.
    
    This implementation might be somewhat slower than the current one (debatable), but it is much more stable.
    
    From https://spark.apache.org/docs/2.1.0/mllib-dimensionality-reduction.html
    > If n is small (n < 100) or k is large compared with n (k > n/2), we compute the Gramian matrix first and then compute its top eigenvalues and eigenvectors locally on the driver. This requires a single pass with O(n^2) storage on each executor and on the driver, and O(n^2 k) time on the driver.
    > Otherwise, we compute (A'A)v in a distributive way and send it to ARPACK to compute (A'A)'s top eigenvalues and eigenvectors on the driver node. This requires O(k) passes, O(n) storage on each executor, and O(n k) storage on the driver.
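    The quoted heuristic can be mirrored directly. A hypothetical helper sketching the dispatch, with the thresholds taken verbatim from the docs:

```python
def svd_strategy(n: int, k: int) -> str:
    # mirror of the documented computeSVD heuristic (illustrative only)
    if n < 100 or k > n / 2:
        return "local-eigendecomposition"
    return "distributed-arpack"

# small n, or k close to n, stays local; otherwise distribute
assert svd_strategy(50, 10) == "local-eigendecomposition"
assert svd_strategy(1000, 600) == "local-eigendecomposition"
assert svd_strategy(1000, 10) == "distributed-arpack"
```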




[GitHub] spark issue #17907: SPARK-7856 Principal components and variance using compu...

Posted by ghoto <gi...@git.apache.org>.
Github user ghoto commented on the issue:

    https://github.com/apache/spark/pull/17907
  
    I think I need some time to run benchmarks. Originally the driver was set to 3GB, but since I was hitting this OutOfMemory in the driver I decided to try increasing the size.
    
    For the other benchmarks, I need more time.




[GitHub] spark issue #17907: SPARK-7856 Principal components and variance using compu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17907
  
    Can one of the admins verify this patch?




[GitHub] spark pull request #17907: SPARK-7856 Principal components and variance usin...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17907




[GitHub] spark pull request #17907: SPARK-7856 Principal components and variance usin...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17907#discussion_r115443849
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
    @@ -384,19 +384,23 @@ class RowMatrix @Since("1.0.0") (
         val n = numCols().toInt
         require(k > 0 && k <= n, s"k = $k out of range (0, n = $n]")
     
    -    val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]]
    -
    -    val brzSvd.SVD(u: BDM[Double], s: BDV[Double], _) = brzSvd(Cov)
    -
    -    val eigenSum = s.data.sum
    -    val explainedVariance = s.data.map(_ / eigenSum)
    -
    -    if (k == n) {
    -      (Matrices.dense(n, k, u.data), Vectors.dense(explainedVariance))
    +    // Check matrix is standarized with mean 0
    +    val mean = computeColumnSummaryStatistics().mean
    +    val stdMat = if (mean.toArray.sum < 1E-9) {
    +      this  // If matrix is already centered in 0, then no need to standarize
         } else {
    -      (Matrices.dense(n, k, Arrays.copyOfRange(u.data, 0, n * k)),
    -        Vectors.dense(Arrays.copyOfRange(explainedVariance, 0, k)))
    +      // X' = X - µ
    +      def subPairs = (vPair: (Double, Double)) => vPair._1 - vPair._2
    +      def subMean = (v: Vector) => Vectors.dense(v.toArray.zip(mean.toArray).map(subPairs))
    --- End diff --
    
    This is a pretty inefficient way to subtract the mean, and it's going to make sparse data dense.




[GitHub] spark issue #17907: SPARK-7856 Principal components and variance using compu...

Posted by ghoto <gi...@git.apache.org>.
Github user ghoto commented on the issue:

    https://github.com/apache/spark/pull/17907
  
    With the classic Spark PCA, a roughly 55K x 15K matrix, and 10GB on the driver, I run out of memory. If I chop the matrix down to 55K x 3K, I can get the PCA. With the distributed SVD approach I could compute PCA on the original matrix, and training took about 7 minutes.
    
    I also think that if the local approach is faster for small-to-medium matrices, we should be able to set a threshold and choose the PCA computation accordingly.
    
    On another note, I'm working in my spare time on a Probabilistic PCA implementation using EM. That should scale well and converges quickly; moreover, some variants tolerate missing values in the matrix.
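    For context on the probabilistic-PCA idea: in the zero-noise limit the EM iteration for PCA (following Roweis) is only a few lines. A NumPy sketch under that assumption, not the implementation being proposed here:

```python
import numpy as np

def em_pca(X, k, n_iter=200, seed=0):
    # EM for PCA in the zero-noise limit: returns an orthonormal basis of
    # the top-k principal subspace of the (assumed centered) rows of X
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], k))
    for _ in range(n_iter):
        Z = np.linalg.solve(W.T @ W, W.T @ X.T)      # E-step: latent coords
        W = X.T @ Z.T @ np.linalg.inv(Z @ Z.T)       # M-step: new basis
    Q, _ = np.linalg.qr(W)                           # orthonormalize
    return Q

# synthetic check: recover a planted 2-D subspace in 5 dimensions
rng = np.random.default_rng(1)
W_true = rng.standard_normal((5, 2))
X = rng.standard_normal((500, 2)) @ W_true.T + 0.01 * rng.standard_normal((500, 5))
X = X - X.mean(axis=0)
Q = em_pca(X, 2)
```

Each iteration only needs passes over the rows of X, which is why this style of algorithm can scale to matrices that do not fit on the driver.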




[GitHub] spark issue #17907: SPARK-7856 Principal components and variance using compu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17907
  
    Can one of the admins verify this patch?




[GitHub] spark pull request #17907: SPARK-7856 Principal components and variance usin...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17907#discussion_r115443970
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
    @@ -384,19 +384,23 @@ class RowMatrix @Since("1.0.0") (
         val n = numCols().toInt
         require(k > 0 && k <= n, s"k = $k out of range (0, n = $n]")
     
    -    val Cov = computeCovariance().asBreeze.asInstanceOf[BDM[Double]]
    -
    -    val brzSvd.SVD(u: BDM[Double], s: BDV[Double], _) = brzSvd(Cov)
    -
    -    val eigenSum = s.data.sum
    -    val explainedVariance = s.data.map(_ / eigenSum)
    -
    -    if (k == n) {
    -      (Matrices.dense(n, k, u.data), Vectors.dense(explainedVariance))
    +    // Check matrix is standarized with mean 0
    --- End diff --
    
    The scaladoc comments are no longer consistent with the impl.




[GitHub] spark issue #17907: SPARK-7856 Principal components and variance using compu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17907
  
    Can one of the admins verify this patch?




[GitHub] spark issue #17907: SPARK-7856 Principal components and variance using compu...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/17907
  
    Hm, that sounds like a lot more memory being used than I'd expect. How are you running the driver, and is it definitely the driver that runs out of memory? Do you have timings for local vs. distributed at a scale that works in both cases?




[GitHub] spark issue #17907: SPARK-7856 Principal components and variance using compu...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/17907
  
    Can one of the admins verify this patch?

