Posted to dev@mahout.apache.org by "Max Heimel (JIRA)" <ji...@apache.org> on 2010/09/25 15:50:32 UTC

[jira] Created: (MAHOUT-512) Principal Component Analysis for mahout-examples

Principal Component Analysis for mahout-examples
------------------------------------------------

                 Key: MAHOUT-512
                 URL: https://issues.apache.org/jira/browse/MAHOUT-512
             Project: Mahout
          Issue Type: New Feature
            Reporter: Max Heimel
            Assignee: Max Heimel
            Priority: Minor


h2.Overview:
Principal Component Analysis (PCA) is a widely used statistical method for decorrelation and dimensionality reduction. It is typically applied for denoising data, for data compression, and as a preprocessing step for more advanced statistical analysis tools (e.g. independent component analysis). The main idea of PCA is to apply a linear transformation that maps the given data points into a (typically lower-dimensional) feature space, such that the direction of greatest variance in the data lies along the first dimension of the feature space.
One approach to performing PCA is to transform the d-dimensional data into the space spanned by the p largest eigenvectors of the covariance matrix of the (centered) data. This is done in the following steps (assume the data is given as an n x d data matrix containing n data rows of dimensionality d); a small sequential sketch follows the list:
* Compute the d-dimensional empirical mean-vector m of the data points
* Center the data by subtracting m from each data point.
* Compute the covariance matrix C = E' * E (up to the constant factor 1/(n-1), which does not change the eigenvectors), where E is the centered data matrix and E' is its transpose.
* Compute the eigenvalue decomposition of C, such that e_i is the eigenvector corresponding to eigenvalue lambda_i, with the indices ordered so that lambda_i > lambda_j implies i < j.
* Create the transformation matrix T from the p largest eigenvectors (e_0 ... e_{p-1}) of C, taken as columns.
* Transform a data matrix into PCA space: R = (D - M) * T, where M stacks the mean vector m into every row.
* Transform data from PCA space back to the original space: D = (R * T') + M
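
For illustration, a small sequential in-memory sketch of these steps in plain Java (illustration only, not the proposed M/R jobs; power iteration with deflation stands in for a proper eigensolver, and the eigenvectors are stored as the rows of the returned matrix):
{code:java}
// Sequential, in-memory sketch of the steps above (illustration only).
// Power iteration with deflation stands in for a proper eigensolver;
// the returned matrix t is p x d, with eigenvector e_v stored in row v.
import java.util.Random;

public class SequentialPca {

  /** Empirical mean vector m of an n x d data matrix. */
  static double[] mean(double[][] data) {
    int n = data.length, d = data[0].length;
    double[] m = new double[d];
    for (double[] row : data)
      for (int j = 0; j < d; j++) m[j] += row[j] / n;
    return m;
  }

  /** Approximate top-p eigenvectors of C = E' * E, where E is the centered data. */
  static double[][] principalComponents(double[][] data, double[] m, int p) {
    int d = data[0].length;

    // C = E' * E, accumulated row by row from the centered data
    double[][] c = new double[d][d];
    for (double[] row : data)
      for (int j = 0; j < d; j++)
        for (int k = 0; k < d; k++)
          c[j][k] += (row[j] - m[j]) * (row[k] - m[k]);

    // Power iteration plus deflation: C is symmetric positive semi-definite,
    // so the eigenvectors are extracted in decreasing order of eigenvalue.
    double[][] t = new double[p][d];
    Random rnd = new Random(42);
    for (int v = 0; v < p; v++) {
      double[] x = new double[d];
      for (int j = 0; j < d; j++) x[j] = rnd.nextDouble();
      double lambda = 0;
      for (int iter = 0; iter < 200; iter++) {
        double[] y = new double[d];                  // y = C * x
        for (int j = 0; j < d; j++)
          for (int k = 0; k < d; k++) y[j] += c[j][k] * x[k];
        lambda = 0;
        for (double yj : y) lambda += yj * yj;
        lambda = Math.sqrt(lambda);                  // ||C * x|| converges to lambda_v
        for (int j = 0; j < d; j++) x[j] = y[j] / lambda;
      }
      t[v] = x;
      for (int j = 0; j < d; j++)                    // deflate: C <- C - lambda * x * x'
        for (int k = 0; k < d; k++) c[j][k] -= lambda * x[j] * x[k];
    }
    return t;
  }
}
{code}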

h2.Implementation overview:
I would suggest implementing PCA in the mahout-examples subproject. Some of the jobs could perhaps also go into mahout-core/math, if desired. The implementation would consist of a set of M/R jobs plus some glue code. I assume the data is given in HDFS as a file of d-dimensional data points, where each row corresponds to one data point. The map-reduce based implementation of PCA for Mahout would then consist of the following pieces:
# Distributed data-centering code that computes the empirical mean vector and centers the data.
# Distributed code for computing the covariance matrix.
# Sequential (?) glue code for computing the eigenvalue decomposition of the covariance matrix and constructing the transformation matrices T and T'. This code would use the existing eigensolvers of Mahout.
# Distributed code to transform data into PCA space. This code would use the existing distributed matrix multiplication of Mahout and the data-centering job.
# Distributed code to transform PCA data back into the original space. This code would also use the existing distributed matrix multiplication of Mahout and the data-centering job (a small sketch of both transform directions follows this list).
# Glue code that combines all jobs into a coherent implementation of PCA.
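
For illustration, an in-memory sketch of the forward and inverse transforms from items 4 and 5 (the proposal would of course run these as distributed matrix multiplications; as in the sketch above, the transformation matrix is stored row-wise here, so the code applies the same R = (D - M) * T and D = (R * T') + M formulas with t[v][j] playing the role of T[j][v]):
{code:java}
// In-memory sketch of the two transform directions (illustration only).
// t is the p x d row-wise transformation matrix from the sketch above,
// m is the empirical mean vector.
public class PcaTransforms {

  /** Forward transform R = (D - M) * T: project n x d data onto the p components. */
  static double[][] toPcaSpace(double[][] data, double[] m, double[][] t) {
    int n = data.length, p = t.length, d = m.length;
    double[][] r = new double[n][p];
    for (int i = 0; i < n; i++)
      for (int v = 0; v < p; v++)
        for (int j = 0; j < d; j++)
          r[i][v] += (data[i][j] - m[j]) * t[v][j];
    return r;
  }

  /** Inverse transform D = (R * T') + M: reconstruct (approximately, if p < d). */
  static double[][] fromPcaSpace(double[][] r, double[] m, double[][] t) {
    int n = r.length, p = t.length, d = m.length;
    double[][] data = new double[n][d];
    for (int i = 0; i < n; i++)
      for (int j = 0; j < d; j++) {
        data[i][j] = m[j];
        for (int v = 0; v < p; v++) data[i][j] += r[i][v] * t[v][j];
      }
    return data;
  }
}
{code}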

h2.Implementation details:
For a complete implementation the following three map/reduce jobs have to be written:
# An M/R job to compute the empirical mean vector. Each mapper gets a row (data point) and emits (dimension, [value, 1]) tuples. A local combiner pre-aggregates the values for each dimension to (dimension, [sum of values, number of data points]). Finally, a (single) reducer computes the average for each dimension and stores the mean vector back to HDFS (a minimal Hadoop sketch of this job follows the list).
# An M/R job to center (or uncenter) the data. The job consists of a single map phase that reads in a row (data point) of the original matrix, subtracts (or adds) the empirical mean vector, and emits (row number, new row) pairs that are written back to disk.
# An M/R job to compute the matrix product of a large matrix with its transpose. Basic idea: each mapper reads a row of the matrix and computes the products of all component combinations (e.g. [a,b,c] --> (0,aa)(1,ab)(2,ac)(3,bb)(4,bc)(5,cc)), where the key corresponds to the position within the symmetric result matrix. A local combiner sums up the local products for each key, the reducer sums up the per-mapper sums for each key and stores (position, value) tuples to disk, and finally a (local) clean-up phase constructs the covariance matrix from the (position, value) tuples. Since the covariance matrix is d x d and d is typically small compared to the number of data points, I assume this final local step should be fine.
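
For illustration, a minimal Hadoop sketch of job (1), the empirical mean vector. It assumes plain-text input where each line is one whitespace-separated data point (a real implementation would more likely read SequenceFiles of Mahout vectors), and the class names are placeholders:
{code:java}
// Sketch of the mean-vector job: mapper emits (dimension, "value \t 1"),
// a combiner pre-aggregates to (dimension, "partial sum \t partial count"),
// and a single reducer writes (dimension, mean) per dimension.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MeanVectorJob {

  public static class SumMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().trim().split("\\s+");
      for (int dim = 0; dim < parts.length; dim++) {
        ctx.write(new IntWritable(dim), new Text(parts[dim] + "\t1"));
      }
    }
  }

  public static class SumCombiner extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable dim, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0; long count = 0;
      for (Text v : values) {
        String[] parts = v.toString().split("\t");
        sum += Double.parseDouble(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      ctx.write(dim, new Text(sum + "\t" + count));
    }
  }

  public static class MeanReducer extends Reducer<IntWritable, Text, IntWritable, DoubleWritable> {
    @Override
    protected void reduce(IntWritable dim, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      double sum = 0; long count = 0;
      for (Text v : values) {
        String[] parts = v.toString().split("\t");
        sum += Double.parseDouble(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      ctx.write(dim, new DoubleWritable(sum / count));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pca-mean-vector");
    job.setJarByClass(MeanVectorJob.class);
    job.setMapperClass(SumMapper.class);
    job.setCombinerClass(SumCombiner.class);
    job.setReducerClass(MeanReducer.class);
    job.setNumReduceTasks(1);                       // single reducer, as proposed
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
{code}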

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Created: (MAHOUT-512) Principal Component Analysis for mahout-examples

Posted by Ted Dunning <te...@gmail.com>.
Glad that Steve chimed in.  The value of SVD here is not just related to the
fact that it is already implemented.  There is
a more subtle point as well.

Many algorithms start with the fairly innocent "subtract the mean from the
matrix" step.  This is often critical to getting good numerical
properties in dense matrices.

In sparse problems, however, subtracting the mean directly converts a sparse
problem into a dense problem and completely destroys any hope of
scalability.  The effect is changing O(n) problems into O(n^2) or O(n^3)
where n is in the millions or billions.

Generally the solution is to go back to the algebra and figure out a way to
carry the offset along in separate form or to analyze the problem without
the mean centering.  With very, very sparse data such as is common in highly
scaled systems, it is often true that the mean is so close to zero that the
mean subtraction step doesn't improve the numerical properties anyway.  Or
it may be that the mean vector could be usefully modeled as a relatively
sparse vector with only a few means being significantly non-zero.  It is
also common for the elements with significantly non-zero means to be
uninteresting ... typically these are items that should be on a kill-list or
heavily down-sampled in any case.
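
One standard way to carry the offset along, sketched in the notation of the JIRA
description: with centered data E = D - 1*m' (one copy of the mean vector m per row),
the Gram matrix is

  E'E = (D - 1*m')' * (D - 1*m') = D'D - n*m*m'

and the matrix-vector products an iterative eigensolver (or an SVD/Lanczos solver
applied to E) needs can be evaluated as

  E'E*v = D'(D*v) - n*(m'*v)*m

so only the sparse matrix D and the small dense vector m are ever touched; the dense
centered matrix E never has to be materialized.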

On Sat, Sep 25, 2010 at 10:46 AM, Max Heimel <mh...@googlemail.com> wrote:

> Hi Steve,
>
> yes, you are obviously right. The computation and eigen-decomposition
> of the covariance matrix can be replaced by SVD. Thanks for the hint,
> shame on me :)
> I will adapt the proposal in JIRA accordingly.
>
> Thanks
> Max
>

Re: [jira] Created: (MAHOUT-512) Principal Component Analysis for mahout-examples

Posted by Max Heimel <mh...@googlemail.com>.
Hi Steve,

yes, you are obviously right. The computation and eigen-decomposition
of the covariance matrix can be replaced by SVD. Thanks for the hint,
shame on me :)
I will adapt the proposal in JIRA accordingly.

Thanks
Max

On Sat, Sep 25, 2010 at 7:23 PM, Steve Lianoglou
<ma...@gmail.com> wrote:
> Hi,
>
> I'm just a casual lurker and am not too well versed in the mahout
> universe, so please forgive the intrusion, but doesn't mahout already
> have an SVD implementation?
>
> If so, then you've already got the lion's share of a PCA
> implementation right there, no? Here's a nice/accessible overview:
>
> Singular value decomposition and principal component analysis
> http://public.lanl.gov/mewall/kluwer2002.html
>
> See the section entitled "Relation to principal component analysis"
> for an explanation of how the right/left singular vectors are related
> to the eigenvectors of your cov(X) matrix (which is pretty much the
> central "nut" for PCA).
>
> -steve

Re: [jira] Created: (MAHOUT-512) Principal Component Analysis for mahout-examples

Posted by Steve Lianoglou <ma...@gmail.com>.
Hi,

I'm just a casual lurker and am not too well versed in the mahout
universe, so please forgive the intrusion, but doesn't mahout already
have an SVD implementation?

If so, then you've already got the lion's share of a PCA
implementation right there, no? Here's a nice/accessible overview:

Singular value decomposition and principal component analysis
http://public.lanl.gov/mewall/kluwer2002.html

See the section entitled "Relation to principal component analysis"
for an explanation of how the right/left singular vectors are related
to the eigenvectors of your cov(X) matrix (which is pretty much the
central "nut" for PCA).

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact