Posted to dev@mahout.apache.org by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org> on 2011/09/21 20:15:11 UTC

[jira] [Created] (MAHOUT-817) Add PCA options to SSVD code

Add PCA options to SSVD code
----------------------------

                 Key: MAHOUT-817
                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
             Project: Mahout
          Issue Type: New Feature
    Affects Versions: 0.6
            Reporter: Dmitriy Lyubimov
            Assignee: Dmitriy Lyubimov
             Fix For: 0.6


It seems that a simple solution should exist to integrate PCA mean subtraction into the SSVD algorithm without making it a prerequisite step, while also avoiding densifying the big input.

Several approaches were suggested:

1) subtract the mean from B
2) propagate the mean vector algebraically deeper into the algorithm, to where the data is already collapsed into smaller matrices
3) --?

It needs some math done first. I'll take a stab at 1 and 2, but thoughts and math are welcome.
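
A minimal sketch of the algebra behind option 2, assuming rows of A are the data points and m is the mean of those rows. It relies only on the identity (A - 1*m') * Omega = A*Omega - 1*(m'*Omega), so the sparse A is multiplied untouched and the mean enters as a small correction on the projected result. The matrices and helper names below are made up for illustration; this is not the Mahout implementation.

```python
# Illustrative sketch only: plain-Python matrices stored as lists of rows.
# Shows that projecting mean-centered data does not require forming the
# dense matrix (A - 1*m'); the mean enters as a rank-one correction.

def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def column_means(a):
    """Mean of the rows of `a`, i.e. one mean per column."""
    n = len(a)
    return [sum(col) / n for col in zip(*a)]

# Small, mostly-zero input standing in for a big sparse A (4 x 3).
A = [[1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0],
     [0.0, 0.0, 3.0],
     [4.0, 0.0, 0.0]]
# Fixed stand-in for the random projection matrix Omega (3 x 2).
OMEGA = [[0.5, -1.0],
         [1.5, 0.25],
         [-0.75, 2.0]]

m = column_means(A)

# Brute force: densify A by subtracting the mean from every row, then project.
A_centered = [[x - mu for x, mu in zip(row, m)] for row in A]
Y_brute = matmul(A_centered, OMEGA)

# Implicit: project the untouched A, then subtract the row vector m'*Omega
# from every row of the (small) projected result.
m_omega = matmul([m], OMEGA)[0]
Y_implicit = [[y - c for y, c in zip(row, m_omega)] for row in matmul(A, OMEGA)]

max_diff = max(abs(p - q)
               for rp, rq in zip(Y_brute, Y_implicit)
               for p, q in zip(rp, rq))
```

Both routes give the same Y up to rounding; the open question in this issue is how the same trick carries through the later steps.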

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158210#comment-13158210 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

Yes, the expectation is zero, but unfortunately I think the variance is going to be big regardless of the input size. So the m Omega term is still a problem. For my problems, its brute-force computation will actually take more than, e.g., squaring my input. So it was a first thought, but I don't think it is valid enough. I withdraw this for now.
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment:     (was: SSVD-PCA options.pdf)
    

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: SSVD-PCA options.pdf

fixed(?)
                

        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158057#comment-13158057 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

The way I understood Ted's original idea: since we are performing a projection into B, the center of the original data would also project onto the center of the projected data (in this case, the data are column vectors).

If row vectors are implied as the PCA items, that means subtracting the row mean. I am not 100% sure how this works, but it seems this case can be solved by finding the row mean of Y and proceeding with Y-M_y instead of Y. However, I am not sure at all how it plays out, especially with power iterations. It would seem to me that a random projection of centered data may not be the same as one of non-centered data in the context of this method. I don't immediately see this.

Even subtracting the mean in B may affect the accuracy, because the random projection captured the action of the original data, not necessarily that of the centered data. Once the data is centered, the optimal subspace capturing the variances might be quite different from the original subspace produced in Q. That's why I say maybe the brute-force approach is the right one. At least I can easily convince myself it is what PCA defines.

In addition, the main difficulty is that to know the mean of A we need one separate pass over A (at least with a row mean), and the whole idea is that we can probably do it on the fly somewhere else, with already projected data.

bq. One question: is it necessary to do mean-subtraction of A before computing the QR decomposition, or will the columns of Q still
form a good basis even without mean-subtraction?

That's exactly my concern. I think this breaks the fundamental premise of the method (unless it somehow magically turns out to be just as good, but it would seem to me it is not; at least I can construct a visual counterexample in my head).
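
One possible counterexample of this kind, as a hedged sketch: three 2-D points that sit far from the origin and vary along the anti-diagonal. The leading direction of the raw data points at the mean, while the leading direction of the centered data (what PCA wants) is orthogonal to it. The points and helper names are invented for illustration, and exact power iteration on the 2x2 Gram matrix stands in for the randomized method.

```python
import math

def gram(points):
    """Return the 2x2 matrix A'A for 2-D points stored as rows of A."""
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    syy = sum(y * y for _, y in points)
    return [[sxx, sxy], [sxy, syy]]

def top_direction(g, iterations=50):
    """Leading eigenvector of a symmetric 2x2 matrix via power iteration."""
    v = [1.0, 0.0]
    for _ in range(iterations):
        w = [g[0][0] * v[0] + g[0][1] * v[1],
             g[1][0] * v[0] + g[1][1] * v[1]]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    return v

# Data items are rows here; the mean row is (10, 10).
points = [(9.0, 11.0), (10.0, 10.0), (11.0, 9.0)]
mean = (sum(x for x, _ in points) / 3, sum(y for _, y in points) / 3)
centered = [(x - mean[0], y - mean[1]) for x, y in points]

raw_dir = top_direction(gram(points))     # dominated by the offset from the origin
pca_dir = top_direction(gram(centered))   # direction of actual variance

# |cosine| between the two directions; close to 0 means orthogonal.
overlap = abs(raw_dir[0] * pca_dir[0] + raw_dir[1] * pca_dir[1])
```

A rank-1 basis built from the raw data would therefore capture none of the centered variance in this toy case.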

So assume we need to do the subtraction before attempting to find a good basis for the projection. Then the case of the column-wise mean is easy: we can do it on the fly, and we need just one pass over the data while doing the Y and Q stuff. If we want a row-wise mean, brute force requires one more pass to acquire the mean.
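
The difference in pass count can be sketched as follows, under one reading of the terminology: when data items are columns, the mean item has one entry per row, which is known as soon as that row is read; when data items are rows, the mean row is only known after every row has been seen. The function names and the `read_rows` callable are hypothetical, and plain lists stand in for the distributed input.

```python
def center_by_own_mean(rows):
    """Single pass: each row is centered by the mean of its own entries.

    Matches the reading where data items are columns, so the mean item has
    one entry per row and that entry is available when the row is read.
    """
    for row in rows:
        mu = sum(row) / len(row)
        yield [x - mu for x in row]

def center_by_mean_row(read_rows):
    """Two passes: data items are rows, so the mean row needs all rows first.

    `read_rows` is a hypothetical callable that re-opens the input and
    returns a fresh iterator, standing in for a second scan of A.
    """
    count, totals = 0, None
    for row in read_rows():            # pass 1: accumulate per-column sums
        totals = list(row) if totals is None else [t + x for t, x in zip(totals, row)]
        count += 1
    mean_row = [t / count for t in totals]
    for row in read_rows():            # pass 2: subtract the mean row
        yield [x - m for x, m in zip(row, mean_row)]

data = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
one_pass = list(center_by_own_mean(iter(data)))
two_pass = list(center_by_mean_row(lambda: iter(data)))
```

The first generator can be fused into whatever job already streams over the rows; the second cannot start emitting until the first scan has finished.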

bq. It seems there are two jobs that need to be modified: BBT-job and V-job. Since they both work column wise it should
be straightforward to pass in the vector qs and the scalar a_mean( i ).

The BBt job is now obsolete. BBt is now produced in the reducers of the Bt job as a bonus and finalized in the front end.


                

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: ssvd-tests.R
                ssvd.R
    

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: SSVD-PCA options.pdf

updated math document
                

        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158191#comment-13158191 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

Still need to give a bit of thought to how it all works with power iterations; there need to be changes there as well.

                

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: SSVD-PCA options.pdf
    

        

[jira] [Issue Comment Edited] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Raphael Cendrillon (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158010#comment-13158010 ] 

Raphael Cendrillon edited comment on MAHOUT-817 at 11/27/11 8:51 PM:
---------------------------------------------------------------------

Could you expand on this a little?

If I understand correctly we need to implicitly do mean-subtraction of A whenever we work with B.
It seems this is equivalent to subtracting qs'*a_mean from B, where qs is the sum of the rows of Q
and a_mean is the mean of the rows of A. So if bi is the ith column of B then the column with
implicit mean-subtraction of A is

  bi - qs'*a_mean( i )

where a_mean( i ) is the ith element of a_mean.
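
A small numeric check of this identity, as a sketch under the assumption that B = Q'*A as in the usual SSVD flow. It compares the implicit correction bi - qs'*a_mean( i ) against explicitly centering A and then projecting. The tiny A and Q below are made up for the check, and the helper names are not from the Mahout code.

```python
def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Made-up input (4 x 3) and a Q with orthonormal columns (4 x 2).
A = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 0.0],
     [4.0, 0.0, 0.0],
     [0.0, 5.0, 6.0]]
Q = [[0.5, 0.5],
     [0.5, -0.5],
     [0.5, 0.5],
     [0.5, -0.5]]

a_mean = [sum(col) / len(A) for col in zip(*A)]   # mean of the rows of A
qs = [sum(col) for col in zip(*Q)]                # sum of the rows of Q

B = matmul(transpose(Q), A)                       # assumed definition: B = Q'A

# Column i of the corrected B is  bi - qs' * a_mean(i).
B_implicit = [[B[r][i] - qs[r] * a_mean[i] for i in range(len(a_mean))]
              for r in range(len(qs))]

# Reference: explicitly center A (dense), then project.
A_centered = [[x - mu for x, mu in zip(row, a_mean)] for row in A]
B_explicit = matmul(transpose(Q), A_centered)

max_diff = max(abs(p - q)
               for rp, rq in zip(B_implicit, B_explicit)
               for p, q in zip(rp, rq))
```

The check only confirms the algebra for a given Q; it says nothing about whether a Q computed from uncentered data is a good basis.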

It seems there are two jobs that need to be modified: BBT-job and V-job. Since they both work column-wise, it should
be straightforward to pass in the vector qs and the scalar a_mean( i ).

One question: is it necessary to do mean-subtraction of A before computing the QR decomposition, or will the columns of Q still
form a good basis even without mean-subtraction?

Could you explain what the 'column mean' is? I thought that each data point corresponds to a row in A, so that subtraction of row means
would be more appropriate?




                

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment:     (was: ssvd-tests.R)
    

        

Re: [jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by Raphael Cendrillon <ce...@gmail.com>.
Sounds good. Let me take a look. 

Happy holidays!!

On Dec 25, 2011, at 4:16 PM, "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org> wrote:

> 
>    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175867#comment-13175867 ] 
> 
> Dmitriy Lyubimov commented on MAHOUT-817:
> -----------------------------------------
> 
> bq. Thanks for merging Dmitriy. Is there anything you need from me at this point?
> 
> I would always appreciate it if you could poke at the CLI version and verify it independently with a MATLAB test, checking the precision of the computed singular values and the V output on a larger input.
> 
> (I am still working on reading Mahout files into R and merging with RHadoop; when that's done I will be able to verify larger tests with R.)
> 
> -d
> 

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175867#comment-13175867 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

bq. Thanks for merging Dmitriy. Is there anything you need from me at this point?

I would always appreciate it if you could poke at the CLI version and verify it independently with a MATLAB test, checking the precision of the computed singular values and the V output on a larger input.
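
As a rough idea of what such a check might compute, here is a hedged sketch in plain Python rather than MATLAB. It assumes the singular values and V from both the SSVD run and a reference SVD have already been loaded into lists; the helper names and the numbers are invented purely to exercise the code and are not real SSVD output.

```python
import math

def max_relative_error(computed, reference):
    """Largest relative error between two equally long lists of singular values."""
    return max(abs(c - r) / abs(r) for c, r in zip(computed, reference))

def column_alignment(v_computed, v_reference, j):
    """|cosine| between column j of two V matrices stored as lists of rows.

    Singular vectors are only defined up to sign, so the absolute value is
    compared against 1 instead of comparing entries directly.
    """
    a = [row[j] for row in v_computed]
    b = [row[j] for row in v_reference]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return abs(dot) / (norm_a * norm_b)

# Made-up numbers; in practice these would come from the two SVD runs.
sigma_ref = [10.0, 5.0, 1.0]
sigma_ssvd = [10.0, 5.0005, 0.9999]
v_ref = [[0.6, 0.8], [0.8, -0.6]]
v_ssvd = [[-0.6, 0.8], [-0.8, -0.6]]   # first column flipped in sign

sigma_err = max_relative_error(sigma_ssvd, sigma_ref)
alignments = [column_alignment(v_ssvd, v_ref, j) for j in range(2)]
```

What tolerance counts as acceptable is a separate judgment that depends on the rank, oversampling and number of power iterations used.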

(I am still working on reading Mahout files into R and merging with RHadoop; when that's done I will be able to verify larger tests with R.)

-d
                

        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214118#comment-13214118 ] 

Hudson commented on MAHOUT-817:
-------------------------------

Integrated in Mahout-Quality #1361 (See [https://builds.apache.org/job/Mahout-Quality/1361/])
    MAHOUT-817 PCA options for SSVD (RC1) (Revision 1292532)

     Result = SUCCESS
dlyubimov : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1292532
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/ABtDenseOutJob.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/Omega.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/PartialRowEmitter.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/QJob.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCli.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototype.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDSolver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/UJob.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/VJob.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/YtYJob.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDPCADenseTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverDenseTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverSparseSequentialTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCommonTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototypeTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDTestsHelper.java

                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.7
>
>         Attachments: MAHOUT-817-RC1.patch, MAHOUT-817.patch, MAHOUT-817.patch, MAHOUT-817.patch, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: MAHOUT-817-RC1.patch

refreshing the attached patch (called RC1) to correspond to what was posted on review board.
                

        

[jira] [Issue Comment Edited] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Raphael Cendrillon (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158010#comment-13158010 ] 

Raphael Cendrillon edited comment on MAHOUT-817 at 11/27/11 8:51 PM:
---------------------------------------------------------------------

Could you expand on this a little?

If I understand correctly we need to implicitly do mean-subtraction of A whenever we work with B.
It seems this is equivalent to subtracting qs'*a_mean from B, where qs is the sum of the rows of Q
and a_mean is the mean of the rows of A. So if bi is the ith column of B then the column with
implicit mean-subtraction of A is

  bi - qs'*a_mean( i )

where a_mean( i ) is the ith element of a_mean.
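
A minimal sketch of the identity described above, in plain Python on small dense lists. It assumes B = Q'A, that a_mean holds the per-column means of A (the "mean row"), and that qs is the sum of the rows of Q; the random Q here is only a stand-in, since the identity is purely algebraic and does not depend on Q being orthonormal. The function names are hypothetical and are not part of the Mahout code.

```python
import random

def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    return [list(col) for col in zip(*x)]

def column_means(a):
    """Mean row of A, i.e. the vector of per-column means."""
    return [sum(col) / len(a) for col in zip(*a)]

def b_brute_force(q, a):
    """B = Q'(A - 1 a_mean'), with A explicitly mean-subtracted (densified)."""
    mean = column_means(a)
    centered = [[v - mu for v, mu in zip(row, mean)] for row in a]
    return matmul(transpose(q), centered)

def b_corrected(q, a):
    """B = Q'A - qs a_mean', where qs is the sum of the rows of Q."""
    mean = column_means(a)
    qs = [sum(col) for col in zip(*q)]
    b = matmul(transpose(q), a)
    # Column i of the result is b_i - qs' * a_mean(i).
    return [[b[j][i] - qs[j] * mean[i] for i in range(len(mean))]
            for j in range(len(qs))]

rng = random.Random(0)
a = [[rng.random() for _ in range(4)] for _ in range(6)]  # 6 x 4 input
q = [[rng.random() for _ in range(2)] for _ in range(6)]  # 6 x 2 stand-in for Q
max_diff = max(abs(x - y)
               for r1, r2 in zip(b_brute_force(q, a), b_corrected(q, a))
               for x, y in zip(r1, r2))
print(max_diff < 1e-12)
```

The corrected form never touches a mean-subtracted copy of A, which is the point of the proposal: only B, qs and a_mean are needed.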

It seems there are two jobs that need to be modified: BBT-job and V-job. Since they both work column wise it should
be straightforward to pass in the vector qs and the scalar a_mean( i ).

One question: is it necessary to do mean-subtraction of A before computing the QR decomposition, or will the columns of Q still
form a good basis even without mean-subtraction?

Could you explain what the 'column mean' is? I thought that each data point corresponds to a row in A, so that subtraction of row means
would be more appropriate?




                




                  
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210536#comment-13210536 ] 

jiraposter@reviews.apache.org commented on MAHOUT-817:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3863/
-----------------------------------------------------------

(Updated 2012-02-17 20:43:22.593328)


Review request for mahout.


Changes
-------

commit 996464eb600400745baf25498606aca115cb7e96
Merge: cd48627 aa7e1d8
Author: Dmitriy Lyubimov <dl...@inadco.com>
Date:   Fri Feb 17 12:40:26 2012 -0800

    Merge remote-tracking branch 'apache/trunk' into MAHOUT-817
    
    Conflicts:
        core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCli.java


Summary
-------


2d542fd4dfcc6e01577bddc28600632a88e358ee Merge remote-tracking branch 'apache/trunk' into MAHOUT-817
1f245bb5cc1354e7495ec62fbc5f41ed6d590210 Merge branch 'trunk' into MAHOUT-817
458d8112de180c93d5194d67ccfc00442ed1d460 Merge remote-tracking branch 'apache/trunk' into MAHOUT-817
3fea9bd981043e268dd003d4c6c3943bb570c0f7 added test, bug fixes
2725c1061c167126238d288039f0f68baafa7dc8 adding --pca and --pcaOffset options, minor fixes
48c7b425241afff42ce52d3bb005a87aeb68386d fixing front end to factor in the median data.
4e072615ac2b8a256d037aaf00db21820abb91e2 tweaking B' job to produce necessary correctors s_q and s_b
b10fefd8d4aa5a0ed2f60902904d551afbbdf57e cosmetic fixes
849171d3af75117a2ee1115e6d5fc8e4a1fff5ce comment
6c196ea9606b3ca05d401fa1474ee9262a6c0303 retrofitting V job to do pca correction
e6fbe7cdb606698f180127302c33d30fffc6c4d7 adding pca options to Q,ABt jobs. still need to work on B'-job, V-job and front-end pca corrections.
ecf5dd21c5d5805d70715a78abd07246d171536c Computing s_b0
b9b33cf72af85ade16fcfbf4e13a036877489afb comments
9bb6e971c68e0674b087b8c5d64f4967878f1834 More cleanup in favor of standard functions, unit tests pass but need to verify the 2G benchmark.
39faa70158b52e50d31aca2abc4006874a9ea8fd cleanup I
780b291eb902e0e832d41748d45bf6d2163f9537 cosmetic changes, adding api with out redundant parameters
02daf0024489305032320c578ac546c16bda31c1 current MAHOUT-923 patch from Raphael


This addresses bug MAHOUT-817.
    https://issues.apache.org/jira/browse/MAHOUT-817


Diffs (updated)
-----

  core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 3e0dd5e 
  core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRE-CREATION 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/ABtDenseOutJob.java c52fe2a 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java 0c3a996 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/Omega.java 0fa8707 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/PartialRowEmitter.java 59bdedb 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/QJob.java 703c420 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCli.java d314186 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java PRE-CREATION 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototype.java 98c8c59 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDSolver.java b1a8b56 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/UJob.java 53f26f4 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/VJob.java d58789e 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/YtYJob.java bd8c6b1 
  core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 0ef8622 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDPCADenseTest.java PRE-CREATION 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverDenseTest.java 59f79c5 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverSparseSequentialTest.java beb0102 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCommonTest.java PRE-CREATION 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototypeTest.java 503433f 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDTestsHelper.java 32342c1 

Diff: https://reviews.apache.org/r/3863/diff


Testing
-------

Additional unit tests for PCA


Thanks,

Dmitriy


                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.7
>
>         Attachments: MAHOUT-817.patch, MAHOUT-817.patch, MAHOUT-817.patch, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158196#comment-13158196 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

Computation of m*Omega may also be fairly involved: even though it is only a vector-matrix multiplication, Omega is dense and bigger than the input, although we don't have to move it around. Maybe for big inputs we can just take the mathematical expectation of this. For the uniform distribution of murmur (is it uniform?) we can perhaps ignore the whole m x Omega term because it converges to 0 by the law of large numbers.
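
A rough way to check this idea numerically, sketched in plain Python. It assumes the entries of Omega are independent and uniform on (-1, 1), which is an assumption and not a statement about the actual murmur-based generator, and it uses a hypothetical constant mean vector m. Under that assumption each entry of m*Omega has expectation 0 but variance ||m||^2 / 3, so the term averages out over many draws without any single entry being small.

```python
import random

def theoretical_variance(m):
    """Variance of sum_i m[i] * w[i] for independent w[i] ~ Uniform(-1, 1)."""
    return sum(x * x for x in m) / 3.0

def simulate(m, trials, seed=0):
    """Empirical mean and variance of one entry of m * Omega."""
    rng = random.Random(seed)
    samples = [sum(x * rng.uniform(-1.0, 1.0) for x in m) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / (trials - 1)
    return mean, var

m = [0.5] * 200  # hypothetical column-mean vector
mean, var = simulate(m, trials=5000)
print(mean, var, theoretical_variance(m))
```

The empirical mean stays near zero while the variance tracks ||m||^2 / 3, so whether the term can be dropped depends on the size of m rather than on the expectation alone.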
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Work started] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Work started) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on MAHOUT-817 started by Dmitriy Lyubimov.

> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.7
>
>         Attachments: MAHOUT-817-RC1.patch, MAHOUT-817.patch, MAHOUT-817.patch, MAHOUT-817.patch, SSVD-CLI.pdf, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Issue Comment Edited] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163172#comment-13163172 ] 

Dmitriy Lyubimov edited comment on MAHOUT-817 at 12/5/11 11:38 PM:
-------------------------------------------------------------------

So I did an R simulation of the column-wise mean and it seems to work, so I think this verifies the math.

I still need to finish the doc (it also has a little typo in it); I will finish it from home, as I don't seem to have the doc source on me here.

I guess it clears the way for implementation on the existing SSVD solver.

Test results comparing "brute forced" SVD with the "median propagated" version:
{code}


> respci$svalues
 [1] 9.9995227 8.9992220 7.9907894 6.9860235 5.9786348 4.9866553 3.9853651
 [8] 2.9735904 1.9999941 0.9971344
> ressvd$svalues
 [1] 9.9995227 8.9992220 7.9907894 6.9860235 5.9786348 4.9866553 3.9853651
 [8] 2.9735904 1.9999941 0.9971344
> 
{code}
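
A small sketch, in plain Python, of why the two sets of singular values can agree. It is not the R script attached to the issue. It assumes the correction is applied to BB' through two vectors, here called s_q = Q'1 and s_b = B*m with m the column-mean vector; those definitions are an assumption made for illustration. If the corrected BB' equals the BB' of the explicitly mean-subtracted input, the singular values derived from it are equal as well.

```python
import random

def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    return [list(col) for col in zip(*x)]

def gram_brute_force(q, a):
    """BB' computed from explicitly mean-subtracted A."""
    m = [sum(col) / len(a) for col in zip(*a)]
    centered = [[v - mu for v, mu in zip(row, m)] for row in a]
    b = matmul(transpose(q), centered)
    return matmul(b, transpose(b))

def gram_propagated(q, a):
    """BB' computed from uncentered A, then corrected with s_q and s_b."""
    m = [sum(col) / len(a) for col in zip(*a)]
    b = matmul(transpose(q), a)                                # B = Q'A
    s_q = [sum(col) for col in zip(*q)]                        # assumed: Q'1
    s_b = [sum(v * mu for v, mu in zip(row, m)) for row in b]  # assumed: B m
    mm = sum(mu * mu for mu in m)                              # m'm
    bbt = matmul(b, transpose(b))
    k = len(s_q)
    # (B - s_q m')(B - s_q m')' = BB' - s_b s_q' - s_q s_b' + (m'm) s_q s_q'
    return [[bbt[i][j] - s_b[i] * s_q[j] - s_q[i] * s_b[j] + mm * s_q[i] * s_q[j]
             for j in range(k)] for i in range(k)]

rng = random.Random(1)
a = [[rng.random() for _ in range(5)] for _ in range(8)]  # 8 x 5 input
q = [[rng.random() for _ in range(3)] for _ in range(8)]  # 8 x 3 stand-in for Q
diff = max(abs(x - y)
           for r1, r2 in zip(gram_brute_force(q, a), gram_propagated(q, a))
           for x, y in zip(r1, r2))
print(diff < 1e-9)
```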
                
                  
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Comment: was deleted

(was: Computation of m*Omega may also be fairly involved: even though it is only a vector-matrix multiplication, Omega is dense and bigger than the input, although we don't have to move it around. Maybe for big inputs we can just take the mathematical expectation of this. For the uniform distribution of murmur (is it uniform?) on (-1, 1) that is currently used, we can perhaps ignore the whole m x Omega term because it converges to 0 by the law of large numbers.)
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210548#comment-13210548 ] 

jiraposter@reviews.apache.org commented on MAHOUT-817:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3863/
-----------------------------------------------------------

(Updated 2012-02-17 20:50:01.339012)


Review request for mahout.


Changes
-------

commit 95d5934405d1ca51e13439a43e0fc793418e5d37
Author: Dmitriy Lyubimov <dl...@inadco.com>
Date:   Fri Feb 17 12:48:37 2012 -0800

    Fixing option recovery based on new api changes


Summary
-------


2d542fd4dfcc6e01577bddc28600632a88e358ee Merge remote-tracking branch 'apache/trunk' into MAHOUT-817
1f245bb5cc1354e7495ec62fbc5f41ed6d590210 Merge branch 'trunk' into MAHOUT-817
458d8112de180c93d5194d67ccfc00442ed1d460 Merge remote-tracking branch 'apache/trunk' into MAHOUT-817
3fea9bd981043e268dd003d4c6c3943bb570c0f7 added test, bug fixes
2725c1061c167126238d288039f0f68baafa7dc8 adding --pca and --pcaOffset options, minor fixes
48c7b425241afff42ce52d3bb005a87aeb68386d fixing front end to factor in the median data.
4e072615ac2b8a256d037aaf00db21820abb91e2 tweaking B' job to produce necessary correctors s_q and s_b
b10fefd8d4aa5a0ed2f60902904d551afbbdf57e cosmetic fixes
849171d3af75117a2ee1115e6d5fc8e4a1fff5ce comment
6c196ea9606b3ca05d401fa1474ee9262a6c0303 retrofitting V job to do pca correction
e6fbe7cdb606698f180127302c33d30fffc6c4d7 adding pca options to Q,ABt jobs. still need to work on B'-job, V-job and front-end pca corrections.
ecf5dd21c5d5805d70715a78abd07246d171536c Computing s_b0
b9b33cf72af85ade16fcfbf4e13a036877489afb comments
9bb6e971c68e0674b087b8c5d64f4967878f1834 More cleanup in favor of standard functions, unit tests pass but need to verify the 2G benchmark.
39faa70158b52e50d31aca2abc4006874a9ea8fd cleanup I
780b291eb902e0e832d41748d45bf6d2163f9537 cosmetic changes, adding api with out redundant parameters
02daf0024489305032320c578ac546c16bda31c1 current MAHOUT-923 patch from Raphael


This addresses bug MAHOUT-817.
    https://issues.apache.org/jira/browse/MAHOUT-817


Diffs (updated)
-----

  core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 3e0dd5e 
  core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRE-CREATION 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/ABtDenseOutJob.java c52fe2a 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java 0c3a996 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/Omega.java 0fa8707 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/PartialRowEmitter.java 59bdedb 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/QJob.java 703c420 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCli.java d314186 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java PRE-CREATION 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototype.java 98c8c59 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDSolver.java b1a8b56 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/UJob.java 53f26f4 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/VJob.java d58789e 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/YtYJob.java bd8c6b1 
  core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 0ef8622 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDPCADenseTest.java PRE-CREATION 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverDenseTest.java 59f79c5 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverSparseSequentialTest.java beb0102 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCommonTest.java PRE-CREATION 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototypeTest.java 503433f 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDTestsHelper.java 32342c1 

Diff: https://reviews.apache.org/r/3863/diff


Testing
-------

Additional unit tests for PCA


Thanks,

Dmitriy


                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.7
>
>         Attachments: MAHOUT-817.patch, MAHOUT-817.patch, MAHOUT-817.patch, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109754#comment-13109754 ] 

Ted Dunning commented on MAHOUT-817:
------------------------------------

1 & 2 sound comprehensive to me.  Option 1 (subtracting the mean from B) seems like a great approach except that it seems to be focused on column or global subtraction of means.  If you want to subtract row means then working on Y might be applicable.  As you say, this requires a bit of thinking.
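
One possible reading of "working on Y" for row means, sketched in plain Python. It assumes Y = A*Omega and that row-mean subtraction means A - mu*1', with mu the vector of per-row means; then Y for the centered input equals A*Omega - mu*(1'Omega), so only the column sums of Omega are needed. The names are hypothetical and this is an illustration of the algebra, not the approach adopted in the patch.

```python
import random

def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def y_brute_force(a, omega):
    """Y = (A - mu 1') Omega, with each row of A explicitly mean-subtracted."""
    mu = [sum(row) / len(row) for row in a]
    centered = [[v - r for v in row] for row, r in zip(a, mu)]
    return matmul(centered, omega)

def y_corrected(a, omega):
    """Y = A Omega - mu (1' Omega), computed without centering A."""
    mu = [sum(row) / len(row) for row in a]
    omega_sums = [sum(col) for col in zip(*omega)]  # 1' Omega
    y = matmul(a, omega)
    return [[y[i][j] - mu[i] * omega_sums[j] for j in range(len(omega_sums))]
            for i in range(len(mu))]

rng = random.Random(2)
a = [[rng.random() for _ in range(5)] for _ in range(4)]                  # 4 x 5 input
omega = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(5)]    # 5 x 3
gap = max(abs(x - y)
          for r1, r2 in zip(y_brute_force(a, omega), y_corrected(a, omega))
          for x, y in zip(r1, r2))
print(gap < 1e-12)
```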

> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.6
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

Re: [jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by Isabel Drost <is...@apache.org>.
On 29.11.2011 Grant Ingersoll wrote:
> > The cost of this is that it requires a bit more coordination, but in the
> > long term I think it will lead to better quality code and a more
> > successful project for everyone.
> 
> I think what you will find is that if you start putting up some patches, we
> should be able to kick in and help.   Pick an area of interest and start
> small. Write good tests and be persistent.

Judging from how people got involved previously it also helps to select an area 
where you personally have a need for Mahout, use the project and start surfacing 
and help fixing any issues you run into.

Isabel

Re: [jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
If you are looking for areas where new algorithms are missing, I think Ted
recently published a list of sought-after things.

I myself would very much like to see SVM things done at scale. Another
feature request from me is hierarchical clustering, if you like a new
challenge.

The challenge for any new algorithm is to figure out a parallelization
technique, which is usually not part of previously published material. So
if you prefer ground-up work, be prepared to do some of that heavy
lifting first and have it reviewed.

What we found out (ok, let me speak for myself: what I found out) is that
the best strategies are created in collaboration and review. Having new,
unique angles on the problem from different people creates the most balanced
solution, although the actual coding JIRA task is usually driven by a
single contributor, subject to a subsequent review. I usually try to get
some assurance that I am heading in the right direction first.

E.g. 817 would benefit greatly if somebody could double check my math (in
particular, median propagation under BB' part) and perhaps even simulate it
in matlab (something that you seem to be very skillful at). You already
helped me a lot with your simulation of m Omega.

Thanks.
 On Nov 28, 2011 2:26 PM, "Raphael Cendrillon" <ce...@gmail.com>
wrote:

> Hi Ted,
>
> I think the difficulty I have is in identifying areas to contribute that
> the community will find useful.
>
> If I understand correctly at this stage the major algorithms are in place
> and the focus is on polishing the existing code rather than adding large
> amounts of new functionality.
>
> With this in mind it seems the best thing to do is find an existing module
> to work on. So I'm wondering if there are any existing module maintainers
> that wouldn't mind taking someone new under their wing?
>
> The cost of this is that it requires a bit more coordination, but in the
> long term I think it will lead to better quality code and a more successful
> project for everyone.
>
> On Nov 28, 2011, at 2:02 PM, Ted Dunning <te...@gmail.com> wrote:
>
> > What are you finding difficult?
> >
> > Us?
> >
> > The process?
> >
> > The concept?
> >
> > What can we do to make this easier?
> >
> > On Mon, Nov 28, 2011 at 1:57 PM, Raphael Cendrillon <
> > cendrillon1978@gmail.com> wrote:
> >
> >> I would like to get involved in contributing to the code, although I'm
> >> finding this quite difficult.
> >>
>

Re: [jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 28, 2011, at 2:25 PM, Raphael Cendrillon wrote:

> Hi Ted,
> 
> I think the difficulty I have is in identifying areas to contribute that the community will find useful. 
> 
> If I understand correctly at this stage the major algorithms are in place and the focus is on polishing the existing code rather than adding large amounts of new functionality.
> 
> With this in mind it seems the best thing to do is find an existing module to work on. So I'm wondering if there are any existing module maintainers that wouldn't mind taking someone new under their wing?

I've started to label things as MAHOUT_INTRO_CONTRIBUTE.  Hopefully others have too.  See also https://cwiki.apache.org/MAHOUT/how-to-contribute.html.  That is probably the best way to get started.



> 
> The cost of this is that it requires a bit more coordination, but in the long term I think it will lead to better quality code and a more successful project for everyone. 

I think what you will find is that if you start putting up some patches, we should be able to kick in and help.   Pick an area of interest and start small. Write good tests and be persistent.


> 
> On Nov 28, 2011, at 2:02 PM, Ted Dunning <te...@gmail.com> wrote:
> 
>> What are you finding difficult?
>> 
>> Us?
>> 
>> The process?
>> 
>> The concept?
>> 
>> What can we do to make this easier?
>> 
>> On Mon, Nov 28, 2011 at 1:57 PM, Raphael Cendrillon <
>> cendrillon1978@gmail.com> wrote:
>> 
>>> I would like to get involved in contributing to the code, although I'm
>>> finding this quite difficult.
>>> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com




Re: [jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by Raphael Cendrillon <ce...@gmail.com>.
Hi Ted,

I think the difficulty I have is in identifying areas to contribute that the community will find useful. 

If I understand correctly at this stage the major algorithms are in place and the focus is on polishing the existing code rather than adding large amounts of new functionality.

With this in mind it seems the best thing to do is find an existing module to work on. So I'm wondering if there are any existing module maintainers that wouldn't mind taking someone new under their wing?

The cost of this is that it requires a bit more coordination, but in the long term I think it will lead to better quality code and a more successful project for everyone. 

On Nov 28, 2011, at 2:02 PM, Ted Dunning <te...@gmail.com> wrote:

> What are you finding difficult?
> 
> Us?
> 
> The process?
> 
> The concept?
> 
> What can we do to make this easier?
> 
> On Mon, Nov 28, 2011 at 1:57 PM, Raphael Cendrillon <
> cendrillon1978@gmail.com> wrote:
> 
>> I would like to get involved in contributing to the code, although I'm
>> finding this quite difficult.
>> 

Re: [jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by Ted Dunning <te...@gmail.com>.
What are you finding difficult?

Us?

The process?

The concept?

What can we do to make this easier?

On Mon, Nov 28, 2011 at 1:57 PM, Raphael Cendrillon <
cendrillon1978@gmail.com> wrote:

> I would like to get involved in contributing to the code, although I'm
> finding this quite difficult.
>

Re: [jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by Raphael Cendrillon <ce...@gmail.com>.
Sure, I'm happy to help in whatever way I can.

I would like to get involved in contributing to the code, although I'm finding this quite difficult.

On 28 Nov, 2011, at 1:35 PM, Dmitriy Lyubimov wrote:

> In any event i hope you could review stuff going on there. There are
> problems that need answers.
> 
> On Mon, Nov 28, 2011 at 12:50 PM, Raphael Cendrillon <
> cendrillon1978@gmail.com> wrote:
> 
>> Thanks Dmitriy. I certainly understand.
>> 
>> Perhaps I can find some other areas to contribute.
>> 
>> On 28 Nov, 2011, at 12:37 PM, Dmitriy Lyubimov wrote:
>> 
>>> I think it is certainly ok for you to try and your thoughts are even more
>>> appreciated because optimization of this stuff for big data that is also
>>> accurate seems to take more than one head to review.
>>> 
>>> However, I've already planned on doing 817 in the next two months and
>>> finishing it in Q1 if I can work out existing issues.
>>> The existing issues are both flow and performance and IMO require a tad
>>> more contemplation w.r.t. existing flow peculiarities before a reliable
>>> flow could be figured out.
>>> On top of it, at this point I am primary maintainer of SSVD code and I
>> think
>>> you should know that introducing modifications which at this point seem
>>> fairly sizable may make it more difficult for me to maintain it --
>>> especially given we haven't considered effect on existing power
>> iterations
>>> yet and future issue of introducing Cholesky option (there's a  pending
>>> issue for that as well). But I think you can catalyze that process, you
>>> already did a lot.
>>> 
>>> 
>>> On Mon, Nov 28, 2011 at 12:32 AM, Raphael Cendrillon <
>>> cendrillon1978@gmail.com> wrote:
>>> 
>>>> Hi Dmitriy,
>>>> 
>>>> If it's OK with you I'd like to try implementing this decoration.
>>>> 
>>>> Any advice or guidance would be very much appreciated.
>>>> 
>>>> Raphael.
>>>> 
>>>> On 27 Nov, 2011, at 9:23 AM, Dmitriy Lyubimov (Commented) (JIRA) wrote:
>>>> 
>>>>> Dmitriy Lyubimov commented on MAHOUT-817:
>>>>> -----------------------------------------
>>>>> 
>>>>> For the column mean, the brute-force approach is probably the simplest:
>>>>> we'd have to decorate the input of A with mean subtraction.
>>>>> 
>>>>>> Add PCA options to SSVD code
>>>>>> ----------------------------
>>>>>> 
>>>>>>              Key: MAHOUT-817
>>>>>>              URL: https://issues.apache.org/jira/browse/MAHOUT-817
>>>>>>          Project: Mahout
>>>>>>       Issue Type: New Feature
>>>>>> Affects Versions: 0.6
>>>>>>         Reporter: Dmitriy Lyubimov
>>>>>>         Assignee: Dmitriy Lyubimov
>>>>>>          Fix For: Backlog
>>>>>> 
>>>>>> 
>>>>>> It seems that a simple solution should exist to integrate PCA mean
>>>> subtraction into SSVD algorithm without making it a pre-requisite step
>> and
>>>> also avoiding densifying the big input.
>>>>>> Several approaches were suggested:
>>>>>> 1) subtract mean off B
>>>>>> 2) propagate mean vector deeper into algorithm algebraically where the
>>>> data is already collapsed to smaller matrices
>>>>>> 3) --?
>>>>>> It needs some math done first . I'll take a stab at 1 and 2 but
>>>> thoughts and math are welcome.
>>>>> 
>>>>> --
>>>>> This message is automatically generated by JIRA.
>>>>> If you think it was sent incorrectly, please contact your JIRA
>>>> administrators:
>>>> 
>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>>>> For more information on JIRA, see:
>>>> http://www.atlassian.com/software/jira
>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
>> 


Re: [jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
In any event, I hope you can review the stuff going on there. There are
problems that need answers.

On Mon, Nov 28, 2011 at 12:50 PM, Raphael Cendrillon <
cendrillon1978@gmail.com> wrote:

> Thanks Dmitriy. I certainly understand.
>
> Perhaps I can find some other areas to contribute.
>
> On 28 Nov, 2011, at 12:37 PM, Dmitriy Lyubimov wrote:
>
> > I think it is certainly ok for you to try and your thoughts are even more
> > appreciated because optimization of this stuff for big data that is also
> > accurate seems to take more than one head to review.
> >
> > However, I've already planned on doing 817 in the next two months and
> > finishing it in Q1 if I can work out existing issues.
> > The existing issues are both flow and performance and IMO require a tad
> > more contemplation w.r.t. existing flow peculiarities before a reliable
> > flow could be figured out.
> > On top of it, at this point I am primary maintainer of SSVD code and I
> think
> > you should know that introducing modifications which at this point seem
> > fairly sizable may make it more difficult for me to maintain it --
> > especially given we haven't considered effect on existing power
> iterations
> > yet and future issue of introducing Cholesky option (there's a  pending
> > issue for that as well). But I think you can catalyze that process, you
> > already did a lot.
> >
> >
> > On Mon, Nov 28, 2011 at 12:32 AM, Raphael Cendrillon <
> > cendrillon1978@gmail.com> wrote:
> >
> >> Hi Dmitriy,
> >>
> >> If it's OK with you I'd like to try implementing this decoration.
> >>
> >> Any advice or guidance would be very much appreciated.
> >>
> >> Raphael.
> >>
> >> On 27 Nov, 2011, at 9:23 AM, Dmitriy Lyubimov (Commented) (JIRA) wrote:
> >>
> >>> Dmitriy Lyubimov commented on MAHOUT-817:
> >>> -----------------------------------------
> >>>
> >>> For the column mean, the brute-force approach is probably the simplest:
> >>> we'd have to decorate the input of A with mean subtraction.
> >>>
> >>>> Add PCA options to SSVD code
> >>>> ----------------------------
> >>>>
> >>>>               Key: MAHOUT-817
> >>>>               URL: https://issues.apache.org/jira/browse/MAHOUT-817
> >>>>           Project: Mahout
> >>>>        Issue Type: New Feature
> >>>>  Affects Versions: 0.6
> >>>>          Reporter: Dmitriy Lyubimov
> >>>>          Assignee: Dmitriy Lyubimov
> >>>>           Fix For: Backlog
> >>>>
> >>>>
> >>>> It seems that a simple solution should exist to integrate PCA mean
> >> subtraction into SSVD algorithm without making it a pre-requisite step
> and
> >> also avoiding densifying the big input.
> >>>> Several approaches were suggested:
> >>>> 1) subtract mean off B
> >>>> 2) propagate mean vector deeper into algorithm algebraically where the
> >> data is already collapsed to smaller matrices
> >>>> 3) --?
> >>>> It needs some math done first . I'll take a stab at 1 and 2 but
> >> thoughts and math are welcome.
> >>>
> >>> --
> >>> This message is automatically generated by JIRA.
> >>> If you think it was sent incorrectly, please contact your JIRA
> >> administrators:
> >>
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> >>> For more information on JIRA, see:
> >> http://www.atlassian.com/software/jira
> >>>
> >>>
> >>
> >>
>
>

Re: [jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by Raphael Cendrillon <ce...@gmail.com>.
Thanks Dmitriy. I certainly understand.

Perhaps I can find some other areas to contribute.

On 28 Nov, 2011, at 12:37 PM, Dmitriy Lyubimov wrote:

> I think it is certainly ok for you to try and your thoughts are even more
> appreciated because optimization of this stuff for big data that is also
> accurate seems to take more than one head to review.
> 
> However, I've already planned on doing 817 in the next two months and
> finishing it in Q1 if I can work out existing issues.
> The existing issues are both flow and performance and IMO require a tad
> more contemplation w.r.t. existing flow peculiarities before a reliable
> flow could be figured out.
> On top of it, at this point I am primary maintainer of SSVD code and I think
> you should know that introducing modifications which at this point seem
> fairly sizable may make it more difficult for me to maintain it --
> especially given we haven't considered effect on existing power iterations
> yet and future issue of introducing Cholesky option (there's a  pending
> issue for that as well). But I think you can catalyze that process, you
> already did a lot.
> 
> 
> On Mon, Nov 28, 2011 at 12:32 AM, Raphael Cendrillon <
> cendrillon1978@gmail.com> wrote:
> 
>> Hi Dmitriy,
>> 
>> If it's OK with you I'd like to try implementing this decoration.
>> 
>> Any advice or guidance would be very much appreciated.
>> 
>> Raphael.
>> 
>> On 27 Nov, 2011, at 9:23 AM, Dmitriy Lyubimov (Commented) (JIRA) wrote:
>> 
>>> Dmitriy Lyubimov commented on MAHOUT-817:
>>> -----------------------------------------
>>> 
>>> For the column mean, the brute-force approach is probably the simplest:
>>> we'd have to decorate the input of A with mean subtraction.
>>> 
>>>> Add PCA options to SSVD code
>>>> ----------------------------
>>>> 
>>>>               Key: MAHOUT-817
>>>>               URL: https://issues.apache.org/jira/browse/MAHOUT-817
>>>>           Project: Mahout
>>>>        Issue Type: New Feature
>>>>  Affects Versions: 0.6
>>>>          Reporter: Dmitriy Lyubimov
>>>>          Assignee: Dmitriy Lyubimov
>>>>           Fix For: Backlog
>>>> 
>>>> 
>>>> It seems that a simple solution should exist to integrate PCA mean
>> subtraction into SSVD algorithm without making it a pre-requisite step and
>> also avoiding densifying the big input.
>>>> Several approaches were suggested:
>>>> 1) subtract mean off B
>>>> 2) propagate mean vector deeper into algorithm algebraically where the
>> data is already collapsed to smaller matrices
>>>> 3) --?
>>>> It needs some math done first . I'll take a stab at 1 and 2 but
>> thoughts and math are welcome.
>>> 
>>> --
>>> This message is automatically generated by JIRA.
>>> If you think it was sent incorrectly, please contact your JIRA
>> administrators:
>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>> For more information on JIRA, see:
>> http://www.atlassian.com/software/jira
>>> 
>>> 
>> 
>> 


Re: [jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PS: I think it is better if you reply in JIRA rather than on the email
broadcast of it, since I don't monitor it and miss your posts.

On Mon, Nov 28, 2011 at 12:37 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> I think it is certainly ok for you to try and your thoughts are even more
> appreciated because optimization of this stuff for big data that is also
> accurate seems to take more than one head to review.
>
> However, I've already planned on doing 817 in the next two months and
> finishing it in Q1 if I can work out existing issues.
> The existing issues are both flow and performance and IMO require a tad
> more contemplation w.r.t. existing flow peculiarities before a reliable
> flow could be figured out.
> On top of it, at this point I am primary maintainer of SSVD code and I
> think you should know that introducing modifications which at this point
> seem fairly sizable may make it more difficult for me to maintain it --
> especially given we haven't considered effect on existing power iterations
> yet and future issue of introducing Cholesky option (there's a  pending
> issue for that as well). But I think you can catalyze that process, you
> already did a lot.
>
>
> On Mon, Nov 28, 2011 at 12:32 AM, Raphael Cendrillon <
> cendrillon1978@gmail.com> wrote:
>
>> Hi Dmitriy,
>>
>> If it's OK with you I'd like to try implementing this decoration.
>>
>> Any advice or guidance would be very much appreciated.
>>
>> Raphael.
>>
>> On 27 Nov, 2011, at 9:23 AM, Dmitriy Lyubimov (Commented) (JIRA) wrote:
>>
>> > Dmitriy Lyubimov commented on MAHOUT-817:
>> > -----------------------------------------
>> >
>> > For the column mean, the brute-force approach is probably the simplest:
>> > we'd have to decorate the input of A with mean subtraction.
>> >
>> >> Add PCA options to SSVD code
>> >> ----------------------------
>> >>
>> >>                Key: MAHOUT-817
>> >>                URL: https://issues.apache.org/jira/browse/MAHOUT-817
>> >>            Project: Mahout
>> >>         Issue Type: New Feature
>> >>   Affects Versions: 0.6
>> >>           Reporter: Dmitriy Lyubimov
>> >>           Assignee: Dmitriy Lyubimov
>> >>            Fix For: Backlog
>> >>
>> >>
>> >> It seems that a simple solution should exist to integrate PCA mean
>> subtraction into SSVD algorithm without making it a pre-requisite step and
>> also avoiding densifying the big input.
>> >> Several approaches were suggested:
>> >> 1) subtract mean off B
>> >> 2) propagate mean vector deeper into algorithm algebraically where the
>> data is already collapsed to smaller matrices
>> >> 3) --?
>> >> It needs some math done first . I'll take a stab at 1 and 2 but
>> thoughts and math are welcome.
>> >
>> > --
>> > This message is automatically generated by JIRA.
>> > If you think it was sent incorrectly, please contact your JIRA
>> administrators:
>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>> > For more information on JIRA, see:
>> http://www.atlassian.com/software/jira
>> >
>> >
>>
>>
>

Re: [jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I think it is certainly OK for you to try, and your thoughts are even more
appreciated because optimizing this stuff for big data while keeping it
accurate seems to take more than one head to review.

However, I've already planned on doing 817 in the next two months and
finishing it in Q1 if I can work out the existing issues.
The existing issues are both flow and performance and IMO require a tad
more contemplation w.r.t. existing flow peculiarities before a reliable
flow can be figured out.
On top of that, at this point I am the primary maintainer of the SSVD code,
and I think you should know that introducing modifications which at this
point seem fairly sizable may make it more difficult for me to maintain it --
especially given we haven't yet considered the effect on existing power
iterations or the future issue of introducing a Cholesky option (there's a
pending issue for that as well). But I think you can catalyze that process;
you already did a lot.


On Mon, Nov 28, 2011 at 12:32 AM, Raphael Cendrillon <
cendrillon1978@gmail.com> wrote:

> Hi Dmitriy,
>
> If it's OK with you I'd like to try implementing this decoration.
>
> Any advice or guidance would be very much appreciated.
>
> Raphael.
>
> On 27 Nov, 2011, at 9:23 AM, Dmitriy Lyubimov (Commented) (JIRA) wrote:
>
> > Dmitriy Lyubimov commented on MAHOUT-817:
> > -----------------------------------------
> >
> > For the column mean, the brute-force approach is probably the simplest:
> > we'd have to decorate the input of A with mean subtraction.
> >
> >> Add PCA options to SSVD code
> >> ----------------------------
> >>
> >>                Key: MAHOUT-817
> >>                URL: https://issues.apache.org/jira/browse/MAHOUT-817
> >>            Project: Mahout
> >>         Issue Type: New Feature
> >>   Affects Versions: 0.6
> >>           Reporter: Dmitriy Lyubimov
> >>           Assignee: Dmitriy Lyubimov
> >>            Fix For: Backlog
> >>
> >>
> >> It seems that a simple solution should exist to integrate PCA mean
> subtraction into SSVD algorithm without making it a pre-requisite step and
> also avoiding densifying the big input.
> >> Several approaches were suggested:
> >> 1) subtract mean off B
> >> 2) propagate mean vector deeper into algorithm algebraically where the
> data is already collapsed to smaller matrices
> >> 3) --?
> >> It needs some math done first . I'll take a stab at 1 and 2 but
> thoughts and math are welcome.
> >
> > --
> > This message is automatically generated by JIRA.
> > If you think it was sent incorrectly, please contact your JIRA
> administrators:
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> > For more information on JIRA, see:
> http://www.atlassian.com/software/jira
> >
> >
>
>

Re: [jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by Raphael Cendrillon <ce...@gmail.com>.
Hi Dmitriy,

If it's OK with you I'd like to try implementing this decoration.

Any advice or guidance would be very much appreciated.

Raphael.

On 27 Nov, 2011, at 9:23 AM, Dmitriy Lyubimov (Commented) (JIRA) wrote:

> Dmitriy Lyubimov commented on MAHOUT-817:
> -----------------------------------------
> 
> For the column mean, the brute-force approach is probably the simplest: we'd have to decorate the input of A with mean subtraction.
> 
>> Add PCA options to SSVD code
>> ----------------------------
>> 
>>                Key: MAHOUT-817
>>                URL: https://issues.apache.org/jira/browse/MAHOUT-817
>>            Project: Mahout
>>         Issue Type: New Feature
>>   Affects Versions: 0.6
>>           Reporter: Dmitriy Lyubimov
>>           Assignee: Dmitriy Lyubimov
>>            Fix For: Backlog
>> 
>> 
>> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
>> Several approaches were suggested:
>> 1) subtract mean off B
>> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
>> 3) --?
>> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.
> 
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
> 
> 


[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157938#comment-13157938 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

For the column mean, the brute-force approach is probably the simplest: we'd have to decorate the input of A with mean subtraction.
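
As a rough illustration of what decorating the input of A with mean subtraction could look like, here is a minimal sketch in plain Python. It is not the Mahout implementation: the sparse row format (a dict from column index to value) and the names `column_means` and `mean_subtracted` are assumptions made up for this example. It spends one extra pass on the column means and then wraps the row stream, so only one dense row exists at a time.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical sparse row format: column index -> nonzero value.
SparseRow = Dict[int, float]


def column_means(rows: Iterable[SparseRow], n_cols: int) -> List[float]:
    """One extra pass over A: accumulate column sums, divide by the row count."""
    sums = [0.0] * n_cols
    m = 0
    for row in rows:
        m += 1
        for j, v in row.items():
            sums[j] += v
    return [s / m for s in sums] if m else sums


def mean_subtracted(rows: Iterable[SparseRow],
                    means: List[float]) -> Iterator[List[float]]:
    """Decorate the row stream: yield each row of A minus the column means.

    Each yielded row is dense, but A as a whole is never materialized densely.
    """
    for row in rows:
        dense = [-mu for mu in means]
        for j, v in row.items():
            dense[j] += v
        yield dense


# Tiny 3 x 3 sparse input, for illustration only.
A = [{0: 2.0}, {1: 4.0}, {0: 4.0, 2: 6.0}]
xi = column_means(A, 3)
centered = list(mean_subtracted(A, xi))
```

The cost of this brute-force form is that every downstream consumer sees dense rows, which is the densification concern raised in the issue description.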
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Status: Patch Available  (was: In Progress)
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.7
>
>         Attachments: MAHOUT-817-RC1.patch, MAHOUT-817.patch, MAHOUT-817.patch, MAHOUT-817.patch, SSVD-CLI.pdf, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: ssvd.R
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment:     (was: ssvd.R)
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment:     (was: ssvd.R)
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: ssvd-tests.R
                ssvd.R

Updated R code to match working notes more closely.
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Fix Version/s:     (was: Backlog)
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.7
>
>         Attachments: MAHOUT-817.patch, MAHOUT-817.patch, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


[jira] [Issue Comment Edited] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158210#comment-13158210 ] 

Dmitriy Lyubimov edited comment on MAHOUT-817 at 11/28/11 6:32 AM:
-------------------------------------------------------------------

Yes, the expectation is zero, but unfortunately I think the variance is going to be big regardless of the input size. So the m Omega term is still a problem. For my problems, its brute-force computation will actually take more than, e.g., squaring my input. So it was a first thought, but I don't think it is valid enough, and I withdraw this for now.

But we may not have a choice for the big data, though. And then again, there's a connection with power iterations. The basis doesn't have to be perfect, and in practice it never is, but power iterations improve it a lot. The power iterations flow is here: https://github.com/dlyubimov/mahout-commits/blob/ssvd-docs/Power%20Iterations.pdf?raw=true. Now the question is whether this assumption is going to render the power iteration flow useless.


                
      was (Author: dlyubimov):
    Yes expectatiin is zero but variance is going to be big regardless of the input *size I think unfortunately. So m Omega term is still a problem. For my problems itsnbrute force computation will actually take more than e.g. squaringn my input. So it was first thought but I don't think it is valid enough. So I withdraw this for now.

But we may not have a choice for the big data though. And then again there's a connection with power iterations. The basis doesn't have to be perfect  and in practice it never is, but power iterations improve it a lot. Power iterations flow is here: https://github.com/dlyubimov/mahout-commits/blob/ssvd-docs/Power%20Iterations.pdf?raw=true

                  
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113007#comment-13113007 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

Why would we want to support both row and column mean subtraction? I need to re-read the motivation for this.

I think a lot also rests on the question of whether we actually want to _output_ the mean as well.

The next question is whether we want to spend one additional pass just to find the mean. If yes, then the rest is easy: we will just be doing mean subtraction as part of the Y computation, which should be OK flops-wise.

But if we think we shouldn't be waiting for the mean computation as a separate pass, and we don't want to output it either, then that's where it becomes a little tricky.


> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.6
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Raphael Cendrillon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158010#comment-13158010 ] 

Raphael Cendrillon commented on MAHOUT-817:
-------------------------------------------

Could you expand on this a little?

If I understand correctly we need to implicitly do mean-subtraction of A whenever we work with B.
It seems this is equivalent to subtracting qs'*a_mean from B, where qs is the sum of the rows of Q
and a_mean is the mean of the rows of A. So if bi is the ith column of B then the column with
implicit mean-subtraction of A is

  bi - qs'*a_mean(i)

where a_mean(i) is the ith element of a_mean.

It seems there are two jobs that need to be modified: BBT-job and V-job. Since they both work column wise it should
be straightforward to pass in the vector qs and the scalar a_mean(i).
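
The correction described above can be checked on a toy example. This is a minimal sketch in plain Python, not Mahout code: the function names are hypothetical, Q and A are small made-up matrices, and a_mean is assumed to be the mean of the rows of A (one value per column). It compares the corrected B against the B obtained by densifying the mean-subtracted A first.

```python
def matmul(X, Y):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def mean_row(A):
    """Mean of the rows of A, i.e. one mean per column."""
    return [sum(col) / len(A) for col in zip(*A)]


def corrected_b(Q, A):
    """B for the mean-subtracted A, derived from the uncentered B = Q'A."""
    qs = [sum(col) for col in zip(*Q)]   # sum of the rows of Q (length k)
    a_mean = mean_row(A)                 # length n
    B = matmul(transpose(Q), A)          # k x n
    # Column i of the result is b_i - qs' * a_mean(i).
    return [[B[r][i] - qs[r] * a_mean[i] for i in range(len(a_mean))]
            for r in range(len(qs))]


def dense_b(Q, A):
    """Reference: subtract the mean from A explicitly, then form Q'A."""
    a_mean = mean_row(A)
    centered = [[a - mu for a, mu in zip(row, a_mean)] for row in A]
    return matmul(transpose(Q), centered)


# Toy data: Q has orthonormal columns, A is a small sparse-looking input.
Q = [[0.6, 0.0],
     [0.0, 1.0],
     [0.8, 0.0]]
A = [[1.0, 0.0, 0.0, 2.0],
     [0.0, 3.0, 0.0, 0.0],
     [4.0, 0.0, 0.0, 0.0]]

B_fast = corrected_b(Q, A)
B_dense = dense_b(Q, A)
```

The identity holds for any Q, so the sketch says nothing about whether Q computed from the uncentered A is still a good basis; that question is separate.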

One question: is it necessary to do mean-subtraction of A before computing the QR decomposition, or will the columns of Q still
form a good basis even without mean-subtraction?

Could you explain what the 'column mean' is? I thought that each data point corresponds to a row in A, so that subtraction of row means
would be more appropriate?




                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157933#comment-13157933 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

I don't think we want to have an explicit step to compute either the Y or the B means.

We can construct them and even output them on the fly, albeit in a blocked form.

But we probably do need the A means in the final output to enable back and forward fold-ins of new items, right?
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.7
>
>         Attachments: MAHOUT-817-RC1.patch, MAHOUT-817.patch, MAHOUT-817.patch, MAHOUT-817.patch, SSVD-CLI.pdf, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158921#comment-13158921 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

bq. Another problem I identified with the scheme is that Q is produced in blocks, so the entire row-sum vector is not available at the point of the B' and BB' computation. There's one more step further in this.

OK, I think I see how to fix the BB' computation as well as the power iterations.

One issue still remains as far as the estimate of the m*Omega term is concerned. See attached.

I am posting a first stab at bringing all the ideas together; please review. It doesn't contain the detailed modification plan though, just the algebra.
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Issue Comment Edited] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158196#comment-13158196 ] 

Dmitriy Lyubimov edited comment on MAHOUT-817 at 11/28/11 5:55 AM:
-------------------------------------------------------------------

Computation of m*Omega may also be fairly involved: even though it is a vector-matrix multiplication, Omega is dense and bigger than the input, although we don't have to move it around. Maybe for big inputs we can just take the mathematical expectation of this term. For the uniform distribution on (-1, 1) generated by murmur (is it uniform?) that is currently used, we can perhaps ignore the whole m*Omega term because it converges to 0 by the law of large numbers.
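
One way to check this suggestion is to look at the variance of the term as well as its mean. A short worked calculation, under the assumption (not confirmed in this thread) that the entries of Omega are independent and exactly uniform on [-1, 1], with m a mean vector of length n:

```latex
\begin{aligned}
(m^\top \Omega)_j &= \sum_{i=1}^{n} m_i \,\Omega_{ij},
  \qquad \Omega_{ij} \sim \mathcal{U}(-1, 1) \ \text{i.i.d.} \\
\mathbb{E}\big[(m^\top \Omega)_j\big] &= \sum_{i=1}^{n} m_i \,\mathbb{E}[\Omega_{ij}] = 0 \\
\operatorname{Var}\big[(m^\top \Omega)_j\big] &= \sum_{i=1}^{n} m_i^2 \operatorname{Var}[\Omega_{ij}]
  = \frac{1}{3}\,\lVert m \rVert_2^2
\end{aligned}
```

Under that assumption each entry of m*Omega has zero mean but a typical magnitude of about ||m|| / sqrt(3). The sum carries no 1/n averaging, so the law of large numbers alone does not drive the term to zero.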
                
      was (Author: dlyubimov):
    Computation of m*omega also may be fairly involved because even that it is vector matrix multiplication, Omega is dense, bigger than input, even though we don't have to move its input around. Maybe for big inputs we can just take a math expectation in of this. For the uniform distribution of murmur(is it uniform?) we perhaps can ignore the whole m x Omega because it converges on 0 per law of big numbers.
                  
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: SSVD-PCA options.pdf
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.7
>
>         Attachments: MAHOUT-817-RC1.patch, MAHOUT-817.patch, MAHOUT-817.patch, MAHOUT-817.patch, SSVD-CLI.pdf, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: MAHOUT-817.patch

Brought the patch in sync with the current post-release trunk.
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.7
>
>         Attachments: MAHOUT-817.patch, MAHOUT-817.patch, MAHOUT-817.patch, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Issue Comment Edited] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158196#comment-13158196 ] 

Dmitriy Lyubimov edited comment on MAHOUT-817 at 11/28/11 5:52 AM:
-------------------------------------------------------------------

Computation of m*Omega may also be fairly involved: even though it is a vector-matrix multiplication, Omega is dense and bigger than the input, although we don't have to move it around. Maybe for big inputs we can just take the mathematical expectation of this term. For the uniform distribution generated by murmur (is it uniform?), we can perhaps ignore the whole m*Omega term because it converges to 0 by the law of large numbers.
                
      was (Author: dlyubimov):
    Computation of m*omega also may be fairly involved because even that is vector matrix multiplication, Omega is dense, bigger than input, even though we don't have to move its input around. Maybe for big inputs we can just take a math expectation in of this. For the uniform distribution of murmur(is it uniform?) We perhaps can I ore the whole m x Omega because it converges on 0 per law of big numbers.
                  
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Issue Comment Edited] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158196#comment-13158196 ] 

Dmitriy Lyubimov edited comment on MAHOUT-817 at 11/28/11 5:53 AM:
-------------------------------------------------------------------

Computation of m*Omega may also be fairly involved: even though it is a vector-matrix multiplication, Omega is dense and bigger than the input, although we don't have to move it around. Maybe for big inputs we can just take the mathematical expectation of this term. For the uniform distribution generated by murmur (is it uniform?), we can perhaps ignore the whole m*Omega term because it converges to 0 by the law of large numbers.
                
      was (Author: dlyubimov):
    Computation of m*omega also may be fairly involved because even that is vector matrix multiplication, Omega is dense, bigger than input, even though we don't have to move its input around. Maybe for big inputs we can just take a math expectation in of this. For the uniform distribution of murmur(is it uniform?) we perhaps can ignore the whole m x Omega because it converges on 0 per law of big numbers.
                  
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: SSVD-PCA options.pdf

Actually, propagating the mean through the power iterations is not quite finished yet. I will finish it a bit later.
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, SSVD-PCA options.pdf, SSVD-PCA options.pdf, SSVD-PCA options.pdf, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Fix Version/s:     (was: 0.6)

Removed from the 0.6 roadmap per the conversation on the list.
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment:     (was: SSVD-PCA options.pdf)
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158157#comment-13158157 ] 

Ted Dunning commented on MAHOUT-817:
------------------------------------

For the SSVD and PCA, what I had in mind was that forming an offset Y was easy if you have the row means because you can compute

Y = (A - m) \Omega = A \Omega - m \Omega

That is, each row of Y can be adjusted on the fly as it is computed.  The computation of Q in the next step will be unchanged, but the definition of B must include the mean subtraction as well:

B = Q' (A - m) = Q' A - Q' m

Other than this, the actual decomposition should be nearly good to go.
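
The Y identity above can be checked on a toy matrix. This is a minimal sketch in plain Python rather than Mahout code: the function names are hypothetical, the data is made up, and m is assumed to be the mean of the rows of A, subtracted from every row. The point is that each row of A*Omega is fixed up with the same small vector m*Omega, so the dense A - m never has to be formed.

```python
def matmul(X, Y):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def mean_row(A):
    """Mean of the rows of A, i.e. one mean per column."""
    return [sum(col) / len(A) for col in zip(*A)]


def projected_centered(A, omega):
    """Y = A*Omega - m*Omega, adjusting each row of A*Omega on the fly."""
    m_omega = matmul([mean_row(A)], omega)[0]      # one k-vector shared by all rows
    return [[y - c for y, c in zip(row, m_omega)]
            for row in matmul(A, omega)]


def projected_centered_dense(A, omega):
    """Reference: densify A - m first, then multiply by Omega."""
    m = mean_row(A)
    centered = [[a - mu for a, mu in zip(row, m)] for row in A]
    return matmul(centered, omega)


# Toy data: a mostly-zero 3 x 4 input and a 4 x 2 projection matrix.
A = [[1.0, 0.0, 0.0, 2.0],
     [0.0, 3.0, 0.0, 0.0],
     [4.0, 0.0, 0.0, 0.0]]
omega = [[1.0, -1.0],
         [-1.0, 1.0],
         [1.0, 1.0],
         [-1.0, -1.0]]

Y_fast = projected_centered(A, omega)
Y_dense = projected_centered_dense(A, omega)
```

In this form the only extra state a row-wise job needs is the k-vector m*Omega, which is what makes the per-row adjustment cheap once the mean is known.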
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158176#comment-13158176 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

OK, so that's what I called the brute-force approach: assuming we somehow know the mean, just adjust the input as we go. For the column-wise mean we will know the mean right away. For the row-wise mean, which I think the majority of use cases would want, we will have to precompute it with one more pass. The good thing about it is that it would have very little shuffle and sort pressure, so it would run practically as fast as a map-only job.

I think this is a very easy change.
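
The extra pass can be sketched as per-split partial sums that are merged at the end. This is a plain Python stand-in for the map and reduce sides, not the Mahout job: the function names are hypothetical, rows are represented as sparse dicts, and the quantity computed is assumed to be the mean of the rows of A (one value per column). Each split emits a single (column sums, row count) pair, which is why so little data would need to be shuffled.

```python
def map_split(rows, n_cols):
    """Map side: fold one input split into a single (column sums, row count) pair."""
    sums = [0.0] * n_cols
    count = 0
    for row in rows:                      # each row is a sparse dict {column: value}
        for col, value in row.items():
            sums[col] += value
        count += 1
    return sums, count


def reduce_partials(partials, n_cols):
    """Reduce side: merge the per-split pairs and divide by the total row count."""
    total = [0.0] * n_cols
    rows_seen = 0
    for sums, count in partials:
        total = [t + s for t, s in zip(total, sums)]
        rows_seen += count
    return [t / rows_seen for t in total]


# Toy input: three sparse rows spread over two splits.
split_1 = [{0: 1.0, 3: 2.0}, {1: 3.0}]
split_2 = [{0: 4.0}]

partials = [map_split(split, 4) for split in (split_1, split_2)]
mean = reduce_partials(partials, 4)
```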
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163192#comment-13163192 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

I also don't see any difference for small 100x200 inputs between pci and svd on a fixed (mean-subtracted) input, even if I bypass the mean correction for Y in both B_0 and the power iterations.

Perhaps it has to do with the way I generate the input. That also may not necessarily be the case for extremely sparse inputs.

But I think the first patch could bypass the Y fix.

{code}
> respci$svalues
 [1] 9.9013440 8.9980801 7.9936265 6.9882617 5.9982148 4.9935232 3.9848657
 [8] 2.9811621 1.9891654 0.9977757
> ressvd$svalues
 [1] 9.9013440 8.9980801 7.9936265 6.9882617 5.9982148 4.9935232 3.9848657
 [8] 2.9811621 1.9891654 0.9977757
> 
{code}
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment:     (was: SSVD-PCA options.pdf)
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.7
>
>         Attachments: MAHOUT-817-RC1.patch, MAHOUT-817.patch, MAHOUT-817.patch, MAHOUT-817.patch, SSVD-CLI.pdf, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment:     (was: ssvd-tests.R)
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, ssvd.m
>
>

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: SSVD-PCA options.pdf
                SSVD-CLI.pdf
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.7
>
>         Attachments: MAHOUT-817-RC1.patch, MAHOUT-817.patch, MAHOUT-817.patch, MAHOUT-817.patch, SSVD-CLI.pdf, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Fix Version/s: Backlog
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: ssvd-tests.R

So I ran an R simulation of the column-wise mean and it seems to work, so I think this verifies the math.

I still need to finish the doc (it also has a little typo in it); I will finish it from home, as I don't seem to have the doc source on me here.

I guess this clears the way for the implementation on the existing SSVD solver.
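
For readers following the thread, the algebra behind approach 2 in the issue description can be sketched as below. This is only an illustration under stated assumptions, not the code from the attached R script or from the Mahout solver: the names `column_means`, `centered_projection`, `A` and `OMEGA` are hypothetical, rows are stored as `{column: value}` dicts, and the mean vector xi is taken to be the column-wise mean. The point is that (A - 1 xi^T) Omega equals A Omega - 1 (xi^T Omega), so the big sparse input never has to be densified.

```python
# Sketch only: all names here are illustrative, not Mahout APIs.

def column_means(sparse_rows, n_cols):
    """Mean of each column; every row is a {column_index: value} dict."""
    sums = [0.0] * n_cols
    for row in sparse_rows:
        for j, v in row.items():
            sums[j] += v
    m = len(sparse_rows)
    return [s / m for s in sums]


def sparse_times_dense(sparse_rows, omega):
    """Y = A * Omega, touching only the non-zero entries of A."""
    width = len(omega[0])
    out = []
    for row in sparse_rows:
        y = [0.0] * width
        for j, v in row.items():
            for c in range(width):
                y[c] += v * omega[j][c]
        out.append(y)
    return out


def centered_projection(sparse_rows, omega):
    """(A - 1*xi^T) * Omega computed as A*Omega - 1*(xi^T * Omega)."""
    n_cols = len(omega)
    width = len(omega[0])
    xi = column_means(sparse_rows, n_cols)
    # xi^T * Omega is one short row vector; it is subtracted from every row of Y.
    shift = [sum(xi[j] * omega[j][c] for j in range(n_cols)) for c in range(width)]
    y = sparse_times_dense(sparse_rows, omega)
    return [[y_row[c] - shift[c] for c in range(width)] for y_row in y]


def dense_reference(sparse_rows, omega):
    """Same quantity the expensive way: densify, subtract the mean, multiply."""
    n_cols = len(omega)
    width = len(omega[0])
    xi = column_means(sparse_rows, n_cols)
    out = []
    for row in sparse_rows:
        dense = [row.get(j, 0.0) - xi[j] for j in range(n_cols)]
        out.append([sum(dense[j] * omega[j][c] for j in range(n_cols))
                    for c in range(width)])
    return out


# Tiny 4x3 sparse input and a fixed 3x2 "random" projection for reproducibility.
A = [{0: 2.0}, {1: 4.0, 2: 1.0}, {}, {0: 6.0, 2: 3.0}]
OMEGA = [[1.0, -1.0], [0.5, 2.0], [-2.0, 0.25]]
fast = centered_projection(A, OMEGA)
slow = dense_reference(A, OMEGA)
```

Both paths should agree to rounding error; the sparse path only ever adds a correction of the size of Y, which is the property the issue is after.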
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>

        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Raphael Cendrillon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159041#comment-13159041 ] 

Raphael Cendrillon commented on MAHOUT-817:
-------------------------------------------

It seems to be OK in the examples I've looked at. This may be quite dependent on m, n, k, p, etc., though.
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, SSVD-PCA options.pdf, SSVD-PCA options.pdf, ssvd.m
>
>

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: MAHOUT-817.patch

First round. The unit test seems to pass, although it is debatable how off-centered the data in it is. Also put in CLI options for PCA (--pca=true, --pca-offset= location to override the default computation of row means).
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: MAHOUT-817.patch, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>

        

[jira] [Issue Comment Edited] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158210#comment-13158210 ] 

Dmitriy Lyubimov edited comment on MAHOUT-817 at 11/28/11 6:29 AM:
-------------------------------------------------------------------

Yes, the expectation is zero, but the variance is going to be big regardless of the input size, I think, unfortunately. So the m Omega term is still a problem. For my problems, its brute-force computation will actually take more than, e.g., squaring my input. So it was a first thought, but I don't think it is valid enough, so I withdraw this for now.

But we may not have a choice for big data, though. And then again, there's a connection with power iterations. The basis doesn't have to be perfect, and in practice it never is, but power iterations improve it a lot. The power iterations flow is here: https://github.com/dlyubimov/mahout-commits/blob/ssvd-docs/Power%20Iterations.pdf?raw=true
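
The remark that power iterations improve an imperfect basis can be illustrated with a deliberately tiny, one-vector sketch. It is an assumption-laden toy, not the flow from the linked PDF: `range_basis`, `A` and `OMEGA` are hypothetical names, k is fixed at 1 so plain normalization stands in for the QR step, and the input is a 2x2 diagonal matrix with singular values 3 and 1, whose dominant left singular vector is the first axis.

```python
import math


def mat_vec(a, x):
    """Dense matrix-vector product."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def normalize(x):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def range_basis(a, omega, q):
    """Unit vector spanning (A A^T)^q A omega, for a single column omega."""
    at = transpose(a)
    y = normalize(mat_vec(a, omega))
    for _ in range(q):
        # One power iteration: multiply by A^T, then by A, then re-normalize.
        y = normalize(mat_vec(a, mat_vec(at, y)))
    return y


A = [[3.0, 0.0], [0.0, 1.0]]   # singular values 3 and 1
OMEGA = [1.0, 1.0]             # fixed stand-in for a random projection vector

q0 = range_basis(A, OMEGA, 0)  # no power iterations
q1 = range_basis(A, OMEGA, 1)  # one power iteration

# Cosine between the computed basis vector and the true dominant direction.
align0 = abs(q0[0])
align1 = abs(q1[0])
```

Here `align0` is 3/sqrt(10) and `align1` is 27/sqrt(730), so a single iteration moves the basis noticeably closer to the dominant direction. How much it helps in general depends on how quickly the singular values decay.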

                
      was (Author: dlyubimov):
    Yes, the expectation is zero, but the variance is going to be big regardless of the input size, I think, unfortunately. So the m Omega term is still a problem. For my problems, its brute-force computation will actually take more than, e.g., squaring my input. So it was a first thought, but I don't think it is valid enough, so I withdraw this for now.
                  
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: MAHOUT-817.patch

Rebasing on current trunk.
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: MAHOUT-817.patch, MAHOUT-817.patch, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>

        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210530#comment-13210530 ] 

jiraposter@reviews.apache.org commented on MAHOUT-817:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3863/
-----------------------------------------------------------

(Updated 2012-02-17 20:38:49.925577)


Review request for mahout.


Changes
-------

commit cd4862738fb74f01114e0e4c2fee8a737a009c13
Author: Dmitriy Lyubimov <dl...@inadco.com>
Date:   Fri Feb 17 12:35:47 2012 -0800

    Getting rid of prototype code; styling round

:100644 100644 d61210f... ebf087d... M  core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java
:100644 100644 254887a... d9c03cb... M  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java
:100644 100644 959d491... 8be8df1... M  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/Omega.java
:100644 000000 59bdedb... 0000000... D  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/PartialRowEmitter.java
:100644 100644 d247af4... 59f64ba... M  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCli.java
:100644 100644 96fe5e1... 1127f6a... M  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java
:100644 000000 09f05d1... 0000000... D  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototype.java
:100644 100644 915fce5... 4168e98... M  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDSolver.java
:100644 100644 885f5fa... 1346d71... M  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDPCADenseTest.j
:100644 100644 760c715... 280e10a... M  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverDenseTes
:100644 100644 7015283... 0e34568... M  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverSparseSe
:000000 100644 0000000... 5bb5706... A  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCommonTest.java
:100644 000000 503433f... 0000000... D  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototypeTest.java
:100644 100644 32342c1... d6605c1... M  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDTestsHelper.java


Summary
-------


2d542fd4dfcc6e01577bddc28600632a88e358ee Merge remote-tracking branch 'apache/trunk' into MAHOUT-817
1f245bb5cc1354e7495ec62fbc5f41ed6d590210 Merge branch 'trunk' into MAHOUT-817
458d8112de180c93d5194d67ccfc00442ed1d460 Merge remote-tracking branch 'apache/trunk' into MAHOUT-817
3fea9bd981043e268dd003d4c6c3943bb570c0f7 added test, bug fixes
2725c1061c167126238d288039f0f68baafa7dc8 adding --pca and --pcaOffset options, minor fixes
48c7b425241afff42ce52d3bb005a87aeb68386d fixing front end to factor in the median data.
4e072615ac2b8a256d037aaf00db21820abb91e2 tweaking B' job to produce necessary correctors s_q and s_b
b10fefd8d4aa5a0ed2f60902904d551afbbdf57e cosmetic fixes
849171d3af75117a2ee1115e6d5fc8e4a1fff5ce comment
6c196ea9606b3ca05d401fa1474ee9262a6c0303 retrofitting V job to do pca correction
e6fbe7cdb606698f180127302c33d30fffc6c4d7 adding pca options to Q,ABt jobs. still need to work on B'-job, V-job and front-end pca corrections.
ecf5dd21c5d5805d70715a78abd07246d171536c Computing s_b0
b9b33cf72af85ade16fcfbf4e13a036877489afb comments
9bb6e971c68e0674b087b8c5d64f4967878f1834 More cleanup in favor of standard functions, unit tests pass but need to verify the 2G benchmark.
39faa70158b52e50d31aca2abc4006874a9ea8fd cleanup I
780b291eb902e0e832d41748d45bf6d2163f9537 cosmetic changes, adding api with out redundant parameters
02daf0024489305032320c578ac546c16bda31c1 current MAHOUT-923 patch from Raphael


This addresses bug MAHOUT-817.
    https://issues.apache.org/jira/browse/MAHOUT-817


Diffs (updated)
-----

  core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/DatasetSplitter.java c9003ad 
  core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/FactorizationEvaluator.java 0c6e3f7 
  core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/ParallelALSFactorizationJob.java 7dc3b79 
  core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/RecommenderJob.java 9ca0b16 
  core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java 1feaa03 
  core/src/main/java/org/apache/mahout/cf/taste/hadoop/preparation/PreparePreferenceMatrixJob.java fbe8914 
  core/src/main/java/org/apache/mahout/cf/taste/hadoop/pseudo/RecommenderJob.java 02d1ba6 
  core/src/main/java/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.java 951c860 
  core/src/main/java/org/apache/mahout/cf/taste/hadoop/slopeone/SlopeOneAverageDiffsJob.java 57fa036 
  core/src/main/java/org/apache/mahout/cf/taste/impl/model/PlusAnonymousConcurrentUserDataModel.java 11eb295 
  core/src/main/java/org/apache/mahout/cf/taste/impl/model/PlusAnonymousUserDataModel.java 7f9cfd4 
  core/src/main/java/org/apache/mahout/classifier/naivebayes/test/TestNaiveBayesDriver.java 15da502 
  core/src/main/java/org/apache/mahout/classifier/naivebayes/training/TrainNaiveBayesJob.java 4da6426 
  core/src/main/java/org/apache/mahout/clustering/AbstractCluster.java 2ceb01b 
  core/src/main/java/org/apache/mahout/clustering/CIMapper.java 5f25f4f 
  core/src/main/java/org/apache/mahout/clustering/CIReducer.java 726363e 
  core/src/main/java/org/apache/mahout/clustering/Cluster.java 2f8d4dd 
  core/src/main/java/org/apache/mahout/clustering/ClusterIterator.java e39c71e 
  core/src/main/java/org/apache/mahout/clustering/ClusterWritable.java dba8c37 
  core/src/main/java/org/apache/mahout/clustering/ClusteringPolicy.java b07b649 
  core/src/main/java/org/apache/mahout/clustering/ClusteringPolicyWritable.java 8c148a8 
  core/src/main/java/org/apache/mahout/clustering/DirichletClusteringPolicy.java 116973f 
  core/src/main/java/org/apache/mahout/clustering/FuzzyKMeansClusteringPolicy.java 6c39d94 
  core/src/main/java/org/apache/mahout/clustering/KMeansClusteringPolicy.java 7b0d874 
  core/src/main/java/org/apache/mahout/clustering/Model.java 79dab30 
  core/src/main/java/org/apache/mahout/clustering/WeightedPropertyVectorWritable.java 92373eb 
  core/src/main/java/org/apache/mahout/clustering/canopy/CanopyDriver.java 7147015 
  core/src/main/java/org/apache/mahout/clustering/canopy/CanopyMapper.java 52fe865 
  core/src/main/java/org/apache/mahout/clustering/canopy/CanopyReducer.java ca814f9 
  core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationConfigKeys.java 366ec3c 
  core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationDriver.java 49a9cfc 
  core/src/main/java/org/apache/mahout/clustering/classify/ClusterClassificationMapper.java 09be170 
  core/src/main/java/org/apache/mahout/clustering/dirichlet/DirichletCluster.java 7293479 
  core/src/main/java/org/apache/mahout/clustering/dirichlet/DirichletClusterer.java 3cf25bc 
  core/src/main/java/org/apache/mahout/clustering/dirichlet/DirichletState.java d19842f 
  core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansClusterer.java 2d882b0 
  core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansDriver.java aa7389f 
  core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/FuzzyKMeansUtil.java 5f6cb47 
  core/src/main/java/org/apache/mahout/clustering/fuzzykmeans/SoftCluster.java 52fd764 
  core/src/main/java/org/apache/mahout/clustering/kmeans/Cluster.java PRE-CREATION 
  core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansClusterMapper.java 3cf41ec 
  core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansClusterer.java 9471e74 
  core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansCombiner.java eb086d8 
  core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansDriver.java 1099206 
  core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansMapper.java 0945dcb 
  core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansReducer.java bb777a4 
  core/src/main/java/org/apache/mahout/clustering/kmeans/KMeansUtil.java 1c84f87 
  core/src/main/java/org/apache/mahout/clustering/kmeans/Kluster.java 8b22709 
  core/src/main/java/org/apache/mahout/clustering/kmeans/RandomSeedGenerator.java 4a725e7 
  core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopy.java 28fc43b 
  core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopyDriver.java a33f1ca 
  core/src/main/java/org/apache/mahout/clustering/spectral/eigencuts/EigencutsDriver.java 06e0549 
  core/src/main/java/org/apache/mahout/clustering/spectral/kmeans/SpectralKMeansDriver.java 82daa5b 
  core/src/main/java/org/apache/mahout/clustering/topdown/postprocessor/ClusterCountReader.java 11c4d88 
  core/src/main/java/org/apache/mahout/common/AbstractJob.java 55040f6 
  core/src/main/java/org/apache/mahout/common/commandline/DefaultOptionCreator.java 868d82f 
  core/src/main/java/org/apache/mahout/common/iterator/sequencefile/PathFilters.java 19f78b5 
  core/src/main/java/org/apache/mahout/graph/AdjacencyMatrixJob.java ae419f6 
  core/src/main/java/org/apache/mahout/graph/linkanalysis/RandomWalk.java 5727a77 
  core/src/main/java/org/apache/mahout/graph/linkanalysis/RandomWalkWithRestartJob.java fcf4549 
  core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 3e0dd5e 
  core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRE-CREATION 
  core/src/main/java/org/apache/mahout/math/hadoop/MatrixMultiplicationJob.java e907a6d 
  core/src/main/java/org/apache/mahout/math/hadoop/TransposeJob.java a046b41 
  core/src/main/java/org/apache/mahout/math/hadoop/decomposer/DistributedLanczosSolver.java c81ef71 
  core/src/main/java/org/apache/mahout/math/hadoop/decomposer/EigenVerificationJob.java 2e152c4 
  core/src/main/java/org/apache/mahout/math/hadoop/similarity/SeedVectorUtil.java 4d63f46 
  core/src/main/java/org/apache/mahout/math/hadoop/similarity/cooccurrence/RowSimilarityJob.java ff517dc 
  core/src/main/java/org/apache/mahout/math/hadoop/solver/DistributedConjugateGradientSolver.java eba6d2a 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/ABtDenseOutJob.java c52fe2a 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java 0c3a996 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/Omega.java 0fa8707 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/PartialRowEmitter.java 59bdedb 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/QJob.java 703c420 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCli.java d314186 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java PRE-CREATION 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototype.java 98c8c59 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDSolver.java b1a8b56 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/UJob.java 53f26f4 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/VJob.java d58789e 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/YtYJob.java bd8c6b1 
  core/src/main/java/org/apache/mahout/math/stats/entropy/Entropy.java 4a8078e 
  core/src/main/java/org/apache/mahout/vectorizer/collocations/llr/CollocDriver.java 7a0c639 
  core/src/test/java/org/apache/mahout/cf/taste/impl/model/PlusAnonymousConcurrentUserDataModelTest.java 984ef6c 
  core/src/test/java/org/apache/mahout/clustering/TestClusterClassifier.java 391bdf6 
  core/src/test/java/org/apache/mahout/clustering/TestClusterInterface.java d9f06ec 
  core/src/test/java/org/apache/mahout/clustering/canopy/TestCanopyCreation.java 0b70339 
  core/src/test/java/org/apache/mahout/clustering/classify/ClusterClassificationDriverTest.java 8a5e1ea 
  core/src/test/java/org/apache/mahout/clustering/dirichlet/TestDirichletClustering.java d87c3e3 
  core/src/test/java/org/apache/mahout/clustering/dirichlet/TestMapReduce.java c996d97 
  core/src/test/java/org/apache/mahout/clustering/kmeans/TestKmeansClustering.java aa32112 
  core/src/test/java/org/apache/mahout/clustering/meanshift/TestMeanShift.java 8dd9d41 
  core/src/test/java/org/apache/mahout/common/AbstractJobTest.java 4feae91 
  core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 0ef8622 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDPCADenseTest.java PRE-CREATION 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverDenseTest.java 59f79c5 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverSparseSequentialTest.java beb0102 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCommonTest.java PRE-CREATION 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototypeTest.java 503433f 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDTestsHelper.java 32342c1 
  examples/src/main/java/org/apache/mahout/cf/taste/example/email/MailToPrefsDriver.java 1781481 
  examples/src/main/java/org/apache/mahout/classifier/email/PrepEmailVectorsDriver.java 4d4836f 
  examples/src/main/java/org/apache/mahout/clustering/display/DisplayClustering.java 7faf92e 
  examples/src/main/java/org/apache/mahout/clustering/display/DisplayDirichlet.java 2edadf1 
  examples/src/main/java/org/apache/mahout/clustering/display/DisplayFuzzyKMeans.java a5ef4d0 
  examples/src/main/java/org/apache/mahout/clustering/display/DisplayKMeans.java bc5c2ea 
  examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/canopy/Job.java 3833932 
  examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/dirichlet/Job.java 32b9efe 
  examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/fuzzykmeans/Job.java 3ac3cca 
  examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/kmeans/Job.java d63ac9e 
  examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/meanshift/Job.java ef69827 
  integration/pom.xml b751b98 
  integration/src/main/java/org/apache/mahout/classifier/ConfusionMatrixDumper.java 5958ce8 
  integration/src/main/java/org/apache/mahout/utils/MatrixDumper.java b71cb95 
  integration/src/main/java/org/apache/mahout/utils/SequenceFileDumper.java e108aa4 
  integration/src/main/java/org/apache/mahout/utils/clustering/ClusterDumper.java 3bc72ab 
  integration/src/main/java/org/apache/mahout/utils/vectors/RowIdJob.java 11769b1 
  integration/src/main/java/org/apache/mahout/utils/vectors/VectorDumper.java 5a9d0f2 
  integration/src/main/java/org/apache/mahout/utils/vectors/VectorHelper.java 716aaf9 
  integration/src/test/java/org/apache/mahout/clustering/dirichlet/TestL1ModelClustering.java eef9551 
  pom.xml 7485994 

Diff: https://reviews.apache.org/r/3863/diff


Testing
-------

Additional unit tests for PCA


Thanks,

Dmitriy


                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.7
>
>         Attachments: MAHOUT-817.patch, MAHOUT-817.patch, MAHOUT-817.patch, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>

        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Raphael Cendrillon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158203#comment-13158203 ] 

Raphael Cendrillon commented on MAHOUT-817:
-------------------------------------------

I noticed the same thing in some quick MATLAB tests. The orthogonal basis (Q) of Y does not seem to change much even if mean subtraction is not applied to A, and this seems to hold even when the mean of A is not zero. I still need to think about this some more to understand whether it is always the case.




                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Fix Version/s:     (was: 0.6)
                   0.7
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog, 0.7
>
>         Attachments: MAHOUT-817.patch, MAHOUT-817.patch, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>

        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158065#comment-13158065 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

The situation gets even hairier once you factor in power iterations and the future Cholesky-route option, unless you assume already-modified input. So far I am dubious about everything except brute force, from every angle.
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment:     (was: SSVD-PCA options.pdf)
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.7
>
>         Attachments: MAHOUT-817-RC1.patch, MAHOUT-817.patch, MAHOUT-817.patch, MAHOUT-817.patch, SSVD-CLI.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>

        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158779#comment-13158779 ] 

Ted Dunning commented on MAHOUT-817:
------------------------------------

{quote}
BTW is there a formal name for the vector product of a and b in the form of a new vector (a_1 * b_1, a_2 * b_2, ... a_n * b_n)?
{quote}
Element-wise product.
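
For illustration only (not part of Ted's reply): the element-wise product, also called the Hadamard product, multiplies matching components of two equal-length vectors. A minimal Python sketch; the function name is hypothetical and is not a Mahout API.

```python
def elementwise_product(a, b):
    """Return the vector (a_1*b_1, a_2*b_2, ..., a_n*b_n)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]


print(elementwise_product([1, 2, 3], [4, 5, 6]))  # [4, 10, 18]
```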

                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: SSVD-PCA options.pdf
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, SSVD-PCA options.pdf, SSVD-PCA options.pdf, ssvd.m
>
>

        

[jira] [Issue Comment Edited] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176388#comment-13176388 ] 

Dmitriy Lyubimov edited comment on MAHOUT-817 at 12/28/11 12:58 AM:
--------------------------------------------------------------------

btw this patch doesn't address the "folding in" and "folding out" use cases, which are basically special cases of SVD fold-in adjusted for row-wise input and the PCA offset.

Do we want to leave them out of scope? It usually doesn't make sense to do this in a batch, but rather in real time, which requires some indexing mechanism for V (and U). Other than that, it is a simple multiplication operation; perhaps we could just engineer a fold-in using regular distributed matrix operations? I have never investigated batch fold-in with Mahout.
                
      was (Author: dlyubimov):
    btw this patch doesn't address use cases of "folding in" and "folding out" which are basically special cases of SVD fold-in  adjusted to row-wise input and PCA offset.

Do we want to leave it out of scope? Generally it usually doesn't make sense to do this stuff in a batch, but rather in real time which requires indexing mechanism of V (and U). Other than that, it is a simple multiplication operation, perhaps we could just engineer a fold-in using regular distributed matrix operations? I never investigated an issue of a batch fold in with Mahout.
                  
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: MAHOUT-817.patch, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>

        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Raphael Cendrillon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163747#comment-13163747 ] 

Raphael Cendrillon commented on MAHOUT-817:
-------------------------------------------

Yeah. It looks like this will indeed be necessary. 

By the way, could you take a look through the column-wise mean job in MAHOUT-880?



                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment:     (was: ssvd.R)
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, ssvd.m
>
>

        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13159877#comment-13159877 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

Rolling back the solution for now; there are errors.
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: ssvd.m
>
>

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: SSVD-PCA options.pdf

Minor edits.
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, SSVD-PCA options.pdf
>
>

        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176388#comment-13176388 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

btw this patch doesn't address the "folding in" and "folding out" use cases, which are basically special cases of SVD fold-in adjusted for row-wise input and the PCA offset.

Do we want to leave them out of scope? It usually doesn't make sense to do this in a batch, but rather in real time, which requires an indexing mechanism for V (and U). Other than that, it is a simple multiplication operation; perhaps we could just engineer a fold-in using regular distributed matrix operations? I have never investigated batch fold-in with Mahout.
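
To make the "simple multiplication operation" concrete, here is a minimal Python sketch. It assumes the usual SVD fold-in rule u = a V Sigma^-1 for a row-wise input, with the column mean xi subtracted from the new row first (the PCA offset). The function and variable names are hypothetical, and none of this is part of the patch.

```python
def fold_in_row(a, xi, v, sigma):
    """Project a new row into the k-dimensional space: u = (a - xi) V Sigma^-1.

    a     -- new row as a sparse {column: value} dict
    xi    -- column mean vector of length n (the PCA offset)
    v     -- right singular vectors, n rows of k values each
    sigma -- the k singular values
    """
    k = len(sigma)
    # xi^T V is identical for every folded-in row, so it could be cached.
    offset = [sum(xi[j] * v[j][c] for j in range(len(xi))) for c in range(k)]
    u = [-o for o in offset]
    # Only the non-zero entries of the new row are touched.
    for j, value in a.items():
        for c in range(k):
            u[c] += value * v[j][c]
    return [u[c] / sigma[c] for c in range(k)]


# Toy model: n = 3 columns, k = 2 components, V has orthonormal columns.
v = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sigma = [2.0, 4.0]
xi = [1.0, 2.0, 3.0]
print(fold_in_row({0: 5.0, 1: 10.0}, xi, v, sigma))  # [2.0, 2.0]
```

Because the offset term xi^T V does not depend on the incoming row, a real-time service would only need V, Sigma and that one cached k-vector, which is where the indexing question above comes in.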
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: MAHOUT-817.patch, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>

        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment:     (was: SSVD-PCA options.pdf)
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, SSVD-PCA options.pdf, SSVD-PCA options.pdf, ssvd.m
>
>

        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175599#comment-13175599 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

I merged with MAHOUT-923 and started some initial cleanup and work on this in the MAHOUT-817 branch on my GitHub.

It is mostly cleanup so far: removing old kludgy code and replacing it with standard vector framework functions.
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>

        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Raphael Cendrillon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175615#comment-13175615 ] 

Raphael Cendrillon commented on MAHOUT-817:
-------------------------------------------

Thanks for merging, Dmitriy. Is there anything you need from me at this point?
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>

        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163741#comment-13163741 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

OK, I found a case where the Y fix matters. As soon as I move the random generator off zero mean for the simulated orthonormal matrices behind the test input, a difference between the versions with and without the Y fix appears in the output.

The first printout is for the PCA routine with the Y fix, the second is for the PCA routine without the Y fix, and the third is SSVD over the (A - mean) matrix.

I re-attached the newest R files.

{code}
> ## PCActest
> # compute median xi
> 
> xfixed=matrix(nrow=m,ncol=n)
> for ( i in 1:m) xfixed[i,]=x[i,]-xi
> 
> 
> respca=ssvd.cpca(x,k,qiter=qi)
fixing Y...
Warning message:
In sqrt(e$values) : NaNs produced
> # compare also with results when Y fix is ignored
> respca1=ssvd.cpca(x,k,qiter=qi,fixY=F)
Warning message:
In sqrt(e$values) : NaNs produced
> 
> ressvd=ssvd.svd(xfixed,k,qiter=qi)
> 
> # compare 3 sets of singular values
> respca$svalues
 [1] 9.0584987 8.0500343 7.0271257 6.0267613 5.0266239 4.0221945 3.0428140
 [8] 2.0328541 1.1788628 0.8524032
> respca1$svalues
 [1] 9.0504971 8.0487910 7.0238114 6.0246926 5.0250013 4.0221219 3.0371404
 [8] 2.0306501 1.0668975 0.3805301
> ressvd$svalues
 [1] 9.0584987 8.0500343 7.0271257 6.0267613 5.0266239 4.0221945 3.0428140
 [8] 2.0328541 1.1788628 0.8524032
> 
> #compare first rows of singular vectors
> respca$v[1,]
 [1]  0.010705297  0.002515335 -0.015630454 -0.023178851 -0.022406230
 [6] -0.023602299  0.016234821  0.045020972 -0.084333758 -0.053624133
> respca1$v[1,]
 [1] -0.010691547  0.002485415 -0.015705498 -0.023117058  0.022482137
 [6] -0.023557896  0.015686873  0.046335615 -0.061378867 -0.226028214
> ressvd$v[1,]
 [1]  0.010705297  0.002515335 -0.015630454 -0.023178851 -0.022406230
 [6] -0.023602299  0.016234821 -0.045020972  0.084333758 -0.053624133
> 
{code}
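
For readers without the attached PDF, one way to read the "Y fix" (an assumption here, since the derivation itself is not reproduced in this thread) is the identity (A - 1 xi^T) Omega = A Omega - 1 (xi^T Omega): the column mean xi can be subtracted from Y = A Omega one row at a time, without ever forming the dense mean-subtracted input. The Python sketch below checks that identity on a tiny sparse matrix; all names are hypothetical and this is not the Mahout implementation.

```python
import random


def column_means(rows, n):
    """Column-wise mean of a sparse matrix given as a list of {column: value} dicts."""
    xi = [0.0] * n
    for row in rows:
        for j, value in row.items():
            xi[j] += value
    return [s / len(rows) for s in xi]


def project_sparse(rows, omega, xi=None):
    """Compute Y = A Omega while touching only the non-zeros of A.

    If xi is given, subtract xi^T Omega from every row of Y, which equals
    (A - 1 xi^T) Omega without densifying A.
    """
    k = len(omega[0])
    shift = [0.0] * k
    if xi is not None:
        shift = [sum(xi[j] * omega[j][c] for j in range(len(xi))) for c in range(k)]
    y = []
    for row in rows:
        y_row = [-s for s in shift]
        for j, value in row.items():
            for c in range(k):
                y_row[c] += value * omega[j][c]
        y.append(y_row)
    return y


def project_dense_centered(rows, omega, xi):
    """Brute-force reference: densify each row, subtract xi, then multiply."""
    n, k = len(xi), len(omega[0])
    out = []
    for row in rows:
        dense = [row.get(j, 0.0) - xi[j] for j in range(n)]
        out.append([sum(dense[j] * omega[j][c] for j in range(n)) for c in range(k)])
    return out


rng = random.Random(0)
n, k = 6, 3
A = [{0: 2.0, 4: 1.0}, {1: 3.0}, {0: 1.0, 5: 4.0}, {2: 5.0, 3: 1.0}]
omega = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]
xi = column_means(A, n)

fast = project_sparse(A, omega, xi)
slow = project_dense_centered(A, omega, xi)
max_diff = max(abs(f - s) for fr, sr in zip(fast, slow) for f, s in zip(fr, sr))
print("largest difference between the two routes:", max_diff)
```

The two routes agree up to floating-point rounding, so the interesting question in the transcript above is not whether the correction is algebraically right but how much skipping it perturbs the recovered singular values and vectors.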
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>

        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158986#comment-13158986 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

OK, that's what I suspected. But I think the variance is going to depend a lot on the variance in the input (between different rows). Can you test how the result is affected if you increase the variance of the input so that deviation >> mean?
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, SSVD-PCA options.pdf, SSVD-PCA options.pdf, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Issue Comment Edited] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163192#comment-13163192 ] 

Dmitriy Lyubimov edited comment on MAHOUT-817 at 12/6/11 2:37 AM:
------------------------------------------------------------------

And I also don't see any difference for small 100x200 inputs between pci and svd on a fixed (mean-subtracted) input, even if I bypass the Y correction for the mean for the Ys in both B_0 and the power iterations!

Perhaps it has to do with the way I generate the input. That also may not necessarily be the case for extremely sparse inputs.

But I think the first patch could bypass the Y fix.

{code}
> respci$svalues
 [1] 9.9013440 8.9980801 7.9936265 6.9882617 5.9982148 4.9935232 3.9848657
 [8] 2.9811621 1.9891654 0.9977757
> ressvd$svalues
 [1] 9.9013440 8.9980801 7.9936265 6.9882617 5.9982148 4.9935232 3.9848657
 [8] 2.9811621 1.9891654 0.9977757
> 
{code}
                
      was (Author: dlyubimov):
    and i also don't see any difference for small 100x200 inputs between pci and svd on a fixed(mean subtracted) input even if bypass median correction for Ys in both B_0 and power iterations!.. 

perhaps it has to do with the way i generate the input. that also may not necessarily be the case for extreme sparse cases. 

But i think first patch could bypass the Y fix.

{code}
 respci$svalues
 [1] 9.9013440 8.9980801 7.9936265 6.9882617 5.9982148 4.9935232 3.9848657
 [8] 2.9811621 1.9891654 0.9977757
> ressvd$svalues
 [1] 9.9013440 8.9980801 7.9936265 6.9882617 5.9982148 4.9935232 3.9848657
 [8] 2.9811621 1.9891654 0.9977757
> 
{code}
                  
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Raphael Cendrillon (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raphael Cendrillon updated MAHOUT-817:
--------------------------------------

    Attachment: ssvd.m

Here's a little snippet of Matlab code which evaluates the performance of SSVD with and without mean subtraction on A.

At first glance it seems that Q is relatively insensitive to the mean of A, so that reasonable performance can be achieved even if A is not normalized.

I'm not sure if there are corner cases where this may not hold. It probably requires further study.
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, SSVD-PCA options.pdf, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Fix Version/s: 0.6
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.6, Backlog
>
>         Attachments: MAHOUT-817.patch, MAHOUT-817.patch, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Raphael Cendrillon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157699#comment-13157699 ] 

Raphael Cendrillon commented on MAHOUT-817:
-------------------------------------------

Dmitriy, what's the current state of this? I'll start looking into this if it suits.
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

Re: [jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by Raphael Cendrillon <ce...@gmail.com>.
> BTW is there a formal name of a vector product of a and b in a form of a new vector (a_1 * b_1, a_2 * b_2, ... a_n * b_n)? 

I call this element-wise multiplication, although I'm not sure if that's a formal term.
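
For reference, the operation described in the question is commonly called the Hadamard (element-wise) product. A minimal plain-Python sketch, with `a`, `b` and `hadamard` as illustrative names only:

```python
# Element-wise (Hadamard) product of two equal-length vectors.
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

# Pair up the components and multiply each pair.
hadamard = [x * y for x, y in zip(a, b)]

print(hadamard)  # [4.0, 10.0, 18.0]
```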


[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158680#comment-13158680 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

BTW is there a formal name of a vector product of a and b in a form of a new vector (a_1 * b_1, a_2 * b_2, ... a_n * b_n)? 

Another problem I identified with the scheme is that Q is produced in blocks, so the entire row sum vector is not yet available at the point of the B' and BB' computation. There's one more step needed for this.
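
One possible reading of the blocking issue, as a sketch only: each block of Q can contribute a partial row sum, but the full sum exists only after those partials are reduced in a separate step. The block layout and the names `q_blocks`, `block_row_sum` and `s_q` are assumptions for illustration, not Mahout code.

```python
# Q arrives as row blocks (for example, one block per task).
q_blocks = [
    [[0.5, 0.1], [0.5, -0.2]],
    [[0.5, 0.3], [0.5, -0.4]],
]

def block_row_sum(block):
    """Sum the rows of one block, giving a k-vector."""
    return [sum(col) for col in zip(*block)]

# Step 1: each block emits its own partial sum independently.
partials = [block_row_sum(block) for block in q_blocks]

# Step 2: an extra reduction combines the partials into the full row sum.
s_q = [sum(vals) for vals in zip(*partials)]

# Cross-check against the sum over the whole, unblocked Q.
full_q = [row for block in q_blocks for row in block]
s_q_full = block_row_sum(full_q)
print(s_q, s_q_full)
```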
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: SSVD-CLI.pdf

reorganized SSVD-CLI manual.
                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.7
>
>         Attachments: MAHOUT-817-RC1.patch, MAHOUT-817.patch, MAHOUT-817.patch, MAHOUT-817.patch, SSVD-CLI.pdf, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205999#comment-13205999 ] 

jiraposter@reviews.apache.org commented on MAHOUT-817:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3863/
-----------------------------------------------------------

(Updated 2012-02-11 03:15:25.803911)


Review request for mahout.


Summary
-------


2d542fd4dfcc6e01577bddc28600632a88e358ee Merge remote-tracking branch 'apache/trunk' into MAHOUT-817
1f245bb5cc1354e7495ec62fbc5f41ed6d590210 Merge branch 'trunk' into MAHOUT-817
458d8112de180c93d5194d67ccfc00442ed1d460 Merge remote-tracking branch 'apache/trunk' into MAHOUT-817
3fea9bd981043e268dd003d4c6c3943bb570c0f7 added test, bug fixes
2725c1061c167126238d288039f0f68baafa7dc8 adding --pca and --pcaOffset options, minor fixes
48c7b425241afff42ce52d3bb005a87aeb68386d fixing front end to factor in the median data.
4e072615ac2b8a256d037aaf00db21820abb91e2 tweaking B' job to produce necessary correctors s_q and s_b
b10fefd8d4aa5a0ed2f60902904d551afbbdf57e cosmetic fixes
849171d3af75117a2ee1115e6d5fc8e4a1fff5ce comment
6c196ea9606b3ca05d401fa1474ee9262a6c0303 retrofitting V job to do pca correction
e6fbe7cdb606698f180127302c33d30fffc6c4d7 adding pca options to Q,ABt jobs. still need to work on B'-job, V-job and front-end pca corrections.
ecf5dd21c5d5805d70715a78abd07246d171536c Computing s_b0
b9b33cf72af85ade16fcfbf4e13a036877489afb comments
9bb6e971c68e0674b087b8c5d64f4967878f1834 More cleanup in favor of standard functions, unit tests pass but need to verify the 2G benchmark.
39faa70158b52e50d31aca2abc4006874a9ea8fd cleanup I
780b291eb902e0e832d41748d45bf6d2163f9537 cosmetic changes, adding api with out redundant parameters
02daf0024489305032320c578ac546c16bda31c1 current MAHOUT-923 patch from Raphael


This addresses bug MAHOUT-817.
    https://issues.apache.org/jira/browse/MAHOUT-817


Diffs
-----

  core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 3e0dd5e 
  core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRE-CREATION 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/ABtDenseOutJob.java c52fe2a 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java 0c3a996 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/Omega.java 0fa8707 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/QJob.java 703c420 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCli.java 0d81ccd 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDHelper.java PRE-CREATION 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototype.java 98c8c59 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDSolver.java b1a8b56 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/UJob.java 53f26f4 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/VJob.java d58789e 
  core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/YtYJob.java bd8c6b1 
  core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 0ef8622 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDPCADenseTest.java PRE-CREATION 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverDenseTest.java 59f79c5 
  core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverSparseSequentialTest.java beb0102 

Diff: https://reviews.apache.org/r/3863/diff


Testing
-------

Additional unit tests for PCA


Thanks,

Dmitriy


                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.7
>
>         Attachments: MAHOUT-817.patch, MAHOUT-817.patch, MAHOUT-817.patch, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Commented] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158188#comment-13158188 ] 

Dmitriy Lyubimov commented on MAHOUT-817:
-----------------------------------------

And it seems that when the mean of rows is used then, indeed, as Raphael is saying, the output of Q has to produce the sum of rows as a single vector, and with the mean of columns the output of Q will have to produce the sum of columns as a blocked vector. Then this vector must be incorporated into the Bt job to produce offsets there. Got it.
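
A small plain-Python check of the algebra behind the row-mean case, as a sketch under stated assumptions: with xi the mean of the rows of A and s_q the sum of the rows of Q, Q'(A - 1 xi') equals Q'A - s_q xi', so the offset can be applied without densifying A. The names `xi`, `s_q`, `matmul` and `transpose` are illustrative, and Q here is an arbitrary tall matrix, since the identity does not need orthonormal columns.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    yt = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in yt] for row in x]

# Small sparse-looking input A (4 x 3) and a tall matrix Q (4 x 2).
A = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 0.0],
     [4.0, 0.0, 0.0],
     [0.0, 1.0, 5.0]]
Q = [[0.5, 0.1],
     [0.5, -0.2],
     [0.5, 0.3],
     [0.5, -0.4]]

m = len(A)
xi = [sum(col) / m for col in zip(*A)]   # mean of the rows of A
s_q = [sum(col) for col in zip(*Q)]      # sum of the rows of Q, i.e. Q' * 1

# Direct route: subtract xi from every row (densifies A), then form Q' * A_c.
A_centered = [[a - x for a, x in zip(row, xi)] for row in A]
B_direct = matmul(transpose(Q), A_centered)

# Corrected route: Q' * A on the original A, minus the rank-one offset s_q * xi'.
QtA = matmul(transpose(Q), A)
B_corrected = [[QtA[i][j] - s_q[i] * xi[j] for j in range(len(xi))]
               for i in range(len(s_q))]

# Largest entry-wise disagreement between the two routes.
max_diff = max(abs(d - c)
               for rd, rc in zip(B_direct, B_corrected)
               for d, c in zip(rd, rc))
print(max_diff)
```

The two routes should agree up to floating-point rounding; only the second leaves A sparse.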



                
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment: ssvd.R
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: SSVD-PCA options.pdf, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment:     (was: SSVD-PCA options.pdf)
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: Backlog
>
>         Attachments: ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.


        

[jira] [Updated] (MAHOUT-817) Add PCA options to SSVD code

Posted by "Dmitriy Lyubimov (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-817:
------------------------------------

    Attachment:     (was: SSVD-CLI.pdf)
    
> Add PCA options to SSVD code
> ----------------------------
>
>                 Key: MAHOUT-817
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-817
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.6
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>             Fix For: 0.7
>
>         Attachments: MAHOUT-817-RC1.patch, MAHOUT-817.patch, MAHOUT-817.patch, MAHOUT-817.patch, SSVD-PCA options.pdf, ssvd-tests.R, ssvd.R, ssvd.m
>
>
> It seems that a simple solution should exist to integrate PCA mean subtraction into SSVD algorithm without making it a pre-requisite step and also avoiding densifying the big input. 
> Several approaches were suggested:
> 1) subtract mean off B
> 2) propagate mean vector deeper into algorithm algebraically where the data is already collapsed to smaller matrices
> 3) --?
> It needs some math done first . I'll take a stab at 1 and 2 but thoughts and math are welcome.
