You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mahout.apache.org by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org> on 2011/08/26 01:40:29 UTC

[jira] [Created] (MAHOUT-796) Modified power iterations in existing SSVD code

Modified power iterations in existing SSVD code
-----------------------------------------------

                 Key: MAHOUT-796
                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
             Project: Mahout
          Issue Type: Improvement
          Components: Math
    Affects Versions: 0.5
            Reporter: Dmitriy Lyubimov
            Assignee: Dmitriy Lyubimov
             Fix For: 0.6


Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 

Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .

Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Re: [jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by Ted Dunning <te...@gmail.com>.

One of the most serious worries is that B_i is roughly the same size as A
which may make the successive products very expensive.  In particular, it
looks at first glance like we may need column-wise segmentation of A in
addition to row-wise segmentation in order to compute A B'_j.  That is only
a first glance so it is likely not very deep.

On Thu, Aug 25, 2011 at 7:31 PM, Nathan Halko <na...@spotinfluence.com>wrote:

> The lower the condition number (or low signal to noise) the harder it is to
> extract the top k singular vectors because in a sense they are not that
> much
> more important than the other n-k.  We see pollution from the smaller n-k
> singular directions and that degrades our approximation of the top k space.
>  Power iterations (just a few) are extremely important to amplify the gap
> between important directions and the unimportant directions.  Instead of
> sampling matrix A, we sample matrix (AA*)^qA which has the same singular
> vectors but an exaggerated spectrum
>
>   sigma^{2q+1}
>
> In infinite precision there would be no need to orthogonalize between
> iterations, only at the last step.  However, in finite precision, the small
> singular values can fall below machine precision when taken to the 2q+1st
> power and we won't be able to accurately recover them.  It also prevents
> overflow if your matrix has a very large sig_max.  It is mostly a
> precaution
> to keep from loosing information and for most cases could probably be
> skipped or done only intermittently.  If orthogonalization is a bottleneck
> we could consider not doing it.
>
>
> > > Modified power iterations in existing SSVD code
> > > -----------------------------------------------
> > >
> > >                 Key: MAHOUT-796
> > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
> > >             Project: Mahout
> > >          Issue Type: Improvement
> > >          Components: Math
> > >    Affects Versions: 0.5
> > >            Reporter: Dmitriy Lyubimov
> > >            Assignee: Dmitriy Lyubimov
> > >              Labels: SSVD
> > >             Fix For: 0.6
> > >
> > >
> >
>

Re: [jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by Nathan Halko <na...@spotinfluence.com>.

The lower the condition number (or low signal to noise) the harder it is to
extract the top k singular vectors because in a sense they are not that much
more important than the other n-k.  We see pollution from the smaller n-k
singular directions and that degrades our approximation of the top k space.
 Power iterations (just a few) are extremely important to amplify the gap
between important directions and the unimportant directions.  Instead of
sampling matrix A, we sample matrix (AA*)^qA which has the same singular
vectors but an exaggerated spectrum

   sigma^{2q+1}

In infinite precision there would be no need to orthogonalize between
iterations, only at the last step.  However, in finite precision, the small
singular values can fall below machine precision when taken to the 2q+1st
power and we won't be able to accurately recover them.  It also prevents
overflow if your matrix has a very large sig_max.  It is mostly a precaution
to keep from loosing information and for most cases could probably be
skipped or done only intermittently.  If orthogonalization is a bottleneck
we could consider not doing it.


> > Modified power iterations in existing SSVD code
> > -----------------------------------------------
> >
> >                 Key: MAHOUT-796
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
> >             Project: Mahout
> >          Issue Type: Improvement
> >          Components: Math
> >    Affects Versions: 0.5
> >            Reporter: Dmitriy Lyubimov
> >            Assignee: Dmitriy Lyubimov
> >              Labels: SSVD
> >             Fix For: 0.6
> >
> >
>

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092385#comment-13092385 ] 

Dmitriy Lyubimov commented on MAHOUT-796:
-----------------------------------------

PS it also just occurred to me that full B' does not have to be ever written out either and can be computed in reducers of 2nd step. Then front end will just have to aggregate few triangular partial B' products produced by however many reducers, directly (we don't save full B' since it's symmetrical, nor do we compute full B). That saves on full Bt I/O and avoid startup costs of BB' job. 

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-796:
------------------------------------

    Attachment: MAHOUT-796-4.patch

oops. test fix.

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>         Attachments: MAHOUT-796-2.patch, MAHOUT-796-3.patch, MAHOUT-796-4.patch, MAHOUT-796.patch
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091532#comment-13091532 ] 

Dmitriy Lyubimov commented on MAHOUT-796:
-----------------------------------------

Orthogonalization, i wouldn't say it has been a bottleneck for me personally. I do orthogonalization in two MR steps, but those steps are also doing 2 multiplications. I am still looking to run an experiment with an input where it would be a practical problem (such as QR step taking > 10..15 min). It also so happened that my previous results were tainted by the fact that i ran without native libraries installed so decompression was chomping on my CPU too, so no clean data now. But it seems indeed like roughly 50% of everything else.

but i am still waiting for run-a-terabyte-input opportunity.

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-796:
------------------------------------

    Attachment: MAHOUT-796-3.patch

rewrote AB' and Bt jobs to do row-wise sparse blocking of outer product outputs from mapper and combiner. 

In 1G tests looks like a major improvment, about order of magnitude speed up for multiplication part and 4 times speed-up of a combo QR+multiplication (multiplication is still the bigger part). 

I think this is the final iteration on this issue, i will commit within couple of days.

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>         Attachments: MAHOUT-796-2.patch, MAHOUT-796-3.patch, MAHOUT-796.patch
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092000#comment-13092000 ] 

Ted Dunning commented on MAHOUT-796:
------------------------------------

To clarify a bit about my worries, the issue that I see is that we could compute A\Omega in a single pass of A because we could effectively store all of \Omega in memory (i.e. just keep the seed and hash function).  This is nice because row-wise decomposition of A makes everything work nicely.

In computing (AB') A out-of-core, however, we have a product of A by a tall skinny matrix B that requires roughly as much storage as A.  Each row-wise patch of A will have to be combined with each row-wise chunk of B' (i.e. each column-wise chunk of B).  This means that we have to read B many times which leads to quadratic time.



> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091527#comment-13091527 ] 

Dmitriy Lyubimov commented on MAHOUT-796:
-----------------------------------------

bq. For most recommendation applications, I would be surprised if iterations are important,

Ok I have no first hand knowledge myself, Nathan mentioned it did appear quite important in his previous work. I hope he could comment himself on a nature of the data he was looking at.

So are you encouraging to proceed, benchmark, compare? get more info? do it anyway? or drop it without comparisons?

  

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096598#comment-13096598 ] 

Dmitriy Lyubimov commented on MAHOUT-796:
-----------------------------------------

also changed CLI a little bit: 

* made A block height optional with default value of 10,000 which should be fine with most inputs on 64mb splits, ~200 eigen values and -Xmx500m in child processes. 

* added -oh outer product  sparse row-wise block cardinality used by 'big' multiplications Q'A and AB' with default value of 10,000. It may need increases with very sparse inputs in order to be more efficient for spill sorts (map-side combiners), but it would be always equally efficent in reduce-side sorts which was so far the longest running stuff in all there is (blocked multiplications are still most expensive; but they are closing in on QR expenses which does not use sorts). 



> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>         Attachments: MAHOUT-796-2.patch, MAHOUT-796-3.patch, MAHOUT-796-4.patch, MAHOUT-796.patch
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093389#comment-13093389 ] 

Dmitriy Lyubimov commented on MAHOUT-796:
-----------------------------------------

PS i removed result comparison to Colt's SVD since it takes too much time to compute 20mln matrix for its in-core full rank SVD algorithm.

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093388#comment-13093388 ] 

Dmitriy Lyubimov commented on MAHOUT-796:
-----------------------------------------

I re-did dense test to construct 20,000x1,000 dense matrix (20 mln non-zero elements)  with random singular vectors 

The way i construct input is i generate random singular vector matrices and orthogonalize them using stable Gramm-Schmidt, multiply one of them (whatever is shorter) by Sigma and then produce row-wise surrogate input. 

For predefined singular values = 10,4,1,(0.1...), n=1000, m=20000, k=3, p=10 i get stochastic values 
--SSVD solver singular values:
svs: 9.998401  3.998322  0.972622  0.100000  0.100000  0.100000  0.100000  0.100000  0.100000  0.100000  0.100000  0.100000  0.100000  


so you see if the decay is good then precision loss with 1 pass in my case doesn't exceed 2.8% in the worst case (3rd value) and the time is quite good. (same brunch as for mahout-797 in my github). 

Keep in mind that this precision loss also includes loss generated during simulated input construction, it's not all the solver's.

I also got rid of BBt job and fixed problems with sparse input on that branch.



> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-796:
------------------------------------

    Attachment: Power Iterations.pdf

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>         Attachments: MAHOUT-796-2.patch, MAHOUT-796-3.patch, MAHOUT-796-4.patch, MAHOUT-796.patch, Power Iterations.pdf
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Nathan Halko (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091545#comment-13091545 ] 

Nathan Halko commented on MAHOUT-796:
-------------------------------------

I checked Dmitriy's scheme today and it makes sense.  It accumulates Q* using the machinery already in place, QJob and BtJob

QR = Y = A\Omega
B0 = Q*A
B1 = (AB0*)*A = B0A*A = (Q*AA*)A = (Q_new)*A 

A'A should never be computed, only Z = A'AY where Y is dense and X=AY, Z=A'X avoiding the problem of scarce overlap and fill in.
 

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094114#comment-13094114 ] 

Dmitriy Lyubimov commented on MAHOUT-796:
-----------------------------------------

AFAIK distributed cache would actually do the same except it would also store the file on disk. 

The disadvantage here is that we add disk i/o time to this. The advantage is that if we hit the same node with a mapper of the same task more than once, as far as i understand, they'd have the entire B' locally. That's an interesting idea, actually. But for big clusters where a job is unlikely to hit the same node with more than 1 task, this probably would actually be detrimental. Plus, if B is really big (somethin like 100Gb big) then we are requiring a lot of hdd from a node. 

Plus for jobs that use memory mapping or any sort of random access, distributed cache is the only option -- but we don't need that. 

Ok, let me make implementation that opens a stream first, just to prove/measure whatever we are improving, and later perhaps there's a good sense to add an option to use distributed cache for this. Maybe there will be another trick we don't see to streamline this, but i so far did not find any. So it will give us some time to think.



> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Nathan Halko (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091545#comment-13091545 ] 

Nathan Halko edited comment on MAHOUT-796 at 8/26/11 3:32 AM:
--------------------------------------------------------------

I checked Dmitriy's scheme today and it makes sense.  It accumulates Q' using the machinery already in place, QJob and BtJob

QR = Y = A\Omega
B0 = Q'A
B1 = (AB0')'A = B0A'A = (Q'AA')A = (Q_new)'A 

A'A should never be computed, only Z = A'AY where Y is dense and X=AY, Z=A'X avoiding the problem of scarce overlap and fill in.
 

      was (Author: nathanhalko):
    I checked Dmitriy's scheme today and it makes sense.  It accumulates Q* using the machinery already in place, QJob and BtJob

QR = Y = A\Omega
B0 = Q*A
B1 = (AB0*)*A = B0A*A = (Q*AA*)A = (Q_new)*A 

A'A should never be computed, only Z = A'AY where Y is dense and X=AY, Z=A'X avoiding the problem of scarce overlap and fill in.
 
  
> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092380#comment-13092380 ] 

Dmitriy Lyubimov commented on MAHOUT-796:
-----------------------------------------

Hi, 

I put together B_i pipeline https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad. It seems it is a pretty straightforward enhancement that falls back on a lot of existing stuff, with fundamental additions of AB' multiplication and QR pushdown to reducer of the first job (instead of doing it in the mapper of the first job) 

bq. A quick and dirty trick to avoid even sweeping through A again is to neglect the cross terms in the product (AA')^qA\Omega and just use (A_iA_i')^qA_i\Omega. 

I think that even if that's less flops, it is still more difficult to implement than the full power iterations with reorthogonalization as you've initially proposed. 

After all, IMO there's no big reason to be afraid of more work for as long as it brings more precision and we have a control over how much more work we want to do. 

I also can incorporate a Cholesky trick into B_0 pipeline at some point -- or just have it as an alternative flow controlled by a job parameter.

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092386#comment-13092386 ] 

Dmitriy Lyubimov commented on MAHOUT-796:
-----------------------------------------

And finally, i can't see a reason why we can't incorporate "Cholesky trick" either by substituting Y_1..Y_q = AB' instead of A\Omega to compute B_i.

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Nathan Halko (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091528#comment-13091528 ] 

Nathan Halko commented on MAHOUT-796:
-------------------------------------

The lower the condition number (or low signal to noise) the harder it is to extract the top k singular vectors because in a sense they are not that much more important than the other n-k.  We see pollution from the smaller n-k singular directions and that degrades our approximation of the top k space.  Power iterations (just a few) are extremely important to amplify the gap between important directions and the unimportant directions.  Instead of sampling matrix A, we sample matrix (AA*)^qA which has the same singular vectors but an exaggerated spectrum

   sigma^{2q+1}

In infinite precision there would be no need to orthogonalize between iterations, only at the last step.  However, in finite precision, the small singular values can fall below machine precision when taken to the 2q+1st power and we won't be able to accurately recover them.  It also prevents overflow if your matrix has a very large sig_max.  It is mostly a precaution to keep from loosing information and for most cases could probably be skipped or done only intermittently.  If orthogonalization is a bottleneck we could consider not doing it.



> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092386#comment-13092386 ] 

Dmitriy Lyubimov edited comment on MAHOUT-796 at 8/28/11 12:49 AM:
-------------------------------------------------------------------

And finally, i can't see a reason why we can't incorporate "Cholesky trick" either by substituting Y_1..Y_q = AB' instead of A\Omega to compute B_i.

In other words, if we assert that we have some function such that currently provides B_0=g(Y_0) where Y_0=A\Omega, then there's no reason to assume we can't use the same function g to compute B_i=g(Y_i) for as long as Y_i=AB'_{i-1}. also see my scratchpad for the same.

      was (Author: dlyubimov):
    And finally, i can't see a reason why we can't incorporate "Cholesky trick" either by substituting Y_1..Y_q = AB' instead of A\Omega to compute B_i.
  
> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091509#comment-13091509 ] 

Dmitriy Lyubimov commented on MAHOUT-796:
-----------------------------------------

if by low condition number you mean ratio of max to min singular value, then with slow decay i never got good results either. That's a sign of either highly noisy or ever randomly generated data. I think it would help if user came forward and explained his/her case, esp. given experimental nature of the code.

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091538#comment-13091538 ] 

Ted Dunning commented on MAHOUT-796:
------------------------------------

For the in-memory implementations, I think that this is a non-issue.  Power iteration should simply be implemented.  In that case, the original form using Y = (A'A)^q A \Omega seems fine and I don't yet quite see how the iteration that Dmitriy proposes will get the right result.  Whichever method is used, it is a good thing to do.

The problems that I see are for the out-of-core problems.  There, computing A'A can often give pathologically bad results if the sparse pattern is highly skewed.  That approach also leads to significant fill-in which is not a good thing.  On the hand, multiplying A times anything too large to store in memory such as B typically is may be horribly bad as well. 

The orthogonalization is no big deal since it requires only a single pass through the data to accumulate the small matrix required for the Cholesky trick.

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-796:
------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Resolving.

i will add code doing the same with Cholesky trick as a part of MAHOUT-797 once it's implemented.

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>         Attachments: MAHOUT-796-2.patch, MAHOUT-796-3.patch, MAHOUT-796-4.patch, MAHOUT-796.patch
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096264#comment-13096264 ] 

Dmitriy Lyubimov edited comment on MAHOUT-796 at 9/2/11 8:18 PM:
-----------------------------------------------------------------

I am now convinced that all the inefficiency and cpu-bound behavior comes from the multiplications Q'B, AB'. Indeed, I generate outer products which i then output row by row and sum up in combiners and reducers. this creates tremendous pressure in terms of keys for spill and reduce sorts. Although it seems DRM.times(DRM) employs exactly the same problem. 

I think tremendous improvements would result if we output outer products of B' or AB' in vertical thin but long blocks. 

This seem like a trivial idea but I was preoccupied by 'proof of concept' issues whereas matrix multiplication has been seen as algorithmically trivial. 

the command line i used: 

bin/mahout ssvd --input /tmp/DRM/* --output /tmp/SSVD1 --tempDir /tmp/SSVD-tmp -k 10 -p 3 -q 1 -r 10000 --reduceTasks 3 -Dmapred.child.java.opts='-Xmx500m'

For ABt job you need perhaps more than double the memory of the split in case of dense input because ABtJob is catering to sparse inputs first and loads A block column-wise using SequentialAccessSparseVectors. So for a split size of 64Mb standard -Xmx200m may not be enough, and it is definitely not enough for 128mb splits.

note also that I ran on CDH3u0 without any Mahout recompilation with CDH3 binaries and it seems to work out of the box.


      was (Author: dlyubimov):
    I am now convinced that all the inefficiency and cpu-bound behavior comes from the multiplications Q'B, AB'. Indeed, I generate outer products which i then output row by row and sum up in combiners and reducers. this creates tremendous pressure in terms of keys for spill and reduce sorts.

I think tremendous improvements would result if we output outer products of B' or AB' in vertical thin but long blocks. This seem like a trivial idea but I was preoccupied by 'proof of concept' issues whereas matrix multiplication has been seen as algorithmically trivial. 

the command line i used: 

bin/mahout ssvd --input /tmp/DRM/* --output /tmp/SSVD1 --tempDir /tmp/SSVD-tmp -k 10 -p 3 -q 1 -r 10000 --reduceTasks 3 -Dmapred.child.java.opts='-Xmx500m'

note also that I ran on CDH3u0 without any Mahout recompilation with CDH3 binaries and it seems to work out of the box.

  
> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>         Attachments: MAHOUT-796-2.patch, MAHOUT-796.patch
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096622#comment-13096622 ] 

Hudson commented on MAHOUT-796:
-------------------------------

Integrated in Mahout-Quality #1017 (See [https://builds.apache.org/job/Mahout-Quality/1017/])
    MAHOUT-796: SSVD power iterations; performance patches.

dlyubimov : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1164806
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/common/IOUtils.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/ABtJob.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BBtJob.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/BtJob.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/GivensThinSolver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/Omega.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/QJob.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDCli.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototype.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDSolver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SparseRowBlockAccumulator.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SparseRowBlockWritable.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/SplitPartitionedWritable.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/UpperTriangular.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/YtYJob.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/qr
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/qr/GivensThinSolver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/qr/GrammSchmidt.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/qr/QRFirstStep.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/qr/QRLastStep.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverDenseTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/LocalSSVDSolverSparseSequentialTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDPrototypeTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/hadoop/stochasticsvd/SSVDTestsHelper.java
* /mahout/trunk/src/conf/ssvd.props


> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>         Attachments: MAHOUT-796-2.patch, MAHOUT-796-3.patch, MAHOUT-796-4.patch, MAHOUT-796.patch
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-796:
------------------------------------

    Attachment: MAHOUT-796-2.patch

Fixed CLI issues. 

Tested distributed MR on 1Gb input with multiple reducers q=1 ok.

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>         Attachments: MAHOUT-796-2.patch, MAHOUT-796.patch
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095146#comment-13095146 ] 

Dmitriy Lyubimov commented on MAHOUT-796:
-----------------------------------------

Ok first implementation with QR solvers is ready, added -q parameter. (all in git remote git@github.com:dlyubimov/mahout-commits branch MAHOUT-796)

Did not have time to test distributed version and larger inputs. on a toy input 1000x2000, k=3, p=10 (optimal 10,4,1,(0.1) ): 

q=0: 
--SSVD solver singular values:
svs: 9.998472  3.993542  0.990456  0.100000  0.100000  0.100000  0.100000  0.100000  0.100000  0.100000  0.100000  0.100000  0.100000  

q=1: (+2 more sequential steps):
--SSVD solver singular values:
svs: 10.000000  4.000000  0.999999  0.100000  0.100000  0.100000  0.100000  0.100000  0.100000  0.100000  0.100000  0.100000  0.100000  

So, much better (although much slower as well). 
I of course understand that each run exhibit noise, so to prove it works better consistently i need to run more than just 2 attempts. But that's encouraging. it worked (and actually at first attempt)!

I tried some optimization to handle sparse cases a little better as well, i guess it taxes densier cases a little bit.

So this will be put on hold until i add Cholesky option and then i will have to return to this issue to enable the same schema but Y'Y+ Cholesky path.

I refactored QR steps into standalone OutputCollector implementations so that they can now be more easily be pipelined inside mappers and reducers so code is much more readable now. 

So after a few tests and final fixes i think it is a commit worthy but it has dependency on Ted's refactoring MAHOUT-790 pushed to trunk.



> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Nathan Halko (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095365#comment-13095365 ] 

Nathan Halko commented on MAHOUT-796:
-------------------------------------

Wow those are great results Dmitriy!  q definitely adds power and reliability to the algorithm.

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092386#comment-13092386 ] 

Dmitriy Lyubimov edited comment on MAHOUT-796 at 8/28/11 12:52 AM:
-------------------------------------------------------------------

And finally, i can't see a reason why we can't incorporate "Cholesky trick" either by substituting Y_1..Y_q = AB' instead of A\Omega to compute B_i.

In other words, MR operationally aside, if we assert that we have some function such that currently provides B_0=g(Y_0) where Y_0=A\Omega, then there's no reason to assume we can't use the same function g to compute B_i=g(Y_i) for as long as Y_i=AB'_{i-1}. also see my scratchpad for the same.

      was (Author: dlyubimov):
    And finally, i can't see a reason why we can't incorporate "Cholesky trick" either by substituting Y_1..Y_q = AB' instead of A\Omega to compute B_i.

In other words, if we assert that we have some function such that currently provides B_0=g(Y_0) where Y_0=A\Omega, then there's no reason to assume we can't use the same function g to compute B_i=g(Y_i) for as long as Y_i=AB'_{i-1}. also see my scratchpad for the same.
  
> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091515#comment-13091515 ] 

Ted Dunning commented on MAHOUT-796:
------------------------------------

Yes.  That is what I mean by condition number.  

The only place that I have seen slow decay is with synthetically generated random data.  Even real random data has some interesting singular values.

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091601#comment-13091601 ] 

Ted Dunning commented on MAHOUT-796:
------------------------------------

One problem here is that the Q's are large and potentially dense.  Thus, accumulating them is not a great idea.  That can be worked around in the single iteration because we can keep R in memory and can reconstruct chunks of Y given chunks of A.

That trick becomes a bit more involved if we want to keep all of the Q's in such an implicit form.  Computing (AA')^q A\Omega is relatively simple as you point out if A' is available as a linear operator, but I thought I understood that reorthogonalizing was a good idea.  What I don't see is how to re-orthogonalize without keeping very large matrices in memory or doing a dangerously dense operation.  Yet.





> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092385#comment-13092385 ] 

Dmitriy Lyubimov edited comment on MAHOUT-796 at 8/27/11 9:45 PM:
------------------------------------------------------------------

PS it also just occurred to me that full B' does not have to be ever written out either because BB' can be accumulated in reducers of 2nd step. Then front end will just have to aggregate few triangular partial B' products produced by however many reducers, directly (we don't save full B' since it's symmetrical, nor do we compute full B). That saves on full Bt I/O and avoids startup costs of BB' job. 

Thus, full job is 2 MR passes with q=0, 4MR passes with q=1 and 6MR passes with q=2. If understand Nathan's point right, 3 orthogonalizations (which corresponds to q=2) is quite enough.

V and U jobs are optional and running in parallel, so they can count for another iteration.

so with q=2 (maximum case) and U,V output requested we end up with 7 *sequential* MR iterations.


      was (Author: dlyubimov):
    PS it also just occurred to me that full B' does not have to be ever written out either and can be computed in reducers of 2nd step. Then front end will just have to aggregate few triangular partial B' products produced by however many reducers, directly (we don't save full B' since it's symmetrical, nor do we compute full B). That saves on full Bt I/O and avoids startup costs of BB' job. 

Thus, full job is 2 MR passes with q=0, 4MR passes with q=1 and 6MR passes with q=2. If understand Nathan's point right, 3 orthogonalizations (which corresponds to q=2) is quite enough.

V and U jobs are optional and running in parallel, so they can count for another iteration.

so with q=2 (maximum case) and U,V output requested we end up with 7 *sequential* MR iterations.

  
> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094018#comment-13094018 ] 

Dmitriy Lyubimov commented on MAHOUT-796:
-----------------------------------------

AB' is a heavy multiplication of course. 

I don't want to use standard multiplication because

-- i want to be doing more things in reducer 
-- need custom grouping/sorting to ensure output is partitioned the same way as A splits 

At this point it seems that the best strategy is just to preload entire A block into memory as a (sparse) matrix and open B' stream as a side file and hope it is not going to generate too much flood i/o. I don't know a workaround for it anyway since whatever blocking scheme is used, we need cartesian products from both matrix inputs and that will cause i/o and i don't think there's any clever collocation trick to be had there



> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092385#comment-13092385 ] 

Dmitriy Lyubimov edited comment on MAHOUT-796 at 8/27/11 9:47 PM:
------------------------------------------------------------------

PS it also just occurred to me that full B' does not have to be ever written out either because BB' can be accumulated in reducers of 2nd step. Then front end will just have to aggregate few triangular partial B' products produced by however many reducers, directly (we don't save full BB' since it's symmetrical, nor do we compute full outer products of BB' there). That saves on full Bt I/O and avoids startup costs of BB' job. 

Thus, full job is 2 MR passes with q=0, 4MR passes with q=1 and 6MR passes with q=2. If understand Nathan's point right, 3 orthogonalizations (which corresponds to q=2) is quite enough.

V and U jobs are optional and running in parallel, so they can count for another iteration.

so with q=2 (maximum case) and U,V output requested we end up with 7 *sequential* MR iterations.


      was (Author: dlyubimov):
    PS it also just occurred to me that full B' does not have to be ever written out either because BB' can be accumulated in reducers of 2nd step. Then front end will just have to aggregate few triangular partial B' products produced by however many reducers, directly (we don't save full B' since it's symmetrical, nor do we compute full B). That saves on full Bt I/O and avoids startup costs of BB' job. 

Thus, full job is 2 MR passes with q=0, 4MR passes with q=1 and 6MR passes with q=2. If understand Nathan's point right, 3 orthogonalizations (which corresponds to q=2) is quite enough.

V and U jobs are optional and running in parallel, so they can count for another iteration.

so with q=2 (maximum case) and U,V output requested we end up with 7 *sequential* MR iterations.

  
> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Issue Comment Edited] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092385#comment-13092385 ] 

Dmitriy Lyubimov edited comment on MAHOUT-796 at 8/27/11 9:41 PM:
------------------------------------------------------------------

PS it also just occurred to me that full B' does not have to be ever written out either and can be computed in reducers of 2nd step. Then front end will just have to aggregate few triangular partial B' products produced by however many reducers, directly (we don't save full B' since it's symmetrical, nor do we compute full B). That saves on full Bt I/O and avoids startup costs of BB' job. 

Thus, full job is 2 MR passes with q=0, 4MR passes with q=1 and 6MR passes with q=2. If understand Nathan's point right, 3 orthogonalizations (which corresponds to q=2) is quite enough.

V and U jobs are optional and running in parallel, so they can count for another iteration.

so with q=2 (maximum case) and U,V output requested we end up with 7 *sequential* MR iterations.


      was (Author: dlyubimov):
    PS it also just occurred to me that full B' does not have to be ever written out either and can be computed in reducers of 2nd step. Then front end will just have to aggregate few triangular partial B' products produced by however many reducers, directly (we don't save full B' since it's symmetrical, nor do we compute full B). That saves on full Bt I/O and avoid startup costs of BB' job. 
  
> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Nathan Halko (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091533#comment-13091533 ] 

Nathan Halko commented on MAHOUT-796:
-------------------------------------

We did some work with facial recognition, computing 'eigenfaces' and reported in the paper.  In this case there is only 2 orders of magnitude between the signal and the 'noise'.  It shows a dramatic difference between the accuracy of one pass versus just one power iteration.  Note that after one power iteration, there is now 6 orders of magnitude separating signal and noise.  

But this is only looking at approximation error ||A-UU*A||.  It could very well be the case in recommendation applications that this measure is not appropriate, I don't know.  But it is a very valuable option to have at one's disposal just in case.

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Nathan Halko (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092179#comment-13092179 ] 

Nathan Halko commented on MAHOUT-796:
-------------------------------------

The reorthogonalizations aren't essential and if its a barrier to power iterations we should forego them at the moment.  A quick and dirty trick to avoid even sweeping through A again is to neglect the cross terms in the product (AA')^qA\Omega and just use (A_iA_i')^qA_i\Omega. This could be extremely naive but I've been getting some good results with it.  The accuracy typically falls about half way between single pass and full power iterations so it could be useful (although it could be dangerous as well).  

sig_51  <-  optimal
   60.6531

full power iters 1
   81.2668

full power iters 4
   67.5545

row-wise power iters 1
   89.4983

row-wise power iters 4
   82.2247

single pass
   92.8736

norm A
  100.0000 

The 'row-wise power iters' being (A_iA_i')^qA_i\Omega.  
  

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091436#comment-13091436 ] 

Dmitriy Lyubimov commented on MAHOUT-796:
-----------------------------------------

current code is essentially equivalent to running this mod with q=0.

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091529#comment-13091529 ] 

Ted Dunning commented on MAHOUT-796:
------------------------------------

{quote}
So are you encouraging to proceed, benchmark, compare? get more info? do it anyway? or drop it without comparisons?
{quote}

I will proceed with my latest family of computations without power iterations.  Along the way, I will look for ways to do them efficiently.  It took me months of looking at other things for the coin to drop on the fact that we don't have to keep the Q around for the QR decompositions, however, so it may be a bit before I figure out how to implement the power iteration really efficiently.


> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096264#comment-13096264 ] 

Dmitriy Lyubimov commented on MAHOUT-796:
-----------------------------------------

I am now convinced that all the inefficiency and cpu-bound behavior comes from the multiplications Q'B, AB'. Indeed, I generate outer products which i then output row by row and sum up in combiners and reducers. this creates tremendous pressure in terms of keys for spill and reduce sorts.

I think tremendous improvements would result if we output outer products of B' or AB' in vertical thin but long blocks. This seem like a trivial idea but I was preoccupied by 'proof of concept' issues whereas matrix multiplication has been seen as algorithmically trivial. 

the command line i used: 

bin/mahout ssvd --input /tmp/DRM/* --output /tmp/SSVD1 --tempDir /tmp/SSVD-tmp -k 10 -p 3 -q 1 -r 10000 --reduceTasks 3 -Dmapred.child.java.opts='-Xmx500m'

note also that I ran on CDH3u0 without any Mahout recompilation with CDH3 binaries and it seems to work out of the box.


> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>         Attachments: MAHOUT-796-2.patch, MAHOUT-796.patch
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091493#comment-13091493 ] 

Ted Dunning commented on MAHOUT-796:
------------------------------------

The outline that I have in MAHOUT-792 does exactly this without the iteration and has the framework
available to do the power iteration efficiently, including the out-of-core QR.

For most recommendation applications, I would be surprised if iterations are important, but at
least one user of the current system was surprised when their test matrix with a very low condition
number didn't get good results with the current algorithm.  It would be nice to be able to say
"turn the knob to get better results".

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094041#comment-13094041 ] 

Ted Dunning commented on MAHOUT-796:
------------------------------------

{quote}
At this point it seems that the best strategy is just to preload entire A block into memory as a (sparse) matrix and open B' stream as a side file and hope it is not going to generate too much flood i/o. I don't know a workaround for it anyway since whatever blocking scheme is used, we need cartesian products from both matrix inputs and that will cause i/o and i don't think there's any clever collocation trick to be had there
{quote}

Presumably there could be a role for the distributed cache here to make the I/O load more manageable.

This is just the sort of thing that the MapR ability to control placement, mirroring and to read via NFS comes in really, really handy.  Can't really assume that for Mahout, though.


> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-796:
------------------------------------

    Attachment: MAHOUT-796.patch

Patch. Local solver tests pass. I also tested multiple splits on sufficiently larger inputs and sparse inputs with local MR. 

I still need to test with yet bigger file with multiple reducers since local MR does not support multiple reducers.

What i noticed is that with just one additional power iteration with orthogonalization there's practically no need to run any oversampling (p). So yes power iteration runs more steps but runtime can be reduced significantly just because you don't need as wide projection anymore. small values are pretty good without much oversampling. 

Amazing.

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>         Attachments: MAHOUT-796.patch
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAHOUT-796) Modified power iterations in existing SSVD code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Lyubimov updated MAHOUT-796:
------------------------------------

    Status: Patch Available  (was: Open)

Patch attached. 
I will commit after another round of MR in distributed mode. Hopefully will be done by Tue.

> Modified power iterations in existing SSVD code
> -----------------------------------------------
>
>                 Key: MAHOUT-796
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-796
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.5
>            Reporter: Dmitriy Lyubimov
>            Assignee: Dmitriy Lyubimov
>              Labels: SSVD
>             Fix For: 0.6
>
>         Attachments: MAHOUT-796.patch
>
>
> Nathan Halko contacted me and pointed out importance of availability of power iterations and their significant effect on accuracy of smaller eigenvalues and noise attenuation. 
> Essentially, we would like to introduce yet another job parameter, q, that governs amount of optional power iterations. The suggestion how to modify the algorithm is outlined here : https://github.com/dlyubimov/ssvd-lsi/wiki/Power-iterations-scratchpad .
> Note that it is different from original power iterations formula in the paper in the sense that additional orthogonalization performed after each iteration. Nathan points out that that improves errors in smaller eigenvalues a lot (If i interpret it right). 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira