Posted to dev@mahout.apache.org by "Ted Dunning (JIRA)" <ji...@apache.org> on 2011/08/20 23:34:27 UTC

[jira] [Created] (MAHOUT-792) Add new stochastic decomposition code

Add new stochastic decomposition code
-------------------------------------

                 Key: MAHOUT-792
                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
             Project: Mahout
          Issue Type: New Feature
            Reporter: Ted Dunning


I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.

I will produce a patch that contains the following:

  - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.

  - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.

  - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-792:
-------------------------------

    Attachment: MAHOUT-792.patch

This is the current state: it has an in-memory implementation and an out-of-core implementation for review.  The out-of-core implementation needs to make better use of threads before it is production-ready.

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096570#comment-13096570 ] 

Lance Norskog edited comment on MAHOUT-792 at 9/4/11 12:28 AM:
---------------------------------------------------------------

Please run this:
{code}
public static void main(String[] args) {
    RandomTrinaryMatrix rtm = new RandomTrinaryMatrix(0, 30, 40, true);
    for (int row = 0; row < 30; row++) {
      for (int column = 0; column < 40; column++) {
        System.out.println(rtm.get(row, column));
      }
    }
    rtm = new RandomTrinaryMatrix(0, 30, 40, true);
    for (int row = 0; row < 30; row++) {
      for (int column = 0; column < 40; column++) {
        System.out.println(rtm.get(row, column));
      }
    }
  }
{code}


      was (Author: lancenorskog):
    Please run this:
{code}
public static void main(String[] args) {
    RandomTrinaryMatrix rtm = new RandomTrinaryMatrix(0, 30, 40, true);
    for (int row = 0; row < 30; row++) {
      for (int column = 0; column < 40; column++) {
        System.out.println(rtm.get(row, column));
      }
    }
    rtm = new RandomTrinaryMatrix(0, 30, 40, true);
    for (int row = 0; row < 30; row++) {
      for (int column = 0; column < 40; column++) {
        System.out.println(rtm.get(row, column));
      }
    }
  }
{code}
The "low-grade" mode does not emit 0. The high-grade mode does not emit negative numbers.

I'm curious: what is the math behind an even split of -1,0,1? The Achlioptas math says +1/-1 or +1/-1/0/0/0 . So, the low-grade mode is correct, just not what the comment describes :)

  
  
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102388#comment-13102388 ] 

Dmitriy Lyubimov commented on MAHOUT-792:
-----------------------------------------

So, this is not fully committed yet?

I don't see our version of the CholeskyDecomposition class on the trunk (only the apache-math one). Neither do I think I can see the SequentialOutOfCoreSVD etc. that seem to be present in the posted patch?

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092071#comment-13092071 ] 

Dmitriy Lyubimov commented on MAHOUT-792:
-----------------------------------------

bq. But A is sparse so Y, B and Q are all about the same (storage) size as A. In fact, if k+p > average number of elements per row of A, then Y, B and Q will all be larger than A.

That's true. I had one guy who had millions of rows but an average of only 10 measurements per row. Probably ratings or something. It is not going to be efficient (CPU-wise) in those cases.

But my response has always been: if you have input thinner than the projection, then why even use the projection? It's my understanding that the whole idea of this is to drastically reduce the input to analyze. The original paper actually never suggested computing BB', as far as I remember; that's something I did to open up n by sacrificing some rounding accuracy. In the original paper, they compute the SVD of B, so if B is bigger than the input, it would only cost more to compute its SVD. So that's how I understood it -- B is _supposed_ to be much smaller than A in the original work, otherwise there's not much sense in it.

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102403#comment-13102403 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

No.  I was waiting for comments.  If I can find time between packing this evening I will commit.

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114218#comment-13114218 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

If you look at my comments in sd-2, you will see that Y'Y is computed in one pass; then, in the pass where you compute Q'A using Y, I just recompute Y on the fly and solve using R to get a block of Q, also on the fly.  Since we are reading blocks of A anyway in this pass, it costs nothing except a small amount of computation to avoid storing Y.
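
For readers following along, here is a minimal self-contained sketch of that two-pass scheme in plain Java arrays rather than the Mahout API (the class and helper names are invented for illustration, and the "blocks" are just slices of an in-memory matrix): pass one accumulates Y'Y from row blocks of A, a Cholesky of Y'Y gives R, and pass two recomputes each block of Y on the fly, right-solves against R to get the corresponding block of Q, and accumulates B = Q'A.

{code}
import java.util.Random;

public class TwoPassSketch {
  public static void main(String[] args) {
    int m = 400, n = 60, kp = 20, blockRows = 100;
    Random gen = new Random(1);
    double[][] a = random(m, n, gen);          // stands in for the blocks of A read from disk
    double[][] omega = random(n, kp, gen);     // the projection matrix Omega

    // Pass 1 over A: accumulate Y'Y without ever storing Y.
    double[][] yty = new double[kp][kp];
    for (int start = 0; start < m; start += blockRows) {
      double[][] yBlock = times(rows(a, start, blockRows), omega);
      addInPlace(yty, atb(yBlock, yBlock));
    }
    double[][] r = choleskyUpper(yty);         // Y'Y = R'R

    // Pass 2 over A: recompute each Y block, solve Qblock * R = Yblock, accumulate B = Q'A.
    double[][] b = new double[kp][n];
    for (int start = 0; start < m; start += blockRows) {
      double[][] aBlock = rows(a, start, blockRows);
      double[][] qBlock = solveRight(times(aBlock, omega), r);
      addInPlace(b, atb(qBlock, aBlock));
    }
    System.out.println("B is " + b.length + " x " + b[0].length);  // an in-memory SVD of the small dense B finishes the job
  }

  static double[][] random(int rows, int cols, Random gen) {
    double[][] x = new double[rows][cols];
    for (int i = 0; i < rows; i++) for (int j = 0; j < cols; j++) x[i][j] = gen.nextGaussian();
    return x;
  }

  static double[][] rows(double[][] x, int start, int count) {     // one row block of x
    double[][] y = new double[count][];
    for (int i = 0; i < count; i++) y[i] = x[start + i];
    return y;
  }

  static double[][] times(double[][] x, double[][] y) {            // x * y
    double[][] z = new double[x.length][y[0].length];
    for (int i = 0; i < x.length; i++)
      for (int k = 0; k < y.length; k++)
        for (int j = 0; j < y[0].length; j++) z[i][j] += x[i][k] * y[k][j];
    return z;
  }

  static double[][] atb(double[][] x, double[][] y) {              // x' * y
    double[][] z = new double[x[0].length][y[0].length];
    for (int k = 0; k < x.length; k++)
      for (int i = 0; i < x[0].length; i++)
        for (int j = 0; j < y[0].length; j++) z[i][j] += x[k][i] * y[k][j];
    return z;
  }

  static void addInPlace(double[][] x, double[][] y) {
    for (int i = 0; i < x.length; i++) for (int j = 0; j < x[0].length; j++) x[i][j] += y[i][j];
  }

  static double[][] choleskyUpper(double[][] g) {                  // G = R'R with R upper triangular
    int n = g.length;
    double[][] r = new double[n][n];
    for (int i = 0; i < n; i++)
      for (int j = i; j < n; j++) {
        double s = g[i][j];
        for (int k = 0; k < i; k++) s -= r[k][i] * r[k][j];
        r[i][j] = (i == j) ? Math.sqrt(s) : s / r[i][i];
      }
    return r;
  }

  static double[][] solveRight(double[][] y, double[][] r) {       // back substitution: Q such that Q * R = Y
    double[][] q = new double[y.length][r.length];
    for (int i = 0; i < y.length; i++)
      for (int j = 0; j < r.length; j++) {
        double s = y[i][j];
        for (int k = 0; k < j; k++) s -= q[i][k] * r[k][j];
        q[i][j] = s / r[j][j];
      }
    return q;
  }
}
{code}

The point of the structure is exactly what is described above: the only state carried between the two passes is the small (k+p) x (k+p) matrix R, so Y itself never needs to hit disk.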



> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114425#comment-13114425 ] 

Dmitriy Lyubimov commented on MAHOUT-792:
-----------------------------------------

I like the trinary matrix idea. I would like to add it as an option in time. Indeed, it may save space for specific extra-sparse cases.

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114287#comment-13114287 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

{quote}
So we either have to save Y, or incur passes over big dataset (assuming A is much densier than Y).
{quote}

I think that this got mangled in the typing.  A is sparser than Y and may be larger or smaller depending on the average number of non-zero elements.  If A is binary or contains only small integers, then it could well be smaller on disk than Y if a continuous random matrix Omega is used.  If Omega contains only -1, 0, 1 (trinary values), then the values of Y should compress nearly as well as the values of A, and Y will be dense as well, so we don't have to store indices.
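
As a side note on why a trinary Omega is cheap to handle, here is a small hypothetical sketch (not the actual RandomTrinaryMatrix code; the hash constants are arbitrary mixing constants): each entry is derived deterministically from a seed and its (row, column) position, so Omega never has to be stored or shipped, only the seed does.  A uniform three-way split is used purely for illustration; the Achlioptas-style weighting discussed elsewhere in this thread would only change the final mapping step.

{code}
public class TrinaryOmegaSketch {
  // Derive a -1/0/+1 entry from (seed, row, column) without materializing the matrix.
  static int trinary(long seed, int row, int column) {
    long h = seed ^ (row * 0x9E3779B97F4A7C15L) ^ (column * 0xC2B2AE3D27D4EB4FL);
    h ^= (h >>> 33);
    h *= 0xFF51AFD7ED558CCDL;      // cheap avalanche step so neighboring cells decorrelate
    h ^= (h >>> 33);
    return (int) ((h >>> 1) % 3) - 1;
  }

  public static void main(String[] args) {
    int[] counts = new int[3];
    for (int row = 0; row < 300; row++) {
      for (int column = 0; column < 400; column++) {
        counts[trinary(42L, row, column) + 1]++;
      }
    }
    System.out.println("-1: " + counts[0] + "  0: " + counts[1] + "  +1: " + counts[2]);
  }
}
{code}

Because Omega then contributes only -1, 0, +1, the entries of Y stay small integers whenever A is binary or small-integer valued, which is why they compress well, as noted above.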

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096570#comment-13096570 ] 

Lance Norskog edited comment on MAHOUT-792 at 9/4/11 12:27 AM:
---------------------------------------------------------------

Please run this:
{code}
public static void main(String[] args) {
    RandomTrinaryMatrix rtm = new RandomTrinaryMatrix(0, 30, 40, true);
    for (int row = 0; row < 30; row++) {
      for (int column = 0; column < 40; column++) {
        System.out.println(rtm.get(row, column));
      }
    }
    rtm = new RandomTrinaryMatrix(0, 30, 40, true);
    for (int row = 0; row < 30; row++) {
      for (int column = 0; column < 40; column++) {
        System.out.println(rtm.get(row, column));
      }
    }
  }
{code}
The "low-grade" mode does not emit 0. The high-grade mode does not emit negative numbers.

I'm curious: what is the math behind an even split of -1,0,1? The Achlioptas math says +1/-1 or +1/-1/0/0/0 . So, the low-grade mode is correct, just not what the comment describes :)
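
For reference, a tiny sketch of the two Achlioptas-style choices being contrasted here (the weights below -- +-1 with probability 1/2 each, versus +1 and -1 with probability 1/6 each and 0 with probability 2/3 -- are my reading of that paper, not something asserted in this issue):

{code}
import java.util.Random;

public class AchlioptasSketch {
  static int dense(Random gen) {              // +1 or -1 with equal probability
    return gen.nextBoolean() ? 1 : -1;
  }

  static int sparse(Random gen) {             // +1 (1/6), -1 (1/6), 0 (2/3)
    int die = gen.nextInt(6);
    return die == 0 ? 1 : die == 1 ? -1 : 0;
  }

  public static void main(String[] args) {
    Random gen = new Random(0);
    int[] counts = new int[3];
    for (int i = 0; i < 60000; i++) {
      counts[sparse(gen) + 1]++;
    }
    System.out.println("-1: " + counts[0] + "  0: " + counts[1] + "  +1: " + counts[2]);
  }
}
{code}

An even -1/0/+1 split is a different distribution from either of these, which is what the question above is getting at.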

  

      was (Author: lancenorskog):
    Please run this:
{code}
public static void main(String[] args) {
    RandomTrinaryMatrix rtm = new RandomTrinaryMatrix(0, 30, 40, true);
    for (int row = 0; row < 30; row++) {
      for (int column = 0; column < 40; column++) {
        System.out.println(rtm.get(row, column));
      }
    }
    rtm = new RandomTrinaryMatrix(0, 30, 40, true);
    for (int row = 0; row < 30; row++) {
      for (int column = 0; column < 40; column++) {
        System.out.println(rtm.get(row, column));
      }
    }
  }
{code}
  
  
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092428#comment-13092428 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

It should be very good.

I have a few corrections that I need to post. That might be the cause of
your problem.

On Saturday, August 27, 2011, Dmitriy Lyubimov (JIRA) <ji...@apache.org> wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092404 ]

{quote}
test failure
testSingularValues(org.apache.mahout.math.ssvd.SequentialOutOfCoreSvdTest):
expected:<0.0> but was:<5250028.000000007>
{quote}


> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning resolved MAHOUT-792.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.6
         Assignee: Ted Dunning

Finally committed these.  This provides Cholesky decomposition and several in-memory implementations of SSVD.  These are deficient with respect to power iterations.
                
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.6
>
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning resolved MAHOUT-792.
--------------------------------

    Resolution: Fixed

The last few Jenkins runs confirm my expectation that this problem was a bug in the test case itself, which is now fixed.
                
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.6
>
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Reopened] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (Reopened) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning reopened MAHOUT-792:
--------------------------------


I don't think that this issue should have been closed.

                
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.6
>
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092099#comment-13092099 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

{quote}
But my response has always been, if you have input thinner than projection, then why even use projection?
{quote}
Because it is cheaper to process a thin dense matrix than a wide sparse one.

Likewise, even if B is bigger than A, it can be much cheaper to compute the SVD of B than of A, if only because the Cholesky trick works on the skinny dense case.  If the Cholesky decomposition loses some accuracy, then a QR or LQ decomposition could be used in the same way.
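
To spell out the kind of reduction being relied on here (a sketch in the notation of this thread, assuming BB' has full rank; not necessarily the exact route in sd-2): the SVD of the short, dense B only requires decomposing the small (k+p) x (k+p) Gram matrix,

{code}
BB' = U D U'        // eigendecomposition of the (k+p) x (k+p) matrix BB'
S   = D^(1/2)       // singular values of B
V   = B' U S^-1     // right singular vectors, so that B = U S V'
{code}

so the large dimension n only shows up in forming BB' and in one final multiply, which is why the skinny or short dense case is the easy one.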

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088516#comment-13088516 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

This presumes MAHOUT-790 and MAHOUT-793 and includes them in this patch.

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183827#comment-13183827 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

Jenkins succeeded after this change.  Good news.

I will try on an EC2 instance to see if I can reproduce it.

Hmm... https://builds.apache.org/job/mahout-nightly/752/console shows the evil test still failing.

Very confusing.

                
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.6
>
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123743#comment-13123743 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

Yes.  The new-stochastic branch is the one to go with.  It is a little (but not much) out of date.


                
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088777#comment-13088777 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

Github branch MAHOUT-792 available at git://github.com/tdunning/mahout.git

That provides more details on successive changes.

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183275#comment-13183275 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

My current theory is that the problem has to do with the ordering of the result of File.listFiles.  I just committed a cleanup fix.  The cleanups were important in any case.

Let's see what Jenkins has to say.
                
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.6
>
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114109#comment-13114109 ] 

Dmitriy Lyubimov edited comment on MAHOUT-792 at 9/25/11 2:27 AM:
------------------------------------------------------------------

Another thing that I realized while working on MAHOUT-797 is that this execution plan requires three passes over the input A instead of the existing one pass, if we are not saving Y (first to compute Y'Y, second to compute B, and third to compute U).

So we either have to save Y, or incur extra passes over a big dataset (assuming A is much denser than Y). 

And if we do save Y, then we swap Q for Y, which are of the same size.

So... it looks like we have got to save Y... or it is not worth it.

      was (Author: dlyubimov):
Another thing that I realized while working on MAHOUT-797 is that this execution plan requires three passes over the input A instead of the existing one pass, if we are not saving Y (first to compute Y'Y, second to compute B, and third to compute U).

So we either have to save Y, or incur extra passes over a big dataset (assuming A is much denser than Y). 

And if we do save Y, then we swap Q for Y, and Y would be much bigger than Q if A is tall enough.

Potentially we are not saving intermediate Q blocks in the middle (for which we can reduce replication), but in my task, running 3 times over A actually increases running time.

So there are probably cases where this execution plan would be sufficiently preferable, but it seems that A must be super-sparse and wide (which contradict each other).
  
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186623#comment-13186623 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

OK.  I think I have this failing test corralled.  The problem was that the test case was reading the results and assuming the blocks in the result matrix would be sorted.  This is true on the Mac and sometimes true on Ubuntu.  Once it fails, it is pretty consistent.

I put a sort into the test and all seems well.  I replicated the failure 3 times in a row, added the sort, and verified that the failure did not occur on multiple runs.  I removed the sort and the failure returned consistently.

A fix is forthcoming shortly, as soon as I do the same tests on Ubuntu 10 as well as 11.
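
For anyone hitting the same thing, the shape of the fix is simply to impose an order before reading the blocks back, along these lines (illustrative names, not the actual test code):

{code}
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

public class SortedBlockListing {
  public static void main(String[] args) {
    // File.listFiles() makes no ordering guarantee, so sort by name before reassembling blocks.
    File[] blocks = new File(args.length > 0 ? args[0] : ".").listFiles();
    if (blocks == null) {
      return;                                  // not a directory
    }
    Arrays.sort(blocks, new Comparator<File>() {
      @Override
      public int compare(File f1, File f2) {
        return f1.getName().compareTo(f2.getName());
      }
    });
    for (File block : blocks) {
      System.out.println(block.getName());     // read the blocks in this deterministic order
    }
  }
}
{code}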
                
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.6
>
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Dan Brickley (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13123652#comment-13123652 ] 

Dan Brickley commented on MAHOUT-792:
-------------------------------------

Looking to try this, I tried applying the last patch per https://cwiki.apache.org/MAHOUT/how-to-contribute.html#HowToContribute-Applyingapatch  i.e. 'patch -p 0 -i MAHOUT-792.patch', and got a pile of 'Reversed (or previously applied) patch detected' messages and a 'can't find file to patch'. Is some but not all of this now committed? 

Per "Github branch MAHOUT-792 available at git://github.com/tdunning/mahout.git", I'm looking around https://github.com/tdunning/mahout but there's no branch listed with that name.  Presumably https://github.com/tdunning/mahout/tree/new-stochastic is the place to go?

                
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096733#comment-13096733 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

Lance,

Can you say why?  What is it that one would see here?



> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091944#comment-13091944 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

{quote}
One thing that i don't understand is how you are proposing to derive this R^-1. To do Cholesky on Y'Y, take transpose and inverse and declare it R^-1?
{quote}
This is just a notational deficiency.  The CholeskyDecomposition has solveLeft and solveRight methods that use back or forward substitution to effectively right- or left-multiply by the inverse of R (or R').  I find my current naming confusing and think that it might be better to have methods timesLinv, timesRinv, leftTimesLinv, and leftTimesRinv, or some such, because that would match the actual mathematical exposition better.   If you have a good idea for naming, please suggest it.
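
To illustrate the idea for anyone reading along (a generic sketch, not the actual Mahout method signatures): "multiplying by R^-1 or R'^-1" is done by triangular substitution, never by forming an explicit inverse.  For example, solving R'x = y by forward substitution:

{code}
public class TriangularSolveSketch {
  // r is upper triangular, so r' is lower triangular; solve r' x = y row by row.
  static double[] forwardSolve(double[][] r, double[] y) {
    int n = y.length;
    double[] x = new double[n];
    for (int i = 0; i < n; i++) {
      double s = y[i];
      for (int k = 0; k < i; k++) {
        s -= r[k][i] * x[k];      // (r')[i][k] == r[k][i]
      }
      x[i] = s / r[i][i];
    }
    return x;
  }

  public static void main(String[] args) {
    double[][] r = {{2, 1, 1}, {0, 3, 2}, {0, 0, 4}};
    double[] y = {2, 7, 15};
    double[] x = forwardSolve(r, y);
    System.out.println(x[0] + " " + x[1] + " " + x[2]);   // 1.0 2.0 2.5
  }
}
{code}

The right-multiplication case (solving Q R = Y one row at a time) is the same trick run over columns.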

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114109#comment-13114109 ] 

Dmitriy Lyubimov commented on MAHOUT-792:
-----------------------------------------

Another thing that I realized while working on MAHOUT-797 is that this execution plan requires three passes over the input A instead of the existing one pass, if we are not saving Y (first to compute Y'Y, second to compute B, and third to compute U).

So we either have to save Y, or incur extra passes over a big dataset (assuming A is much denser than Y). 

And if we do save Y, then we swap Q for Y, and Y would be much bigger than Q if A is tall enough.

Potentially we are not saving intermediate Q blocks in the middle (for which we can reduce replication), but in my task, running 3 times over A actually increases running time.

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089635#comment-13089635 ] 

Dmitriy Lyubimov commented on MAHOUT-792:
-----------------------------------------

a -- It looks like it affects a lot of not-very-relevant stuff (GradientMachine, AbstractClassifier files)?

b -- Ted, are you going to implement an MR version? Do you have a parallelization strategy?

Thanks.


> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114282#comment-13114282 ] 

Dmitriy Lyubimov commented on MAHOUT-792:
-----------------------------------------

Yes... and that's what I mean. I was implementing it as Y-on-the-fly. But that implies a full pass over A every time we need access to Y, and we need it 3 times with no option to parallelize. That's why I think I need to save it.

Also, I am thinking one step ahead to power iterations. In that chain, Yi is AB', and there's no way to compute that on the fly. So, for a first stab at it, it would make my life easier to save Y with a low degree of replication, and then use the rest of the pipeline the same way regardless of i. Besides, I think that saving Y is supposed to make things more efficient, not less, in the most generic cases, assuming A >> Y in volume.

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091927#comment-13091927 ] 

Dmitriy Lyubimov edited comment on MAHOUT-792 at 8/26/11 6:38 PM:
------------------------------------------------------------------

more thoughts: 

1 -- it looks like this method compresses a lot of information into the (k+p) x (k+p) upper-triangular R. I guess there's more potential for rounding errors there (though I am not buying too much into the rounding-error concern).

2 -- The argument that the Q matrix is large is not very convincing to me. Q is m x (k+p); assuming A is much wider (in terms of the average number of non-zero elements per row), which it should be for the projection method to make sense at all, Q is only a fraction of A. Besides, Q is written in blocks by mappers and read back by mappers, so each mapper sees only one block of size, say, 30,000 x (k+p) (and it will not even be that big if A is sufficiently wide), which for k+p=500 amounts to about 120MB per mapper (arithmetic spelled out below). So there's not that much actual memory pressure; that's what distributed computation is for. The same goes for dense operations; we just distribute them.

3 -- it looks like you are writing B as the output of the second MR, and B is (k+p) x n. Going back to the 'big Q' argument, we can't assume that B would be any smaller. In fact, some time ago I came to the conclusion that the projection method would be much more efficient if the input is wide rather than tall, since the compression factor of the projection is much better for what seems to be a fairly inexpensive operation (it costs nothing to redistribute Omega, which is backed by a single long used as a random-generator seed, as opposed to actual long or wide bands such as B or Q). So we can't exclude very wide inputs.


Overall it looks like a great improvement. I am not convinced that it would cut processing time that much (it seems to have the same number of proposed MR steps, but all of them would require a shuffle-and-sort and a reduce phase, whereas the QR approach is a reduce-less process), but it definitely reduces the complexity of the MR implementation and I would be very eager to try it out. Certainly all I said is first impression and intuition only; and in my experience intuition turns out to be wrong surprisingly often as far as benchmark guesstimates are concerned.
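
For concreteness, the arithmetic behind the 120MB estimate in point 2, assuming a dense 30,000 x (k+p) block of 8-byte doubles with k+p = 500:

{code}
30{,}000 \times 500 \times 8\ \text{bytes} = 1.2 \times 10^{8}\ \text{bytes} \approx 120\ \text{MB}.
{code}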

      was (Author: dlyubimov):
    more thoughts: 

1 -- it looks like this method is compressing a lot of info into k+p upper-triangular R. I guess there's more potential for rounding errors there. (I am not buying too much into rounding errors though). 

2 -- Argument of Q matrix being large is not very convincing to me. Q is n x (k+p), that is, assuming A is much wider (in terms of non-zero element average #) which it should be in order for projection method to make sense at all, it is a fraction of A. Besides, it is written in blocks by mappers, and read by mappers, so each mapper sees only one block of size say 30,000 x (k+p), which it will not be, assuming A is sufficiently wide, and assuming k+p=500, amounts to 120mb per mapper. So there's not so much actual memory pressure, that's what distributed computations are for. Same goes for dense operations, we just distribute them. 

3 -- it looks like you are writing B as output of the second MR, which is (k+p) x n. Going back to argument of a 'big Q', we can't assume that B would be any less. In fact, some time ago i came to conclusion that it looks like projection method would be much more efficient if input is wide rather than tall since projection compression factor would be much better for what seems to be fairly inexpensive operation (since it costs nothing to redistribute Omega which is only backed by one long number as a random gen seed, as opposed to actual long or wide bands such as B or Q).  So we can't exclude very wide inputs. 


Overall it looks like a great improvement. I am not convinced that it would cut processing time that much (it looks it has the same amount of proposed MR steps but it looks like all of them would require shuffle-and-sort and reduce phase, whereas with QR is a reduceless process), but it definitely reduces complexity of MR implementation and i would be very eager to try it out. Certainly all i said is the first impression  and intuition only; and in my experience intuition turns out to be wrong surprisingly often as far as benchmark guesstimates are concerned.
  
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091902#comment-13091902 ] 

Dmitriy Lyubimov commented on MAHOUT-792:
-----------------------------------------

This looks like a very promising shortcut for MR operationally. 

AFAICT, you suggest computing Y'Y, deriving R^-1 from it, and using that in another pass to derive B. Y'Y is indeed operationally much easier to accumulate and produce with MR in the same projection step. 

One thing that I don't understand is how you propose to derive this R^-1. Do a Cholesky decomposition of Y'Y, then take the transpose of the factor, invert it, and declare that R^-1?
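
One way to read the trick, as a minimal dense, in-memory sketch (the class name and the hand-rolled Cholesky below are illustrative only, not code from the attached patch): accumulate Y'Y and Y'A in a single scan over the rows of A, take the Cholesky factor L of Y'Y so that R = L', and then get B = Q'A = inv(L) * Y'A by forward substitution, without ever forming an explicit inverse or re-reading Y.

{code}
import java.util.Random;

public class CholeskyTrickSketch {
  public static void main(String[] args) {
    int m = 200, n = 50, kp = 10;                 // kp stands for k + p
    Random rnd = new Random(42);

    double[][] a = new double[m][n];              // input matrix A (dense here for simplicity)
    double[][] omega = new double[n][kp];         // random projection Omega
    for (double[] row : a) for (int j = 0; j < n; j++) row[j] = rnd.nextGaussian();
    for (double[] row : omega) for (int q = 0; q < kp; q++) row[q] = rnd.nextGaussian();

    double[][] yty = new double[kp][kp];          // accumulates Y'Y
    double[][] yta = new double[kp][n];           // accumulates Y'A

    // Single scan over A: y_i = a_i * Omega, followed by rank-1 updates of Y'Y and Y'A.
    for (int i = 0; i < m; i++) {
      double[] y = new double[kp];
      for (int j = 0; j < n; j++)
        for (int q = 0; q < kp; q++) y[q] += a[i][j] * omega[j][q];
      for (int u = 0; u < kp; u++) {
        for (int q = 0; q < kp; q++) yty[u][q] += y[u] * y[q];
        for (int j = 0; j < n; j++) yta[u][j] += y[u] * a[i][j];
      }
    }

    // Cholesky: Y'Y = L L', so the R of the (never materialized) QR of Y is L'.
    double[][] l = new double[kp][kp];
    for (int i = 0; i < kp; i++) {
      for (int j = 0; j <= i; j++) {
        double s = yty[i][j];
        for (int q = 0; q < j; q++) s -= l[i][q] * l[j][q];
        l[i][j] = (i == j) ? Math.sqrt(s) : s / l[j][j];
      }
    }

    // B = inv(R') * Y'A = inv(L) * Y'A, solved column by column by forward substitution.
    double[][] b = new double[kp][n];
    for (int j = 0; j < n; j++) {
      for (int i = 0; i < kp; i++) {
        double s = yta[i][j];
        for (int q = 0; q < i; q++) s -= l[i][q] * b[q][j];
        b[i][j] = s / l[i][i];
      }
    }
    System.out.println("B is " + b.length + " x " + b[0].length);
  }
}
{code}

The operationally nice property is that Y'Y is only (k+p) x (k+p), so it is trivially combinable in the projection step, and Q itself never has to be materialized.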

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114109#comment-13114109 ] 

Dmitriy Lyubimov edited comment on MAHOUT-792 at 9/25/11 1:15 AM:
------------------------------------------------------------------

Another thing that I realized while working on MAHOUT-797 is that this execution plan requires 3 passes over the input A instead of the existing 1 pass if we are not saving Y (the first pass to compute Y'Y, the second to compute B, and the third to compute U).

So we either have to save Y, or incur extra passes over a big dataset (assuming A is much denser than Y). 

And if we do save Y, then we swap Q for Y, and Y would be much bigger than Q if A is tall enough.

Potentially we are not saving the intermediate Q blocks in the middle (for which we can reduce replication), but in my task running 3 times over A actually increases running time.

So there are probably cases where this execution plan would be sufficiently preferable, but it seems that A must be super-sparse and wide, requirements that somewhat contradict each other.

      was (Author: dlyubimov):
    Another thing that I realized while working on MAHOUT-797, is that this execution plan requires to pass 3 times over input of A instead of existing A, if we are not saving Y. (first time to comute Y'Y, second time to compute B, and third time to compute U).

So we either have to save Y, or incur passes over big dataset (assuming A is much densier than Y). 

And we we do save Y then we swap Q for Y, and Y would be much bigger than Q if A is tall enough.

Potentially we are not saving intermediate Q blocks in the middle(which we can reduce replication for), but in my task running 3 times over A actually increases running time.
  
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183830#comment-13183830 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

This is not obviously an Ubuntu portability issue.  We have failures on ubuntu4 and ubuntu5, successes on ubuntu3 and solaris1.  And then a failure on ubuntu3.

                
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.6
>
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186685#comment-13186685 ] 

Hudson commented on MAHOUT-792:
-------------------------------

Integrated in Mahout-Quality #1309 (See [https://builds.apache.org/job/Mahout-Quality/1309/])
    MAHOUT-792 - Forced correct block ordering in out-of-core SVD.  Hopefully addresses ubuntu test failures.  Also forced file closing.

tdunning : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1231800
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/ssvd/SequentialOutOfCoreSvd.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/ssvd/SequentialOutOfCoreSvdTest.java

                
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.6
>
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092001#comment-13092001 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

{quote}
1 – it looks like this method is compressing a lot of info into k+p upper-triangular R. I guess there's more potential for rounding errors there. (I am not buying too much into rounding errors though).
{quote}

This is absolutely true.  On the other hand, this is only really putting as much information into that matrix as we ultimately want back out (i.e. we *are* reducing dimension here).

{quote}
2 – Argument of Q matrix being large is not very convincing to me. Q is m x (k+p), that is, assuming A is much wider (in terms of non-zero element average #) which it should be in order for projection method to make sense at all, it is a fraction of A. 
{quote}
But A is sparse so Y, B and Q are all about the same (storage) size as A.  In fact, if k+p > average number of elements per row of A, then Y, B and Q will all be *larger* than A.

{quote}
Besides, it is written in blocks by mappers, and read by mappers, so each mapper sees only one block of size say 30,000 x (k+p), which it will not be, assuming A is sufficiently wide, and assuming k+p=500, amounts to 120mb per mapper. So there's not so much actual memory pressure, that's what distributed computations are for. Same goes for dense operations, we just distribute them.
{quote}
Of course.  But using the Cholesky trick means that Q'A can be computed by reading only A instead of both Q and A (see the identity spelled out at the end of this comment).

{quote}
3 – it looks like you are writing B as output of the second MR, which is (k+p) x n. Going back to argument of a 'big Q', we can't assume that B would be any less. 
{quote}
I completely agree.  B is too big to store all in memory and could easily be comparable to or larger than A.

{quote}
In fact, some time ago i came to conclusion that it looks like projection method would be much more efficient if input is wide rather than tall since projection compression factor would be much better for what seems to be fairly inexpensive operation (since it costs nothing to redistribute Omega which is only backed by one long number as a random gen seed, as opposed to actual long or wide bands such as B or Q). So we can't exclude very wide inputs.
{quote}
Indeed.

The common web-case could easily wind up that way.  You might have a million users exploring a billion web pages.  

{quote}
Overall it looks like a great improvement. I am not convinced that it would cut processing time that much (it looks it has the same amount of proposed MR steps but it looks like all of them would require shuffle-and-sort and reduce phase, whereas with QR is a reduceless process), but it definitely reduces complexity of MR implementation
{quote}
This is my thought exactly.  It also decreases the memory load of in-memory implementations, which is good.  But as you say later, intuition is extraordinarily dangerous for run-times.
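
The identity behind the "reading only A" remark above, in standard notation (assuming Y = A\Omega with an implicit thin QR factorization Y = QR):

{code}
B = Q^\top A = R^{-\top} Y^\top A = L^{-1} (A\Omega)^\top A,
\qquad \text{where } Y^\top Y = L L^\top = R^\top R,
{code}

so everything on the right-hand side can be accumulated while streaming A, with Omega regenerated from its seed.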


> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Updated] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-792:
-------------------------------

    Attachment: MAHOUT-792.patch

Updated with small fixes.  The tests are enough better now that I am beginning to believe that this might be correct.

Github has been updated as well.

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-792:
-------------------------------

    Attachment: sd-2.pdf

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092404#comment-13092404 ] 

Dmitriy Lyubimov commented on MAHOUT-792:
-----------------------------------------

Ted, 
what's the degree of stability of the CholeskyDecomposition code? I am getting a test failure:

{code}
Failed tests: 
  testSingularValues(org.apache.mahout.math.ssvd.SequentialOutOfCoreSvdTest): expected:<0.0> but was:<5250028.000000007>
{code}

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175254#comment-13175254 ] 

Hudson commented on MAHOUT-792:
-------------------------------

Integrated in Mahout-Quality #1265 (See [https://builds.apache.org/job/Mahout-Quality/1265/])
    MAHOUT-792 - More small fixes.  Added internal exception class for CholeskyDecomp
MAHOUT-792 - Made tests use odd sizes to detect row/column confusion.  Fixed small errors in out of core SVD
MAHOUT-792 - New SSVD codes.

MAHOUT-792 - Copyright fixes for Cholesky
MAHOUT-792 - Additional test for QR decomposition.
MAHOUT-792 - Cholesky decomposition

MAHOUT-792 - New Cholesky decomposition for SSVD update.

MAHOUT-792 - Switch Cholesky to views.

tdunning : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1222517
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/ssvd/SequentialOutOfCoreSvd.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/CholeskyDecomposition.java

tdunning : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1222516
Files : 
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/ssvd/SequentialOutOfCoreSvdTest.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/ssvd/SequentialBigSvd.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/ssvd/SequentialBigSvdTest.java

tdunning : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1222515
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/ssvd
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/ssvd/SequentialOutOfCoreSvd.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/ssvd
* /mahout/trunk/core/src/test/java/org/apache/mahout/math/ssvd/SequentialOutOfCoreSvdTest.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/CholeskyDecomposition.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/ssvd/SequentialBigSvd.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/CholeskyDecompositionTest.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/ssvd
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/ssvd/SequentialBigSvdTest.java

tdunning : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1222514
Files : 
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/QRDecompositionTest.java

tdunning : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1222513
Files : 
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/CholeskyDecomposition.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/CholeskyDecompositionTest.java

                
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.6
>
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096776#comment-13096776 ] 

Lance Norskog commented on MAHOUT-792:
--------------------------------------

Yes, bad habit. The "low-grade" mode does not emit 0. The "high-grade" mode does not emit negative numbers.

I'm curious: what is the math behind an even split of -1,0,1? The Achlioptas math says +1/-1 or +1/-1/0/0/0 are OK. So, the low-grade mode is proven correct, just not what the comment describes :)
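
For reference, a small sketch of the two distributions from the Achlioptas paper (the 1/2 and 1/6, 2/3, 1/6 probabilities and the sqrt(3) scaling are from that paper; the class below is purely illustrative and is not the RandomTrinaryMatrix code):

{code}
import java.util.Random;

public class AchlioptasEntries {
  // Dense variant: +1 or -1, each with probability 1/2.
  static double denseEntry(Random rnd) {
    return rnd.nextBoolean() ? 1.0 : -1.0;
  }

  // Sparse variant: sqrt(3) with probability 1/6, -sqrt(3) with probability 1/6, 0 with probability 2/3.
  static double sparseEntry(Random rnd) {
    int u = rnd.nextInt(6);              // uniform on {0, ..., 5}
    if (u == 0) return Math.sqrt(3.0);
    if (u == 1) return -Math.sqrt(3.0);
    return 0.0;
  }

  public static void main(String[] args) {
    Random rnd = new Random(1);
    int plus = 0, minus = 0, zero = 0;
    for (int i = 0; i < 600000; i++) {
      double v = sparseEntry(rnd);
      if (v > 0) plus++; else if (v < 0) minus++; else zero++;
    }
    System.out.println("+sqrt(3): " + plus + "  -sqrt(3): " + minus + "  0: " + zero);
  }
}
{code}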

  

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114286#comment-13114286 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

{quote}
Yes... and that's what I mean. I was implementing it as Y-on-the-fly. But that implies full pass over A every time we need access to Y, and we need it 3 times without option to parallelize. That's why I think I need to save it.
{quote}

Without the power iterations, every time that Y is needed, A is also scanned.  That means that Y-on-the-fly is fine.

But I think that saving Y is a fine idea.  It should usually (but, surprisingly, not always) be smaller than A.  It is the same size as Q.
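
To make "Y on the fly" concrete, here is a small sketch under the assumption that Omega is fully determined by a single seed (the hashing scheme and names below are illustrative, not the scheme used in the existing SSVD code):

{code}
import java.util.Random;

public class YOnTheFlySketch {
  // Deterministic Omega entry for a given (seed, row, column); illustrative hashing only.
  static double omegaEntry(long seed, int row, int col) {
    long h = seed ^ (row * 0x9E3779B97F4A7C15L) ^ (col * 0xC2B2AE3D27D4EB4FL);
    return new Random(h).nextGaussian();
  }

  // Recomputes y_i = a_i * Omega for one dense row of A during any scan of A,
  // so neither Omega nor Y has to be materialized.
  static double[] projectRow(double[] aRow, long seed, int kp) {
    double[] y = new double[kp];
    for (int j = 0; j < aRow.length; j++) {
      if (aRow[j] == 0.0) continue;      // cheap skip for sparse rows
      for (int q = 0; q < kp; q++) {
        y[q] += aRow[j] * omegaEntry(seed, j, q);
      }
    }
    return y;
  }
}
{code}

Without power iterations every pass that needs Y is already a pass over A, so this costs nothing extra; once Y_i depends on B_{i-1}, as discussed above, that stops being true and saving Y (or B) starts to pay off.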

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13091927#comment-13091927 ] 

Dmitriy Lyubimov commented on MAHOUT-792:
-----------------------------------------

more thoughts: 

1 -- it looks like this method is compressing a lot of info into k+p upper-triangular R. I guess there's more potential for rounding errors there. (I am not buying too much into rounding errors though). 

2 -- Argument of Q matrix being large is not very convincing to me. Q is n x (k+p), that is, assuming A is much wider (in terms of non-zero element average #) which it should be in order for projection method to make sense at all, it is a fraction of A. Besides, it is written in blocks by mappers, and read by mappers, so each mapper sees only one block of size say 30,000 x (k+p), which it will not be, assuming A is sufficiently wide, and assuming k+p=500, amounts to 120mb per mapper. So there's not so much actual memory pressure, that's what distributed computations are for. Same goes for dense operations, we just distribute them. 

3 -- it looks like you are writing B as output of the second MR, which is (k+p) x n. Going back to argument of a 'big Q', we can't assume that B would be any less. In fact, some time ago i came to conclusion that it looks like projection method would be much more efficient if input is wide rather than tall since projection compression factor would be much better for what seems to be fairly inexpensive operation (since it costs nothing to redistribute Omega which is only backed by one long number as a random gen seed, as opposed to actual long or wide bands such as B or Q).  So we can't exclude very wide inputs. 


Overall it looks like a great improvement. I am not convinced that it would cut processing time that much (it looks it has the same amount of proposed MR steps but it looks like all of them would require shuffle-and-sort and reduce phase, whereas with QR is a reduceless process), but it definitely reduces complexity of MR implementation and i would be very eager to try it out. Certainly all i said is the first impression  and intuition only; and in my experience intuition turns out to be wrong surprisingly often as far as benchmark guesstimates are concerned.

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095251#comment-13095251 ] 

Hudson commented on MAHOUT-792:
-------------------------------

Integrated in Mahout-Quality #1012 (See [https://builds.apache.org/job/Mahout-Quality/1012/])
    MAHOUT-790 - kill the cardinality array and size() for matrices.  Use rowSize() and columnSize() instead.

MAHOUT-792 - Fix RTM to avoid size() and cardinality array.

tdunning : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1164016
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/als/eval/InMemoryFactorizationEvaluator.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/bayes/InMemoryBayesDatastore.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/discriminative/LinearTrainer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/NaiveBayesModel.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/naivebayes/training/TrainUtils.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/classifier/sequencelearning/hmm/HmmUtils.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/MatrixWritable.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/decomposer/EigenVerificationJob.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/math/hadoop/stochasticsvd/UpperTriangular.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/hadoop/als/ParallelALSFactorizationJobTest.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/AbstractMatrix.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/DenseMatrix.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/DiagonalMatrix.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/Matrix.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/MatrixView.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/PermutedVectorView.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/PivotedMatrix.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/RandomAccessSparseVector.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/RandomTrinaryMatrix.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/SequentialAccessSparseVector.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/SparseColumnMatrix.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/SparseMatrix.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/SparseRowMatrix.java
* /mahout/trunk/math/src/main/java/org/apache/mahout/math/decomposer/hebbian/HebbianSolver.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/AbstractTestVector.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/MatrixTest.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/TestMatrixView.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/TestSparseColumnMatrix.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/TestSparseMatrix.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/TestSparseRowMatrix.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/TestVectorView.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/als/AlternateLeastSquaresSolverTest.java
* /mahout/trunk/math/src/test/java/org/apache/mahout/math/decomposer/SolverTest.java


> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096570#comment-13096570 ] 

Lance Norskog commented on MAHOUT-792:
--------------------------------------

Please run this:
{code}
import org.apache.mahout.math.RandomTrinaryMatrix;

public class RandomTrinaryMatrixDump {
  // Wrapper class added only so the snippet compiles standalone; it prints every
  // entry of two RandomTrinaryMatrix instances built with identical arguments so
  // the emitted values can be compared.
  public static void main(String[] args) {
    RandomTrinaryMatrix rtm = new RandomTrinaryMatrix(0, 30, 40, true);
    for (int row = 0; row < 30; row++)
      for (int column = 0; column < 40; column++)
        System.out.println(rtm.get(row, column));
    rtm = new RandomTrinaryMatrix(0, 30, 40, true);
    for (int row = 0; row < 30; row++)
      for (int column = 0; column < 40; column++)
        System.out.println(rtm.get(row, column));
  }
}
{code}
  

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13144452#comment-13144452 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

Committed this new code.
                
> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-792) Add new stochastic decomposition code

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089659#comment-13089659 ] 

Ted Dunning commented on MAHOUT-792:
------------------------------------

On (a), the patch includes MAHOUT-790 since it depends on it.  When I commit that JIRA, this patch will be easier to read.

On (b), see the paper.  Yes.

> Add new stochastic decomposition code
> -------------------------------------
>
>                 Key: MAHOUT-792
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-792
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ted Dunning
>         Attachments: MAHOUT-792.patch, MAHOUT-792.patch, sd-2.pdf
>
>
> I have figured out some simplification for our SSVD algorithms.  This eliminates the QR decomposition and makes life easier.
> I will produce a patch that contains the following:
>   - a CholeskyDecomposition implementation that does pivoting (and thus rank-revealing) or not.  This should actually be useful for solution of large out-of-core least squares problems.
>   - an in-memory SSVD implementation that should work for matrices up to about 1/3 of available memory.
>   - an out-of-core SSVD threaded implementation that should work for very large matrices.  It should take time about equal to the cost of reading the input matrix 4 times and will require working disk roughly equal to the size of the input.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira