Posted to user@mahout.apache.org by Ted Dunning <te...@gmail.com> on 2009/09/25 20:25:10 UTC

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Isabel,

Very interesting post.  Here are more accessible resources:

http://arxiv.org/abs/0909.4061
http://www.pnas.org/content/104/51/20167

These provide a very interesting and solid link between random indexing and
SVD algorithms.  They also definitely provide a fantastic way to implement
large scale SVD using map-reduce.

Nice pointer!

2009/9/25 Michael Brückner <mi...@cs.uni-potsdam.de>

>
> year's NIPS (http://nips.cc/Conferences/2009/Program/event.php?ID=1491)




-- 
Ted Dunning, CTO
DeepDyve

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Ted Dunning <te...@gmail.com>.
Doh!

Brilliantly stupid way to make a seekable PRNG.

Everything that I was looking for.  And less!

On Mon, Sep 28, 2009 at 1:05 PM, Jake Mannix <ja...@gmail.com> wrote:

>
> Considering the PRNG is just needed to get vectors which aren't linearly
> dependent, it seems like even using a completely random-access random
> stream (like M_i,j = new Random(seed ^ (i * numRows + j)).nextInt() )
> should work, requiring no memory at all.




-- 
Ted Dunning, CTO
DeepDyve

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Ted Dunning <te...@gmail.com>.
Probably better to save it.  Probably also depends on the problem set.  If A
has fewer non-zero elements than y, then recomputing y may be faster than
retrieving a saved copy.  Even if y and A have about the same number of
non-zero elements, recomputing y might still be faster because you save the
writing cost.

But I really don't know which way this should go in the end.

On Mon, Sep 28, 2009 at 11:49 PM, Jake Mannix <ja...@gmail.com> wrote:

> This is true, but is recomputing Y on the second pass worth not saving it?




-- 
Ted Dunning, CTO
DeepDyve

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Ted Dunning <te...@gmail.com>.
I don't see that, actually.  In fact, for many data-sets it is better to
decompose (AA') A to get better convergence, which seems to add at least one
more pass through the data.
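
Concretely, the refinement replaces Y = A R with Y = A (A' (A R)), which sharpens
the decay of the singular values captured by Y.  A tiny in-memory sketch in plain
Java (purely illustrative, nothing to do with the actual Hadoop job; all names are
made up) showing where the extra passes come from:

import java.util.Random;

// Illustrative sketch of the power-refined sampling step: Y = (A A') A R.
// Each multiplication by A or A' corresponds to another pass over the data
// in a map-reduce setting, which is why this costs at least one more pass.
public class PowerRefinedSample {

  static double[][] multiply(double[][] x, double[][] y) {
    int n = x.length, m = y[0].length, inner = y.length;
    double[][] result = new double[n][m];
    for (int i = 0; i < n; i++) {
      for (int k = 0; k < inner; k++) {
        double xik = x[i][k];
        if (xik == 0.0) continue;              // cheap skip for sparse-ish rows
        for (int j = 0; j < m; j++) {
          result[i][j] += xik * y[k][j];
        }
      }
    }
    return result;
  }

  static double[][] transpose(double[][] x) {
    double[][] t = new double[x[0].length][x.length];
    for (int i = 0; i < x.length; i++) {
      for (int j = 0; j < x[0].length; j++) {
        t[j][i] = x[i][j];
      }
    }
    return t;
  }

  // Y = (A A') A R: three multiplications by A or A', i.e. three passes over A.
  static double[][] powerRefinedSample(double[][] a, double[][] r) {
    double[][] y = multiply(a, r);             // pass 1: A R
    double[][] z = multiply(transpose(a), y);  // pass 2: A' (A R)
    return multiply(a, z);                     // pass 3: A (A' A R)
  }

  public static void main(String[] args) {
    Random rand = new Random(42);
    double[][] a = new double[100][50];        // small sparse-ish stand-in for A
    for (double[] row : a) {
      for (int j = 0; j < row.length; j++) {
        row[j] = rand.nextDouble() < 0.1 ? rand.nextGaussian() : 0.0;
      }
    }
    double[][] r = new double[50][10];         // k plus oversampling = 10 random vectors
    for (double[] row : r) {
      for (int j = 0; j < row.length; j++) {
        row[j] = rand.nextGaussian();
      }
    }
    double[][] y = powerRefinedSample(a, r);
    System.out.println("Y is " + y.length + " x " + y[0].length);
  }
}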

On Mon, Sep 28, 2009 at 11:49 PM, Jake Mannix <ja...@gmail.com> wrote:

> Total work done - three passes over the dataset.  Now I just need to
> read further into these papers and see how that gets turned into only
> *one* pass through!   But I'll save that for another day.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, Sep 28, 2009 at 5:25 PM, Ted Dunning <te...@gmail.com> wrote:

> Yes, but y probably has about as much data as A had.  A is sparse, y is
> dense, but they have about the same number of rows.  Y will have dozens to
> hundreds of non-zero elements per row which is probably comparable to A.
>

This is true, but is recomputing Y on the second pass worth not saving it?

> This means that dealing with y'y implicitly at the cost of passing over y is
> a major loss.  Better to go ahead and form y' y on the fly as y is formed,
> do an in-memory eigenvector computation and then pass over y to get Q (in
> the original algorithm notation).  Q is also quite large.
>

Both your technique and mine seem to require all the different mappers
building pieces of Y'Y, so the "in-memory pass" to get the small
decomposition of Y'Y is only going to be done after one full pass through
A.  If we're not storing Y anywhere, what are we going to do with the
k-by-k decomposition of Y'Y, other than to get Q, where you do another
pass over Y (computing Y on the fly, I guess you're suggesting, by really
doing a pass over A?)?

Maybe I'm just tired (6am flight this morning and I'm still up, why?), but
the tradeoff of not storing Y anywhere means that when you need to
compute Q, you have to recompute it from A, which requires another factor
of k * (avg. density of A) more flops, right?  If you've got
the decomposition of Y'Y = V S^2 V', and distributed V S^-1 to all the
mappers for the Q computation, then to get Q = Y (V S^-1) you need
to recompute y_i = a_i * R (requiring k sparse dot products, each with
j operations if the average density of A is j), then post-multiply
by the k x k matrix V S^-1.  If you were instead doing a pass over Y,
you'd avoid that entire multiplicative factor of k*j at the expense of
basically doubling your temporary HDFS usage.  Maybe I'm being dense (hah!
man, I should really get to bed, with humor like that), so what am I messing
up here?
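
To make the recompute path concrete, here is a rough sketch of rebuilding q_i on
the fly, assuming R is implicit via the seeded-PRNG trick from elsewhere in the
thread.  All class and method names are made up for illustration; this is not
Mahout or decomposer code.

import java.util.Random;

// Sketch of the "recompute Q on the fly" path: for each sparse row a_i of A,
// rebuild y_i = a_i * R from a seeded PRNG (so R is never stored), then
// post-multiply by the small k x k matrix V S^-1 to get q_i.
public class RecomputeQ {

  // Deterministic entry R[row][col]; every mapper sees the same implicit R.
  static double entryOfR(long seed, int row, int col, int k) {
    return new Random(seed ^ (row * (long) k + col)).nextGaussian();
  }

  // q_i = (a_i * R) * (V S^-1).
  // Cost per row: k dot products over the nonzeros of a_i (the k*j factor),
  // plus a k x k multiply.
  static double[] computeQRow(int[] indices, double[] values,
                              double[][] vsInv, long seed, int k) {
    double[] y = new double[k];                // y_i = a_i * R
    for (int col = 0; col < k; col++) {
      for (int nz = 0; nz < indices.length; nz++) {
        y[col] += values[nz] * entryOfR(seed, indices[nz], col, k);
      }
    }
    double[] q = new double[k];                // q_i = y_i * (V S^-1)
    for (int col = 0; col < k; col++) {
      for (int j = 0; j < k; j++) {
        q[col] += y[j] * vsInv[j][col];
      }
    }
    return q;
  }
}

The inner loop over the nonzeros of a_i is exactly the extra k*j factor in
question; the alternative is a straight read of the stored y_i.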

Either way, at this point you've got one half of the decomposition of A,
and for many use-cases you don't need to go any further: you already
have the singular values from your Y'Y decomposition, so the only thing
left to do is a transpose on your q_i to put them in the right form for
doing projections or folding in new data.

If you do really want Q'A, you've got it in a reasonable form to do this:
both Q and A are presented in the same "wrong" way (Q as n k-vectors:
y_i * (VS^-1), and A as a set of n sparse m-vectors: a_i), so doing the
outer-product trick works here too:

  Q'A = matrixSum_{i=1..n} ( q_i (x) a_i )

(where (x) is my lovely ascii rendition of a tensor product)
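
A minimal sketch of that accumulation (illustrative only, not Mahout code): each
row contributes a rank-one k x m update to Q'A, and because the per-row
contributions just sum, mappers and combiners can each keep a partial k x m sum
and merge them at the end.

// Accumulate Q'A = sum over rows of q_i (outer product) a_i, with a_i sparse.
public class OuterProductSum {

  // Fold one row's contribution q_i (x) a_i into the running k x m sum.
  static void accumulate(double[][] qtA, double[] qRow,
                         int[] aIndices, double[] aValues) {
    for (int row = 0; row < qRow.length; row++) {     // the k rows of Q'A
      double q = qRow[row];
      if (q == 0.0) continue;
      for (int nz = 0; nz < aIndices.length; nz++) {  // only touch nonzeros of a_i
        qtA[row][aIndices[nz]] += q * aValues[nz];
      }
    }
  }
}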

Total work done - three passes over the dataset.  Now I just need to
read further into these papers and see how that gets turned into only
*one* pass through!   But I'll save that for another day.

  -jake

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Ted Dunning <te...@gmail.com>.
Yes, but y probably has about as much data as A had.  A is sparse, y is
dense, but they have about the same number of rows.  Y will have dozens to
hundreds of non-zero elements per row which is probably comparable to A.

This means that dealing with y'y implicitly at the cost of passing over y is
a major loss.  Better to go ahead and form y' y on the fly as y is formed,
do an in-memory eigenvector computation and then pass over y to get Q (in
the original algorithm notation).  Q is also quite large.

Forming Q' A can be done in a merge step similar to the production of AR to
get S, which is wide rather than tall like y.  As such, it is probably better
to compute S' and proceed as before.  It isn't clear to me just off-hand how
this might be done with row-wise access to A, but that would be a key step.

On Mon, Sep 28, 2009 at 1:05 PM, Jake Mannix <ja...@gmail.com> wrote:

> When decomposing y'y, I tend to use a similar trick to this, which is
> probably
> equivalent: to multiply matrix by vector (y'y)*u, just do the vector sum of
> y_i * (y_i.dot(u)) over the rows y_i of y (in which case you store y, but
> never
> bother to store y'y - and just like the outer-product sum, the vector sum
> can
> be done partially in the combiners as well).   Of course, y'y is much
> smaller
> than y, so avoiding saving y'y is not really that much of a win - but for
> this
> algorithm, it seems you do probably need both y and y'y, in which case it
> might be nice to not have to ever really deal with y'y explicitly.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Jake Mannix <ja...@gmail.com>.
On Sun, Sep 27, 2009 at 9:37 PM, Ted Dunning <te...@gmail.com> wrote:

>
> So ... if R is defined by a PRNG with well defined seeds for each row, then
> you don't have to store it.  It is particularly nice if you can compute the
> n-th random number from a particular seed at high speed.  If you can do
> that, then multiplication by a sparse matrix is very clean.  You can
> probably do that with most generators by building a moral equivalent of a
> skip list ... just save the seed every few hundred elements.  Better to
> have
> a generator that can be iterated more efficiently, but usable.  Note the
> comments in the articles about how very poor generators work just fine.
>

Considering the PRNG is just needed to get vectors which aren't linearly
dependent, it seems like even using a completely random-access random
stream (like M_i,j = new Random(seed ^ (i * numRows + j)).nextInt() )
should work, requiring no memory at all.
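
A minimal sketch of that kind of implicit R (illustrative Java; it uses a
long-based index mix rather than the int expression above, just to avoid
collisions on big matrices, but the idea is identical):

import java.util.Random;

// Every entry of R is a pure function of (seed, i, j), so no part of R ever
// has to be stored or shipped to the mappers.
public class ImplicitRandomMatrix {
  private final long seed;
  private final int columns;   // k, the number of random vectors

  ImplicitRandomMatrix(long seed, int columns) {
    this.seed = seed;
    this.columns = columns;
  }

  // R[i][j], computable independently by any mapper, in any order.
  double get(int i, int j) {
    return new Random(seed ^ (i * (long) columns + j)).nextGaussian();
  }

  // y_i = a_i * R for a sparse row a_i given as parallel index/value arrays.
  double[] times(int[] indices, double[] values) {
    double[] y = new double[columns];
    for (int col = 0; col < columns; col++) {
      for (int nz = 0; nz < indices.length; nz++) {
        y[col] += values[nz] * get(indices[nz], col);
      }
    }
    return y;
  }
}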

> Now in a MR setting, A will probably be split into row-wise patches.  y = AR
> is the row-wise concatenation of these.  With the PRNG trick, each mapper
> can have an implicit copy of R.  Otherwise, it will have to be distributed
> as a side file to all mappers and a merge will be necessary.  But if we are
> after y' y as well, we can compute this by taking outer products of each
> row
> of y and summing the resulting k x k matrices.  This works wonderfully
> because combiners and such can combine partial map outputs.  You probably
> still have to store y itself, but with a few fancy keys and a multiple
> output format, you should be good to go.
>

When decomposing y'y, I tend to use a similar trick to this, which is
probably equivalent: to multiply the matrix by a vector, (y'y)*u, just do the
vector sum of y_i * (y_i.dot(u)) over the rows y_i of y (in which case you
store y, but never bother to store y'y - and just like the outer-product sum,
the vector sum can be done partially in the combiners as well).   Of course,
y'y is much smaller than y, so avoiding saving y'y is not really that much of
a win - but for this algorithm, it seems you probably do need both y and y'y,
in which case it might be nice to not have to ever really deal with y'y
explicitly.

  -jake

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Ted Dunning <te...@gmail.com>.
Awesome!

Let me know how it goes.  I would love to kibitz but can't code anything.

One thing that I did notice is that after computing y = AR, a critical step
is SVD of y.  This is best done, perhaps, by decomposing y'y since that is a
k x k matrix.

So ... if R is defined by a PRNG with well defined seeds for each row, then
you don't have to store it.  It is particularly nice if you can compute the
n-th random number from a particular seed at high speed.  If you can do
that, then multiplication by a sparse matrix is very clean.  You can
probably do that with most generators by building a moral equivalent of a
skip list ... just save the seed every few hundred elements.  Better to have
a generator that can be iterated more efficiently, but this is usable.  Note the
comments in the articles about how very poor generators work just fine.
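
A minimal sketch of that seed-checkpointing idea (illustrative only;
java.util.Random does not expose its internal state, so this version simply
re-seeds at block boundaries rather than saving a snapshot, which amounts to
the same thing):

import java.util.Random;

// "Moral equivalent of a skip list": the stream is defined by a base seed,
// each block of a few hundred elements has its own derived seed, and reading
// element n means seeking to n's block and iterating forward from its start.
public class SeekableRandomStream {
  private final long baseSeed;
  private final int blockSize;    // e.g. a few hundred elements per block

  SeekableRandomStream(long baseSeed, int blockSize) {
    this.baseSeed = baseSeed;
    this.blockSize = blockSize;
  }

  // The n-th element of the stream: seek to its block, then step forward.
  double get(long n) {
    long block = n / blockSize;
    Random r = new Random(baseSeed ^ block);   // the "saved seed" for this block
    double value = 0.0;
    for (long i = block * blockSize; i <= n; i++) {
      value = r.nextGaussian();                // iterate within the block
    }
    return value;
  }
}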

Now in a MR setting, A will probably be split into row-wise patches.  y = AR
is the row-wise concatenation of these.  With the PRNG trick, each mapper
can have an implicit copy of R.  Otherwise, it will have to be distributed
as a side file to all mappers and a merge will be necessary.  But if we are
after y' y as well, we can compute this by taking outer products of each row
of y and summing the resulting k x k matrices.  This works wonderfully
because combiners and such can combine partial map outputs.  You probably
still have to store y itself, but with a few fancy keys and a multiple
output format, you should be good to go.
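
A sketch of that accumulation step with the Hadoop wiring left out (illustrative
names, not Mahout classes): each mapper folds the k x k outer products of its own
rows of y into a partial sum, and combiners/reducers just add the partial k x k
matrices together.

// Running sum of y_i y_i', i.e. y'y, built one row at a time.
public class GramianAccumulator {
  private final double[][] sum;    // k x k partial total

  GramianAccumulator(int k) {
    this.sum = new double[k][k];
  }

  // Fold one dense row y_i into the running sum.
  void add(double[] y) {
    for (int i = 0; i < y.length; i++) {
      for (int j = 0; j < y.length; j++) {
        sum[i][j] += y[i] * y[j];
      }
    }
  }

  // Merge a partial sum from another mapper or combiner.
  void merge(GramianAccumulator other) {
    for (int i = 0; i < sum.length; i++) {
      for (int j = 0; j < sum.length; j++) {
        sum[i][j] += other.sum[i][j];
      }
    }
  }

  double[][] result() {
    return sum;
  }
}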

On Sun, Sep 27, 2009 at 9:03 PM, Jake Mannix <ja...@gmail.com> wrote:

>  That shouldn't be too hard to Hadoop-ify.  I'll see if I can try it out
> while porting over the decomposer stuff.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, Sep 25, 2009 at 1:28 PM, Ted Dunning <te...@gmail.com> wrote:

>
> With these randomized algorithms, you multiply several random vectors by
> your matrix A at one time to get A R = y.  Since you choose a few extra
> vectors in R, you know with high certainty that the projection of these
> random vectors y spans the range of A.  This means that you can use y to
> compute an orthonormal basis of A very accurately.
>
> So the key trick here is that unlike power methods, you aren't doing
> repeated multiplications by A.  That avoids the entire round-off explosion
> problem.
>

  Ok, so round-off explosion / MAX_DOUBLE overflow avoidance is great, I
dig that part.

  So the way this is going is that instead of computing A r, A(A r), etc.,
you just compute A R, which really is the same algorithmic complexity as
Lanczos (at first I was assuming that "one pass" through the matrix meant
that you'd be avoiding the factor of "k" in your asymptotic complexity of the
rank-k decomposition, which would have been a pretty neat trick!).

> Moreover, since all of these multiplications are independent, you can do
> these multiplications row by row on A.  As you do this, you can store A R
> and at the same time be recursively doing a random decomposition on AR so
> that after one pass of the data, you have a small dense matrix to work on.
> Another pass over AR and one on A' later and you have full decomposition.
> If you only want one of the sets of eigenvectors, then you only need the
> pass over AR which is much smaller than A.
>

  Oh now that is indeed clever!  N x N -> N x k -> k x k -> multiply back
and you're pretty much done.

  That shouldn't be too hard to Hadoop-ify.  I'll see if I can try it out
while porting over the decomposer stuff.

  -jake

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Ted Dunning <te...@gmail.com>.
The basic reason is the Johnson-Lindenstrauss lemma, which says that you can
approximate the structure of an operator with a small number of random
vectors.  Krylov subspace methods like Lanczos use only a single random
vector, and with repeated multiplication, round-off errors shoot off in the
direction of the larger eigenvectors.  When looking for the smaller
eigenvalues, this makes re-orthogonalization necessary.

With these randomized algorithms, you multiply several random vectors by
your matrix A at one time to get A R = y.  Since you choose a few extra
vectors in R, you know with high certainty that the projection of these
random vectors y spans the range of A.  This means that you can use y to
compute an orthonormal basis of A very accurately.

So the key trick here is that unlike power methods, you aren't doing
repeated multiplications by A.  That avoids the entire round-off explosion
problem.

Moreover, since all of these multiplications are independent, you can do
these multiplications row by row on A.  As you do this, you can store A R
and at the same time be recursively doing a random decomposition on AR so
that after one pass of the data, you have a small dense matrix to work on.
Another pass over AR and one on A' later and you have full decomposition.
If you only want one of the sets of eigenvectors, then you only need the
pass over AR which is much smaller than A.
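
For concreteness, the "orthonormal basis from y" step is nothing more exotic than
a QR of the tall, skinny n x k matrix y.  A small in-memory sketch via modified
Gram-Schmidt (plain illustrative Java; the distributed variants discussed
elsewhere in the thread go through the k x k matrix y'y instead):

// Orthonormalize the k columns of y in place; afterwards the columns form the
// basis Q of the (approximate) range of A.
public class GramSchmidtBasis {

  static void orthonormalizeColumns(double[][] y) {
    int n = y.length;
    int k = y[0].length;
    for (int col = 0; col < k; col++) {
      for (int prev = 0; prev < col; prev++) {   // remove components along earlier columns
        double dot = 0.0;
        for (int i = 0; i < n; i++) {
          dot += y[i][col] * y[i][prev];
        }
        for (int i = 0; i < n; i++) {
          y[i][col] -= dot * y[i][prev];
        }
      }
      double norm = 0.0;                         // then normalize this column
      for (int i = 0; i < n; i++) {
        norm += y[i][col] * y[i][col];
      }
      norm = Math.sqrt(norm);
      for (int i = 0; i < n; i++) {
        y[i][col] /= norm;                       // assumes the columns are independent
      }
    }
  }
}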

On Fri, Sep 25, 2009 at 12:21 PM, Jake Mannix <ja...@gmail.com> wrote:

> On Fri, Sep 25, 2009 at 12:11 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > a) the random vectors are multiplied by A simultaneously rather than
> > sequentially as in Lanczos
> >
> > b) no risk of losing orthogonality
> >
>
> I haven't read the papers in detail yet, have you seen how these two points
> are done?  In a parallel environment, how do you maintain orthogonality
> while looking for multiple projection directions?  What's to keep you from
> finding the same directions?




-- 
Ted Dunning, CTO
DeepDyve

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Jake Mannix <ja...@gmail.com>.
On Fri, Sep 25, 2009 at 12:11 PM, Ted Dunning <te...@gmail.com> wrote:

> a) the random vectors are multiplied by A simultaneously rather than
> sequentially as in Lanczos
>
> b) no risk of losing orthogonality
>

I haven't read the papers in detail yet; have you seen how these two points
are done?  In a parallel environment, how do you maintain orthogonality while
looking for multiple projection directions?  What's to keep you from finding
the same directions?

> d) it looks like it is possible to compute the decomposition in one or two
> passes over A.  This point alone is huge
>

Yeah, that is huge!

  -jake

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Ted Dunning <te...@gmail.com>.
My impression is that the advantages over Lanczos include the following
points:

a) the random vectors are multiplied by A simultaneously rather than
sequentially as in Lanczos

b) no risk of losing orthogonality

c) the bounds are known so we know how many extra vectors we need to use.
With Lanczos, you are never quite sure how many you need.

d) it looks like it is possible to compute the decomposition in one or two
passes over A.  This point alone is huge

On Fri, Sep 25, 2009 at 12:03 PM, Jake Mannix <ja...@gmail.com> wrote:

>  Those look very cool, I'd love to see how those compare with doing
> plain-old Lanczos for SVD on Hadoop.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Jake Mannix <ja...@gmail.com>.
I looked at them a while ago, and part of the issue is that general-purpose
vectors need to be able to do all things for all people... when you give that
up, and say that your sparse vectors are, say, immutable after creation and
only designed to do operations like dot(DenseVector) and
DenseVector.plus(SparseVector) (and maybe DenseVector.plus(SparseVector,
double multiple), for doing u + b*v), then you can get away with being very
simple - just parallel arrays of indexes and values for the sparse vectors,
and double[] arrays for the dense ones.
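
A minimal sketch of what that stripped-down design looks like (illustrative, not
the actual decomposer or Mahout vector classes):

// Immutable sparse vector as parallel index/value arrays, with only the
// operations the decompositions need: dot against a dense vector, and
// u := u + b * v applied to a dense accumulator.
public class SimpleSparseVector {
  private final int[] indices;     // sorted column indices of the nonzeros
  private final double[] values;   // corresponding values

  public SimpleSparseVector(int[] indices, double[] values) {
    this.indices = indices.clone();    // defensive copies keep it immutable
    this.values = values.clone();
  }

  // dot(DenseVector): only touches the nonzeros.
  public double dot(double[] dense) {
    double sum = 0.0;
    for (int i = 0; i < indices.length; i++) {
      sum += values[i] * dense[indices[i]];
    }
    return sum;
  }

  // DenseVector.plus(SparseVector, double multiple): u := u + b * v, in place.
  public void addTo(double[] dense, double multiple) {
    for (int i = 0; i < indices.length; i++) {
      dense[indices[i]] += multiple * values[i];
    }
  }
}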

I'll see if I can compare performance of my specialized simple vectors and
general purpose ones,
and see what the difference is.

  -jake

On Fri, Sep 25, 2009 at 12:08 PM, Ted Dunning <te...@gmail.com> wrote:

> Consider improving the versions in Mahout while you are at it!
>
> On Fri, Sep 25, 2009 at 12:03 PM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > but I'm currently using mostly my own sparse and dense
> > vector writables for use on Hadoop (designed specifically for things like
> > Lanczos and AGHA), so I'd need to port them over to use whichever vector
> > impls Mahout is using
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Ted Dunning <te...@gmail.com>.
Consider improving the versions in Mahout while you are at it!

On Fri, Sep 25, 2009 at 12:03 PM, Jake Mannix <ja...@gmail.com> wrote:

> but I'm currently using mostly my own sparse and dense
> vector writables for use on Hadoop (designed specifically for things like
> Lanczos and AGHA), so I'd need to port them over to use whichever vector
> impls Mahout is using
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Ted Dunning <te...@gmail.com>.
YES.

On Fri, Sep 25, 2009 at 12:03 PM, Jake Mannix <ja...@gmail.com> wrote:

> I noticed that we still
> don't have any large-scale SVD impls in Mahout.  Is there any interest by
> the community for me to try and port that / contribute this to Mahout?
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Jake Mannix <ja...@gmail.com>.
I only ask because I'm not sure if someone was already working on it, but
I'll file a ticket and see about porting some of it over and making some
patches.

Mahout's on Hadoop 0.20 and commons-math-2.0 by now, IIRC?  That's what
decomposer is using.

  -jake

On Fri, Sep 25, 2009 at 12:10 PM, Robin Anil <ro...@gmail.com> wrote:

> You don't have to ask.  Please go ahead and file a JIRA issue, and start
> working on it.
> http://issues.apache.org/jira/browse/MAHOUT
>
> Robin
>
>
> On Sat, Sep 26, 2009 at 12:33 AM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> >  Those look very cool, I'd love to see how those compare with doing
> > plain-old Lanczos for SVD on Hadoop.  Speaking of which, I've got an
> > implementation of that which I wrote up for my own matrix library (
> > http://decomposer.googlecode.com ) a while back, and I noticed that we
> > still
> > don't have any large-scale SVD impls in Mahout.  Is there any interest by
> > the community for me to try and port that / contribute this to Mahout?
> >  It's
> > Apache-licensed, but I'm currently using mostly my own sparse and dense
> > vector writables for use on Hadoop (designed specifically for things like
> > Lanczos and AGHA), so I'd need to port them over to use whichever vector
> > impls Mahout is using.
> >
> >  -jake
> >
> > On Fri, Sep 25, 2009 at 11:25 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > Isabel,
> > >
> > > Very interesting post.  Here are more accessible resources:
> > >
> > > http://arxiv.org/abs/0909.4061
> > > http://www.pnas.org/content/104/51/20167
> > >
> > > These provide a very interesting and solid link between random indexing
> > and
> > > SVD algorithms.  They also definitely provide a fantastic way to
> > implement
> > > large scale SVD using map-reduce.
> > >
> > > Nice pointer!
> > >
> > > 2009/9/25 Michael Brückner <mi...@cs.uni-potsdam.de>
> > >
> > > >
> > > > year's NIPS (http://nips.cc/Conferences/2009/Program/event.php?ID=1491)
> > >
> > >
> > >
> > >
> > > --
> > > Ted Dunning, CTO
> > > DeepDyve
> > >
> >
>

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Robin Anil <ro...@gmail.com>.
You don't have to ask.  Please go ahead and file a JIRA issue, and start working
on it.
http://issues.apache.org/jira/browse/MAHOUT

Robin


On Sat, Sep 26, 2009 at 12:33 AM, Jake Mannix <ja...@gmail.com> wrote:

>  Those look very cool, I'd love to see how those compare with doing
> plain-old Lanczos for SVD on Hadoop.  Speaking of which, I've got an
> implementation of that which I wrote up for my own matrix library (
> http://decomposer.googlecode.com ) a while back, and I noticed that we
> still
> don't have any large-scale SVD impls in Mahout.  Is there any interest by
> the community for me to try and port that / contribute this to Mahout?
>  It's
> Apache-licensed, but I'm currently using mostly my own sparse and dense
> vector writables for use on Hadoop (designed specifically for things like
> Lanczos and AGHA), so I'd need to port them over to use whichever vector
> impls Mahout is using.
>
>  -jake
>
> On Fri, Sep 25, 2009 at 11:25 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > Isabel,
> >
> > Very interesting post.  Here are more accessible resources:
> >
> > http://arxiv.org/abs/0909.4061
> > http://www.pnas.org/content/104/51/20167
> >
> > These provide a very interesting and solid link between random indexing
> and
> > SVD algorithms.  They also definitely provide a fantastic way to
> implement
> > large scale SVD using map-reduce.
> >
> > Nice pointer!
> >
> > 2009/9/25 Michael Brückner <mi...@cs.uni-potsdam.de>
> >
> > >
> > > year's NIPS (http://nips.cc/Conferences/2009/Program/event.php?ID=1491)
> >
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>

Re: Making Very Large-Scale Linear Algebraic Computations Possible Via Randomization

Posted by Jake Mannix <ja...@gmail.com>.
  Those look very cool, I'd love to see how those compare with doing
plain-old Lanczos for SVD on Hadoop.  Speaking of which, I've got an
implementation of that which I wrote up for my own matrix library (
http://decomposer.googlecode.com ) a while back, and I noticed that we still
don't have any large-scale SVD impls in Mahout.  Is there any interest by
the community for me to try and port that / contribute this to Mahout?  It's
Apache-licensed, but I'm currently using mostly my own sparse and dense
vector writables for use on Hadoop (designed specifically for things like
Lanczos and AGHA), so I'd need to port them over to use whichever vector
impls Mahout is using.

  -jake

On Fri, Sep 25, 2009 at 11:25 AM, Ted Dunning <te...@gmail.com> wrote:

> Isabel,
>
> Very interesting post.  Here are more accessible resources:
>
> http://arxiv.org/abs/0909.4061
> http://www.pnas.org/content/104/51/20167
>
> These provide a very interesting and solid link between random indexing and
> SVD algorithms.  They also definitely provide a fantastic way to implement
> large scale SVD using map-reduce.
>
> Nice pointer!
>
> 2009/9/25 Michael Brückner <mi...@cs.uni-potsdam.de>
>
> >
> > year's NIPS (http://nips.cc/Conferences/2009/Program/event.php?ID=1491)
>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>