Posted to user@mahout.apache.org by Tamas Jambor <ja...@googlemail.com> on 2010/05/06 19:28:28 UTC

distributed SVD

I am looking into the problem of distributed SVD for recommender
systems. Does anyone know whether someone else has tried to tackle this
problem before?

Re: distributed SVD

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, May 6, 2010 at 2:18 PM, Ted Dunning <te...@gmail.com> wrote:
>
> The only issue with the current distributed Lanczos solver is storage for
> the auxiliary matrices as they are produced.  Jake intimated that he had a
> solution for that which wasn't prime-time yet.
>

Measuring exactly how much this can affect you is easy, as well: the
eigenvectors which currently live in memory on the "driver" machine (where
you launch your hadoop job from) take up k*M*8 bytes, where k is the number
of eigenvectors you're aiming for, and M is the dimensionality of the rows
of your input matrix.  The current code may actually need twice this value,
because the basis that is built up is transformed once during an
orthonormalization step.

So if you want 250 singular vectors, and your input matrix is comprised of
twenty seven gazillion input vectors, each of dimension 10 million, then you
would need between 20-40GB of RAM on the driver box (this value being
independent of what "a gazillion" is, of course).  If there's only 1 million
features per row, instead of 10 million, this drops down to 2-4GB needed.
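
To make that arithmetic concrete, here's a tiny back-of-the-envelope helper
(plain Java, nothing Mahout-specific; the factor of 2 accounts for the
transformed copy of the basis mentioned above):

  public class LanczosMemoryEstimate {
    /** Driver-side RAM for the basis: k vectors of dimension m, 8 bytes per double. */
    static long basisBytes(int k, long m) {
      return k * m * 8L;
    }

    public static void main(String[] args) {
      int k = 250;                   // desired singular vectors
      long m = 10000000L;            // dimensionality of each input row
      long low = basisBytes(k, m);   // the basis itself
      long high = 2L * low;          // plus the orthonormalization copy
      // prints roughly 18 to 37 GiB -- the "20-40GB" ballpark above
      System.out.println("between " + (low >> 30) + " and " + (high >> 30) + " GiB");
    }
  }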

The solution I have, which isn't ready for prime time, is very simple, and
could be done by almost anyone with the time (you don't need to know much
fancy linear algebra or SVD internals; I don't have the time right now, as
I'm traveling abroad and have a million "real life" issues to deal with at
present): if you look in LanczosSolver.solve(), the SparseRowMatrix called
"basis" is used, as one might imagine, to store the set of basis vectors
which the Lanczos iteration is generating.  Instead of keeping that set in
memory, on line 136 the new vector could be persisted to disk (either local,
or to HDFS [as a DistributedRowMatrix], or a DB if reliability is a concern).
The only catch is that the basis matrix is currently used inside the loop on
line 126:

  orthoganalizeAgainstAllButLast(nextVector, basis);

Technically, this is overkill, as Lanczos should have already made sure
that nextVector is orthogonal to all of the previous vectors in the basis,
but I don't completely trust that all of the overflow/underflow issues that
come with finite-precision doubles, applied to matrix multiplication with
enormous matrices, have been dealt with.  Ideally we could have a flag for
LanczosSolver.solve(), "do it safely / do it fast", and either keep that
line or skip it.  In the "do it safely" case, the slow way is to use the
disk-saved matrix to do this step either locally (one vector at a time) or
via another small hadoop job (which should actually just be:
nextVector.assign(basis.timesSquared(nextVector), Functions.minus) ).
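
A rough sketch of what that could look like, with plain double[] arrays and
local files standing in for the real thing (the class and file names here
are made up for illustration; the actual change would live inside
LanczosSolver.solve() and could write to HDFS as a DistributedRowMatrix
instead of local disk):

  import java.io.*;

  /** Keeps the Lanczos basis on disk instead of in a SparseRowMatrix in RAM. */
  public class DiskBackedBasis {
    private final File dir;
    private int size = 0;

    public DiskBackedBasis(File dir) { this.dir = dir; }

    /** Persist each finished basis vector as soon as it is produced. */
    public void append(double[] v) throws IOException {
      try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(
          new FileOutputStream(new File(dir, "basis-" + size))))) {
        for (double x : v) out.writeDouble(x);
      }
      size++;
    }

    /**
     * The "do it safely" path: stream the stored vectors back one at a time and
     * subtract nextVector's projection onto each (assumes each stored basis
     * vector has unit norm, which the Lanczos orthonormalization provides).
     */
    public void orthogonalizeAgainstAll(double[] nextVector) throws IOException {
      double[] b = new double[nextVector.length];
      for (int i = 0; i < size; i++) {
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(
            new FileInputStream(new File(dir, "basis-" + i))))) {
          for (int j = 0; j < b.length; j++) b[j] = in.readDouble();
        }
        double dot = 0.0;
        for (int j = 0; j < b.length; j++) dot += b[j] * nextVector[j];
        for (int j = 0; j < b.length; j++) nextVector[j] -= dot * b[j];
      }
    }
  }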

Similarly, the loop at lines 157-172 is just building out the actual
eigenvectors from the basis and the (small) eigenvectors of the tri-diagonal
auxiliary matrix.  This could be done by a small hadoop job, or just done
without hanging onto all of the eigenvectors in memory at the same time
(since none of them require each other to be computed).
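
In other words, each full eigenvector is just a weighted sum of the Lanczos
basis vectors, with the weights taken from the matching small eigenvector of
the tri-diagonal matrix -- something like this sketch (plain arrays again,
not the actual solver code; the basis rows could equally be streamed from
disk or read in a small hadoop job):

  /** Build one full eigenvector; none of the others are needed to compute it. */
  static double[] buildEigenvector(double[][] basis, double[] smallEigen) {
    int m = basis[0].length;          // dimensionality of the original rows
    double[] result = new double[m];
    for (int j = 0; j < smallEigen.length; j++) {   // one basis vector at a time
      for (int d = 0; d < m; d++) {
        result[d] += smallEigen[j] * basis[j][d];
      }
    }
    return result;   // can be written out and discarded before building the next one
  }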

If someone wants to submit a patch doing some or all of this, I'd be happy
to review and commit it.  Otherwise I'll get it done sometime in the next
couple of months, but possibly not before mid-July or so.

  -jake


>
> On Thu, May 6, 2010 at 12:20 PM, Jake Mannix <ja...@gmail.com>
> wrote:
>
> > Tamas,
> >
> >  MAHOUT-371 will be able to leverage the existing DistributedLanczosSolver
> > and DistributedRowMatrix (in o.a.m.math.decomposer.hadoop package in core)
> > to do full sparse truncated SVD on the entire user-item matrix already, so
> > that part is taken care of.
> >
> >  -jake
> >
> > On Thu, May 6, 2010 at 11:38 AM, Tamas Jambor <jamborta@googlemail.com
> > >wrote:
> >
> > > that looks interesting, but quite general. I'd be interested to know
> > > how he plans to divide the task that will be distributed. I mean SVD in
> > > general takes the whole user-item matrix, so it will be challenging to
> > > find a good way to divide the task. Papers written on SVD do not discuss
> > > this aspect, as far as I know.
> > >
> > >
> > > On 06/05/2010 18:32, Sean Owen wrote:
> > >
> > >> We're lucky to have a GSoC student implementing this over the summer:
> > >> https://issues.apache.org/jira/browse/MAHOUT-371
> > >>
> > >> On Thu, May 6, 2010 at 6:28 PM, Tamas Jambor<ja...@googlemail.com>
> > >>  wrote:
> > >>
> > >>
> > >>> I am looking into the problem of distributed SVD for recommender
> > >>> systems. does anyone know whether someone else tried to tackle this
> > >>> problem before?
> > >>>
> > >>>
> > >>>
> > >>
> >
>

Re: distributed SVD

Posted by Jake Mannix <ja...@gmail.com>.
On Thu, May 6, 2010 at 2:18 PM, Ted Dunning <te...@gmail.com> wrote:
>
> The only issue with the current distributed Lanczos solver is storage for
> the auxiliary matrices as they are produced.  Jake intimated that he had a
> solution for that which wasn't prime-time yet.
>

I think what you were referring to, actually, Ted, is that I already have
code which implements MAHOUT-376, but that code is not quite ready for
prime-time.  It should also get tested, and a patch posted, sometime in the
next 4-6 weeks, hopefully.

  -jake

Re: distributed SVD

Posted by Ted Dunning <te...@gmail.com>.
Tamas,

See https://issues.apache.org/jira/browse/MAHOUT-376 as well.  I have been
using these techniques in R and want to push them into Mahout over the next
few months for more scalability.

The only issue with the current distributed Lanczos solver is storage for
the auxiliary matrices as they are produced.  Jake intimated that he had a
solution for that which wasn't prime-time yet.

On Thu, May 6, 2010 at 12:20 PM, Jake Mannix <ja...@gmail.com> wrote:

> Tamas,
>
>  MAHOUT-371 will be able to leverage the existing DistributedLanczosSolver
> and DistributedRowMatrix (in o.a.m.math.decomposer.hadoop package in core)
> to do full sparse truncated SVD on the entire user-item matrix already, so
> that part is taken care of.
>
>  -jake
>
> On Thu, May 6, 2010 at 11:38 AM, Tamas Jambor <jamborta@googlemail.com
> >wrote:
>
> > that looks interesting, but quite general. I'd be interested to know how
> > he plans to divide the task that will be distributed. I mean SVD in
> > general takes the whole user-item matrix, so it will be challenging to
> > find a good way to divide the task. Papers written on SVD do not discuss
> > this aspect, as far as I know.
> >
> >
> > On 06/05/2010 18:32, Sean Owen wrote:
> >
> >> We're lucky to have a GSoC student implementing this over the summer:
> >> https://issues.apache.org/jira/browse/MAHOUT-371
> >>
> >> On Thu, May 6, 2010 at 6:28 PM, Tamas Jambor<ja...@googlemail.com>
> >>  wrote:
> >>
> >>
> >>> I am looking into the problem of distributed SVD for recommender
> >>> systems. does anyone know whether someone else tried to tackle this
> >>> problem before?
> >>>
> >>>
> >>>
> >>
>

Re: distributed SVD

Posted by Jake Mannix <ja...@gmail.com>.
Tamas,

  MAHOUT-371 will be able to leverage the existing DistributedLanczosSolver
and DistributedRowMatrix (in o.a.m.math.decomposer.hadoop package in core)
to do full sparse truncated SVD on the entire user-item matrix already, so
that part is taken care of.
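
Just to spell out what that truncated SVD buys you on the recommender side
(this is only the standard rank-k projection, written with plain arrays --
not Mahout code, and not necessarily how MAHOUT-371 will wire things up):
with the top k right singular vectors of the user-item matrix in hand, a
user's row of known ratings can be projected into the k-dimensional item
space and mapped back out to get rank-k predicted ratings.

  /**
   * Rank-k prediction for one user: project the user's known ratings onto the
   * k right singular vectors of the user-item matrix, then map back to item
   * space.  itemFactors[i] is the i-th right singular vector, of length
   * numItems; the singular values cancel in this projection form.
   */
  static double[] predictRatings(double[] userRatings, double[][] itemFactors) {
    int k = itemFactors.length;
    int numItems = userRatings.length;
    double[] latent = new double[k];           // latent = ratings * V_k
    for (int i = 0; i < k; i++) {
      for (int j = 0; j < numItems; j++) {
        latent[i] += itemFactors[i][j] * userRatings[j];
      }
    }
    double[] predicted = new double[numItems]; // predicted = latent * V_k^T
    for (int j = 0; j < numItems; j++) {
      for (int i = 0; i < k; i++) {
        predicted[j] += itemFactors[i][j] * latent[i];
      }
    }
    return predicted;
  }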

  -jake

On Thu, May 6, 2010 at 11:38 AM, Tamas Jambor <ja...@googlemail.com> wrote:

> that looks interesting, but quite general. I'd be interested to know how he
> plans to divide the task that will be distributed. I mean SVD in general
> takes the whole user-item matrix, so it will be challenging to find a good
> way to divide the task. Papers written on SVD do not discuss this aspect, as
> far as I know.
>
>
> On 06/05/2010 18:32, Sean Owen wrote:
>
>> We're lucky to have a GSoC student implementing this over the summer:
>> https://issues.apache.org/jira/browse/MAHOUT-371
>>
>> On Thu, May 6, 2010 at 6:28 PM, Tamas Jambor<ja...@googlemail.com>
>>  wrote:
>>
>>
>>> I am looking into the problem of distributed SVD for recommender systems.
>>> does anyone know whether someone else tried to tackle this problem
>>> before?
>>>
>>>
>>>
>>

Re: distributed SVD

Posted by Tamas Jambor <ja...@googlemail.com>.
That looks interesting, but quite general. I'd be interested to know how
he plans to divide up the work that will be distributed. I mean, SVD in
general takes the whole user-item matrix, so it will be challenging to
find a good way to divide the task. Papers written on SVD do not discuss
this aspect, as far as I know.

On 06/05/2010 18:32, Sean Owen wrote:
> We're lucky to have a GSoC student implementing this over the summer:
> https://issues.apache.org/jira/browse/MAHOUT-371
>
> On Thu, May 6, 2010 at 6:28 PM, Tamas Jambor<ja...@googlemail.com>  wrote:
>    
>> I am looking into the problem of distributed SVD for recommender systems.
>> does anyone know whether someone else tried to tackle this problem before?
>>
>>      

Re: distributed SVD

Posted by Sean Owen <sr...@gmail.com>.
We're lucky to have a GSoC student implementing this over the summer:
https://issues.apache.org/jira/browse/MAHOUT-371

On Thu, May 6, 2010 at 6:28 PM, Tamas Jambor <ja...@googlemail.com> wrote:
> I am looking into the problem of distributed SVD for recommender systems.
> does anyone know whether someone else tried to tackle this problem before?
>