Posted to dev@mahout.apache.org by Jake Mannix <ja...@gmail.com> on 2010/09/08 22:31:22 UTC

Monstrous matrices...

Hey all,

  It's been a while since I've been all that active, but I've been trying to
run some of the matrix stuff against really monstrous matrices (many of you
are aware that I switched jobs recently, and my new gig has some pretty
ridiculous data sets), and I'm also trying to get things to run in a
maximally scalable way: in particular, running on large Hadoop clusters, but
without the luxury of increasing the mapper/reducer heaps, and without the
luxury of a ton of memory on the machine I'm driving the mahout jobs from.

  What this means is I've been running into little things like dependencies
on O(numColumns) memory, or forgetting to override defaults and make sure
you're using a bunch of reducers, or needing the ability to choose among
several different implementations (or better: not having to choose), such as
when you multiply two matrices and one of them might be small enough to fit
in memory, or alternately, you're multiplying a matrix by a really, really
big vector, and even the vector itself doesn't fit in memory.
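To make the "not having to choose" point concrete, here's a hypothetical sketch (none of these names exist in Mahout; the 8-bytes-per-double dense estimate is an assumption) of picking a multiply strategy from operand size instead of making the caller pick an implementation:

```java
/** Hypothetical sketch: choose a multiply strategy from operand size,
 *  so callers don't have to pick an implementation themselves. */
class MultiplyPlanner {
  enum Strategy { IN_MEMORY_BROADCAST, REDUCE_SIDE_JOIN }

  // assumes a dense operand at 8 bytes per double
  static Strategy choose(long rows, long cols, long heapBytesAvailable) {
    long denseBytes = rows * cols * 8L;
    return denseBytes <= heapBytesAvailable
        ? Strategy.IN_MEMORY_BROADCAST  // ship the small operand to every mapper
        : Strategy.REDUCE_SIDE_JOIN;    // stream both operands through a join
  }
}
```

A 250M-entry dense vector (2GB at 8 bytes per double) would land in the streaming branch even with a 1GB heap, which is exactly the situation described above.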

  So I've been playing with a few different ways to approach some of these
issues, and I thought I'd bring them up here for any thoughts:


   -   The distributed SVD code currently stores, in memory (on the
   "driving" computer), desiredRank Vectors (each of size numCols of the input
   matrix) as a basis, to be used for eigen-generation.  Then, the eigenvectors
   (all desiredRank of them) are computed as linear combinations of this basis
   (essentially finalEigens = lanczosEigens.times(basis), where lanczosEigens
   are the eigenvectors of the small (desiredRank x desiredRank) tri-diagonal
   auxiliary matrix generated during power iteration).  Net result: the current
   algorithm actually needs (desiredRank * numCols * 16) bytes available on the
   driving machine, and since Lanczos currently grabs both large-eigenvalue and
   small-eigenvalue eigenvectors, and the user typically wants only one or the
   other, they're typically going to need to ask for 2-3x as high a desiredRank
   as they truly desire.  So for 100K columns, a rank 250 SVD currently takes
   about 1GB, and for 1M columns, rank 250 is up to 10GB.  A bit hefty for a
   "scalable" algorithm.
   - Even once the above is taken care of (by paging vectors off to local
   disk), the issue of dealing with some DenseVector instances with hundreds of
   millions of entries comes up sometimes.  250M doubles is 4GB for a single
   vector.  Hanging onto even one of these in a Reducer is unacceptably
   unscalable when the dimensionality gets up that high.  Backing a DenseVector
   with a SequenceFile<IntWritable,DoubleWritable> might be the right way to
   go, and would allow for a variety of distributed computations with vectors
   of this monstrous size, and would be truly scalable, although many
   algorithms which could currently be done in one MR pass (or "k") would take
   two (or 2*k), by needing reduce-side joins interleaved in their main
   activities.
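As a back-of-the-envelope check of the numbers in the first point (a sketch, assuming 8 bytes per double and that the factor of 16 comes from the driver holding two dense copies: the Lanczos basis plus the final eigenvectors):

```java
public class SvdMemoryEstimate {
  // two dense copies (Lanczos basis + final eigenvectors) at 8 bytes per double
  static long driverBytes(long requestedRank, long numCols) {
    return requestedRank * numCols * 2L * 8L;
  }

  public static void main(String[] args) {
    // wanting rank 250, but requesting ~2.5x to cover the unwanted eigenvectors
    long requested = 625;
    System.out.println(driverBytes(requested, 100000L));  // 1000000000, ~1GB
    System.out.println(driverBytes(requested, 1000000L)); // 10000000000, ~10GB
  }
}
```

which lines up with the ~1GB and ~10GB figures quoted above.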

  One question I'm wrestling with is whether we should have disk-backed
(local or HDFS) vectors and matrices transparently implement Vector and
Matrix, or whether this will lead to confusion and trick people into
treating them in the same way they think of in-memory versions (i.e. think
they can do lots of random-access operations, which just thrash and aren't
performant).
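For illustration, one hypothetical shape the fail-fast option could take (a sketch only; this is not the Mahout Vector interface, and a plain array stands in for the actual disk or SequenceFile backing): support sequential streaming, but make random accessors throw so misuse surfaces immediately:

```java
import java.util.NoSuchElementException;

/** Hypothetical sketch: a disk-backed vector that supports streaming
 *  access but fails fast on random access. */
interface StreamingVector {
  int size();
  boolean hasNext();
  double next();         // sequential read
  double get(int index); // random access
}

class FileBackedVector implements StreamingVector {
  private final double[] backing; // stand-in for a file/SequenceFile stream
  private int cursor = 0;

  FileBackedVector(double[] backing) { this.backing = backing; }

  public int size() { return backing.length; }
  public boolean hasNext() { return cursor < backing.length; }

  public double next() {
    if (!hasNext()) throw new NoSuchElementException();
    return backing[cursor++];
  }

  // random access would thrash on a real disk-backed store: fail early instead
  public double get(int index) {
    throw new UnsupportedOperationException(
        "random access not supported on disk-backed vectors");
  }
}
```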

  Any thoughts?

  -jake

Re: Monstrous matrices...

Posted by Jake Mannix <ja...@gmail.com>.
Ok cool beans, I'll try and get some form of my (local or hdfs) disk-backed
vectors and matrices in a patch at some point soon, as well as a
 SequenceFile<IntWritable,DoubleWritable> vector if I can get around to it
after that.

My main aim is to remove the various memory-scaling constraints (which don't
get hit until the tens to hundreds of millions of columns, typically) in the
distributed SVD code, but I have a feeling some of this will be more
generally useful elsewhere in the codebase.
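As a sketch of what computing with a SequenceFile<IntWritable,DoubleWritable>-style vector looks like (using plain Iterators over sorted (index, value) entries as a stand-in for the SequenceFile stream), here is the merge-join that a dot product reduces to; this is also the work a reduce-side join would be doing for you:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Iterator;
import java.util.List;
import java.util.Map.Entry;

/** Hypothetical sketch: dot product of two vectors stored as sorted
 *  (index, value) streams, the access pattern a sequence-file-backed
 *  vector forces. */
class StreamDot {
  static double dot(Iterator<Entry<Integer, Double>> a,
                    Iterator<Entry<Integer, Double>> b) {
    double sum = 0.0;
    Entry<Integer, Double> ea = a.hasNext() ? a.next() : null;
    Entry<Integer, Double> eb = b.hasNext() ? b.next() : null;
    while (ea != null && eb != null) {
      int cmp = ea.getKey().compareTo(eb.getKey());
      if (cmp == 0) {
        sum += ea.getValue() * eb.getValue(); // indices match: accumulate
        ea = a.hasNext() ? a.next() : null;
        eb = b.hasNext() ? b.next() : null;
      } else if (cmp < 0) {
        ea = a.hasNext() ? a.next() : null;   // advance the lagging stream
      } else {
        eb = b.hasNext() ? b.next() : null;
      }
    }
    return sum;
  }
}
```

The single sequential pass over each stream is what keeps the memory footprint constant regardless of dimensionality.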

  -jake

On Fri, Sep 10, 2010 at 12:12 PM, Ted Dunning <te...@gmail.com> wrote:

> On Fri, Sep 10, 2010 at 7:18 AM, Isabel Drost <is...@apache.org> wrote:
>
> > > I'd say add the disk backed ones and we'll worry about education
> > > separately.
> >
> > +1
> >
>
> +1
>
>
> >
> >
> > > Perhaps it's possible for the vector to keep track of
> > > thrashing and spit out warnings. Either that or you override the
> > > random accessors on the file-based ones to throw exceptions so that
> > > it fails early for users.
> >
> > The first option gives users more freedom about what to do with the
> > implementation - while the latter one probably saves us a few mails
> > about the implementation being sooooo slow... I'd be fine to go with
> > either option.
>
>
> I think fail early is the right answer.  Nobody much minds that remove
> throws exception on all of our iterators.
>

Re: Monstrous matrices...

Posted by Ted Dunning <te...@gmail.com>.
On Fri, Sep 10, 2010 at 7:18 AM, Isabel Drost <is...@apache.org> wrote:

> > I'd say add the disk backed ones and we'll worry about education
> > separately.
>
> +1
>

+1


>
>
> > Perhaps it's possible for the vector to keep track of
> > thrashing and spit out warnings. Either that or you override the
> > random accessors on the file-based ones to throw exceptions so that
> > it fails early for users.
>
> The first option gives users more freedom about what to do with the
> implementation - while the latter one probably saves us a few mails
> about the implementation being sooooo slow... I'd be fine to go with
> either option.


I think fail early is the right answer.  Nobody much minds that remove
throws an exception on all of our iterators.

Re: Monstrous matrices...

Posted by Isabel Drost <is...@apache.org>.
On Thu, 9 Sep 2010 Grant Ingersoll <gs...@apache.org> wrote:
> On Sep 8, 2010, at 4:31 PM, Jake Mannix wrote:
> 
> > Hey all,
> >  One question I'm wrestling with is whether we should have
> > disk-backed (local or HDFS) vectors and matrices transparently
> > implement Vector and Matrix, or whether this will lead to confusion
> > and trick people into treating them in the same way they think of
> > in-memory versions (ie think they can do lots of random access
> > operations which just thrash and aren't performant).
> > 
> >  Any thoughts?
> 
> I'd say add the disk backed ones and we'll worry about education
> separately.

+1


> Perhaps it's possible for the vector to keep track of
> thrashing and spit out warnings. Either that or you override the
> random accessors on the file-based ones to throw exceptions so that
> it fails early for users.

The first option gives users more freedom about what to do with the
implementation - while the latter one probably saves us a few mails
about the implementation being sooooo slow... I'd be fine to go with
either option.

Isabel



Re: Monstrous matrices...

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 8, 2010, at 4:31 PM, Jake Mannix wrote:

> Hey all,
>  One question I'm wrestling with is whether we should have disk-backed
> (local or HDFS) vectors and matrices transparently implement Vector and
> Matrix, or whether this will lead to confusion and trick people into
> treating them in the same way they think of in-memory versions (ie think
> they can do lots of random access operations which just thrash and aren't
> performant).
> 
>  Any thoughts?

I'd say add the disk backed ones and we'll worry about education separately.   Perhaps it's possible for the vector to keep track of thrashing and spit out warnings.  Either that or you override the random accessors on the file-based ones to throw exceptions so that it fails early for users.

-Grant