
Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/07/05 23:59:21 UTC

SVD and input args

Trying out SVD for the first time and trying to make sense of the parameters...

Am I missing a more obvious way to get the number of rows to give to SVD than to iterate through the whole sequence file of vectors and count them up?  Assuming a sufficiently large vector file, don't I need a M/R job to do this?  Likewise, one would have to do this for the --numCols as well, right?  In reality, I suppose it would be useful to have a utility that checked to make sure all the vectors in a file are the same cardinality, right?

Just trying to get my head around the practical side of running SVD.


Thanks,
Grant

Re: SVD and input args

Posted by Jake Mannix <ja...@gmail.com>.

On Tue, Jul 6, 2010 at 1:17 PM, Grant Ingersoll <gs...@apache.org> wrote:
>
> Hmm, I was looking at the code and it is passed into DistributedMatrix,
> etc., so it seemed like it was needed.
>

Yep, that's why I made sure it was required.  It just turns out that nothing
in the Lanczos code ever uses the numRows value of said DistributedRowMatrix -
it only computes "timesSquared(Vector)", which uses the column space.  Unless
you're doing a symmetric matrix, in which case it does times(Vector).  But if
you've got a symmetric matrix, and you know the dim of the columns, you also
know numRows. :)
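
To make the point concrete, here is a minimal sketch of the product that
"timesSquared(Vector)" stands for, A^T (A x), computed by streaming over the
rows.  It is plain Python over in-memory lists, not Mahout's
DistributedRowMatrix, and the name times_squared is only illustrative.  The
row count never appears; only the column dimension len(x) is needed.

```python
def times_squared(rows, x):
    """Compute A^T (A x) by streaming over the rows of A.

    `rows` is any iterable of row vectors (hypothetical stand-in for the
    vectors in a sequence file).  Its length is never asked for.
    """
    result = [0.0] * len(x)
    for row in rows:
        if len(row) != len(x):
            # A wrong numCols surfaces here, with the true cardinality.
            raise ValueError(
                f"row has cardinality {len(row)}, expected {len(x)}"
            )
        dot = sum(a * b for a, b in zip(row, x))  # (A x)_i for this row
        for j, a in enumerate(row):
            result[j] += dot * a                  # accumulate A^T (A x)
    return result
```

Because each row contributes independently, the same loop works whether the
input has ten rows or ten billion, which is why a wrong numRows goes unnoticed.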

> > Glad to see some more other committers playing with the SVD code finally - I
> > should have pretended I left those hacks in on purpose specifically to see
> > when y'all would use it and mention how horrible it was. :P
> >
>
> Your hacks beat my non-existent SVD code!
>

Heh, sure sure.  Still, I really do need to clean up these hacks one of
these days...

  -jake

Re: SVD and input args

Posted by Grant Ingersoll <gs...@apache.org>.

On Jul 6, 2010, at 2:24 AM, Jake Mannix wrote:

> It turns out that the number of rows isn't actually used in the SVD code at
> all (you can put in any number for this parameter), but this is an artifact
> of the particular choice of spitting out only the right singular vectors.
> NumCols is indeed necessary, but there's an ugly trick to figure it out
> too: run it with numCols = anything, and the first time you run, you'll get
> an exception which tells you what the cardinality of the vectors is.  This
> is the true numCols to use.
> 
> This should probably be fixed, as this is ugly as sin.  Easy fix is: remove
> numRows (add back when they become necessary, if ever), and make numCols
> optional, calculating it on the fly by fetching the first chunk of the
> SequenceFile from HDFS and finding out the dim of the vector.

Hmm, I was looking at the code and it is passed into DistributedMatrix, etc., so it seemed like it was needed.

> 
> Glad to see some more other committers playing with the SVD code finally - I
> should have pretended I left those hacks in on purpose specifically to see
> when y'all would use it and mention how horrible it was. :P
> 

Your hacks beat my non-existent SVD code!

>  -jake
> 
> On Mon, Jul 5, 2010 at 11:59 PM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>> Trying out SVD for the first time and trying to make sense of the
>> parameters...
>> 
>> Am I missing a more obvious way to get the number of rows to give to SVD
>> than to iterate through the whole sequence file of vectors and count them
>> up?  Assuming a sufficiently large vector file, don't I need a M/R job to do
>> this?  Likewise, one would have to do this for the --numCols as well, right?
>> In reality, I suppose it would be useful to have a utility that checked to
>> make sure all the vectors in a file are the same cardinality, right?
>> 
>> Just trying to get my head around the practical side of running SVD.
>> 
>> 
>> Thanks,
>> Grant

Re: SVD and input args

Posted by Jake Mannix <ja...@gmail.com>.

It turns out that the number of rows isn't actually used in the SVD code at
all (you can put in any number for this parameter), but this is an artifact
of the particular choice of spitting out only the right singular vectors.
NumCols is indeed necessary, but there's an ugly trick to figure it out
too: run it with numCols = anything, and the first time you run, you'll get
an exception which tells you what the cardinality of the vectors is.  This
is the true numCols to use.

This should probably be fixed, as this is ugly as sin.  Easy fix is: remove
numRows (add it back if it ever becomes necessary), and make numCols
optional, calculating it on the fly by fetching the first chunk of the
SequenceFile from HDFS and finding out the dim of the first vector.
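
A rough sketch of that "peek at the first vector" idea, assuming the input
can be treated as a plain Python iterator of vectors.  The function name
infer_num_cols is hypothetical and nothing here touches the real
SequenceFile or HDFS APIs; it only shows that one record is enough, and
that the record can be handed back so the main pass still sees every row.

```python
from itertools import chain


def infer_num_cols(vectors):
    """Return (num_cols, vectors), reading only the first record.

    `vectors` is any iterable of row vectors.  The returned iterator
    replays the first vector, so no row is lost to the peek.
    """
    it = iter(vectors)
    try:
        first = next(it)
    except StopIteration:
        raise ValueError("input contains no vectors") from None
    return len(first), chain([first], it)
```

Usage would look like `num_cols, rows = infer_num_cols(rows)` before the
solver starts.  It trusts the first vector; it does not check that the
rest agree.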

Glad to see some more other committers playing with the SVD code finally - I
should have pretended I left those hacks in on purpose specifically to see
when y'all would use it and mention how horrible it was. :P

  -jake

On Mon, Jul 5, 2010 at 11:59 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Trying out SVD for the first time and trying to make sense of the
> parameters...
>
> Am I missing a more obvious way to get the number of rows to give to SVD
> than to iterate through the whole sequence file of vectors and count them
> up?  Assuming a sufficiently large vector file, don't I need a M/R job to do
> this?  Likewise, one would have to do this for the --numCols as well, right?
>  In reality, I suppose it would be useful to have a utility that checked to
> make sure all the vectors in a file are the same cardinality, right?
>
> Just trying to get my head around the practical side of running SVD.
>
>
> Thanks,
> Grant

Re: SVD and input args

Posted by Ted Dunning <te...@gmail.com>.

It scales better than producing the vectors does!

Seriously, whatever is producing the vectors can easily produce counts, even
if there are many counts.  The SVD driver code can read and summarize many,
many counts in essentially zero time.
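
As a sketch of what "read and summarize many counts" could look like:
assume each task that writes vectors also records a small
(num_rows, cardinality) pair for its output part.  That bookkeeping format
and the name summarize_counts are assumptions for illustration, not
something the vectorizer or the SVD driver is known to do.

```python
def summarize_counts(part_counts):
    """Combine per-part (num_rows, cardinality) pairs into (numRows, numCols).

    Raises ValueError if there are no parts or if the parts disagree on
    the vector cardinality.
    """
    total_rows = 0
    cardinalities = set()
    for num_rows, cardinality in part_counts:
        total_rows += num_rows
        cardinalities.add(cardinality)
    if len(cardinalities) != 1:
        raise ValueError(
            f"expected exactly one cardinality, found {sorted(cardinalities)}"
        )
    return total_rows, cardinalities.pop()
```

The work is proportional to the number of output parts rather than the
number of vectors, and the cardinality check comes along for free.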

On Mon, Jul 5, 2010 at 4:46 PM, Grant Ingersoll <gs...@apache.org> wrote:

> > Yes and no.  The number of rows should be the number of documents you
> > vectorized.  The number of columns should be the number of distinct terms
> > that you observed in vectorizing.  Both should be pretty easily available.
>
> Yeah, I can count the rows w/ the VectorDumper, but that doesn't really
> scale.

Re: SVD and input args

Posted by Grant Ingersoll <gs...@apache.org>.

On Jul 5, 2010, at 7:14 PM, Ted Dunning wrote:

> On Mon, Jul 5, 2010 at 2:59 PM, Grant Ingersoll <gs...@apache.org> wrote:
> 
>> Trying out SVD for the first time and trying to make sense of the
>> parameters...
>> 
>> Am I missing a more obvious way to get the number of rows to give to SVD
>> than to iterate through the whole sequence file of vectors and count them
>> up?
> 
> 
> Pretty much.  But you can also integrate that task into the production of
> the vectors.
> 
> 
>> Assuming a sufficiently large vector file, don't I need a M/R job to do
>> this?  Likewise, one would have to do this for the --numCols as well, right?
>> In reality, I suppose it would be useful to have a utility that checked to
>> make sure all the vectors in a file are the same cardinality, right?
>> 
> 
> Yes and no.  The number of rows should be the number of documents you
> vectorized.  The number of columns should be the number of distinct terms
> that you observed in vectorizing.  Both should be pretty easily available.

Yeah, I can count the rows w/ the VectorDumper, but that doesn't really scale.  Just wondering if I was missing some tool that people are using.

Re: SVD and input args

Posted by Ted Dunning <te...@gmail.com>.

On Mon, Jul 5, 2010 at 2:59 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Trying out SVD for the first time and trying to make sense of the
> parameters...
>
> Am I missing a more obvious way to get the number of rows to give to SVD
> than to iterate through the whole sequence file of vectors and count them
> up?

Pretty much.  But you can also integrate that task into the production of
the vectors.

> Assuming a sufficiently large vector file, don't I need a M/R job to do
> this?  Likewise, one would have to do this for the --numCols as well, right?
>  In reality, I suppose it would be useful to have a utility that checked to
> make sure all the vectors in a file are the same cardinality, right?
>

Yes and no.  The number of rows should be the number of documents you
vectorized.  The number of columns should be the number of distinct terms
that you observed in vectorizing.  Both should be pretty easily available.
With sparse vectors, we don't care quite as much about the size of the
vector and often set it to a "large enough" value.

The other major approach is to use random projection to get fixed length
vectors of known and predetermined size out.  This is the strategy I use in
the SGD code and it makes a lot of things much, much easier because you can
set the cardinality of the vectors involved ahead of time.  It makes
converting a vector back into terms much harder, though.
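
One simple way to get vectors of a predetermined size is feature hashing,
sketched below as a stand-in for the approach described above.  Whether
this matches the encoding in the SGD code is an assumption; the function
name hashed_vector and the choice of MD5 as the hash are illustrative only.

```python
import hashlib


def hashed_vector(terms, cardinality):
    """Encode a bag of terms as a fixed-length vector by hashing each term.

    The cardinality is chosen up front, so it is known before any data
    is read.  Distinct terms can collide in the same slot.
    """
    vec = [0.0] * cardinality
    for term in terms:
        digest = hashlib.md5(term.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % cardinality
        vec[index] += 1.0
    return vec
```

The collisions are the price: a slot records that some term hashed there,
not which one, so going from a vector back to terms needs extra
bookkeeping.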