Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/08/29 22:38:40 UTC

SVD Expectations

I'm running SVD as:
./mahout svd --input /tmp/solr-clust-n2/part-out.vec --tempDir /tmp/solr-clust-n2/svdTemp --output /tmp/solr-clust-n2/svdOut --rank 200 --numCols 65458 --numRows  130103
 ./mahout cleansvd --eigenInput /tmp/solr-clust-n2/svdOut --corpusInput /tmp/solr-clust-n2/part-out.vec --output /tmp/solr-clust-n2/svdFinal --maxError 0.1 --minEigenvalue 10.0

part-out.vec is 52 MB.  The output from SVD (svdOut) is 104 MB and largestCleanEigens is 88 MB.  For some reason, this really doesn't feel right.

Is there a guide on interpreting the output of SVD anywhere?  Intuitively, I would expect the output to be a lot smaller; I mean, that's the point, right?

I can share the vector if you want.

-Grant

--------------------------
Grant Ingersoll
http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8


Re: SVD Expectations

Posted by Grant Ingersoll <gs...@apache.org>.
Thanks to all for the explanations.  


--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8


Re: SVD Expectations

Posted by Ted Dunning <te...@gmail.com>.
Like Jake said.


Re: SVD Expectations

Posted by Ted Dunning <te...@gmail.com>.
In particular, since our sparse representation requires an int (4 bytes)
and a double (8 bytes) to store one non-zero entry, while a dense row
requires only 8 bytes per entry, your original data would require less
storage if it has fewer than 200 * 8 / 12 ≈ 133 non-zero entries per row
on average.  Depending on the data-set, this could be very likely or
totally implausible.

SVD is still useful in these cases because it can provide useful smoothing.
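
A quick sketch of that break-even arithmetic, using only the per-entry costs quoted above (12 bytes per sparse non-zero, 8 bytes per dense entry); the class and variable names are purely illustrative:

    // Break-even point for sparse input rows vs. a dense rank-k representation.
    // Per-entry costs from the discussion above: a sparse non-zero stores an
    // int index (4 bytes) plus a double value (8 bytes); a dense entry stores
    // only the 8-byte double.
    public class SvdSizeBreakEven {
      public static void main(String[] args) {
        int k = 200;                        // --rank used in the job
        double sparseBytesPerEntry = 12.0;  // 4-byte int index + 8-byte double
        double denseBytesPerEntry = 8.0;    // double only, no index stored

        // A dense rank-k row costs k * 8 bytes; a sparse row with d non-zeros
        // costs d * 12 bytes, so they break even at d = k * 8 / 12.
        double breakEven = k * denseBytesPerEntry / sparseBytesPerEntry;
        System.out.printf("Sparse rows stay smaller while avg non-zeros per row < %.0f%n",
            breakEven);
      }
    }

For k = 200 that comes out to about 133 non-zeros per row, which is where the figure above comes from.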



Re: SVD Expectations

Posted by Jake Mannix <ja...@gmail.com>.
Grant, Akshay has it right:

  If your N input vectors have an average of d nonzero entries each,
then the size of your input is N * d * 12 bytes (in our case,
with int keys and double values).  The output is the left singular vectors:
k *dense* vectors of size M, where M is your row size (for text:
the size of your dictionary), which is k * M * 8 bytes (dense means you
don't need to store the keys).  If you want to project the original inputs
onto the latent factor vectors, the size of that projection will be
k * N * 8 bytes.

  So, comparing input to output, it's N * d vs. N * k.  In general,
these could be of the same order of size, unless k (the reduced
rank) is small or d (the document size, roughly) is large (more than a
couple hundred or a thousand unique terms per document).

  In short: SVD should not be thought of as "compression" in most cases.
Reduced dimensionality means a smaller basis you can use, but it's dense
now, so documents don't necessarily get "reduced".  In fact, projecting
individual terms onto the SVD basis *inflates* them from size O(1) to size
O(k).

  -jake
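
As a rough sanity check of those formulas against the numbers reported in this thread (rank 200, --numCols 65458, --numRows 130103, a 52 MB input and a 104 MB svdOut), here is a small estimate that ignores SequenceFile framing and key overhead; the class name is illustrative only:

    // Plugging the thread's numbers into the size formulas above.
    public class SvdSizeEstimate {
      public static void main(String[] args) {
        long k = 200, cols = 65458, rows = 130103;
        double mb = 1024.0 * 1024.0;

        // k dense basis vectors of length cols (the dictionary size), 8 bytes each.
        double basisMb = k * cols * 8 / mb;           // roughly 100 MB
        // Projecting every document onto the k factors: rows * k dense entries.
        double projectedMb = rows * k * 8 / mb;       // roughly 199 MB
        // The 52 MB sparse input at 12 bytes per non-zero implies the average
        // number of non-zero terms per document, d.
        double avgNonZeros = 52 * mb / (rows * 12.0); // roughly 35

        System.out.printf("basis ~%.0f MB, projected corpus ~%.0f MB, avg d ~%.0f%n",
            basisMb, projectedMb, avgNonZeros);
      }
    }

The ~100 MB basis estimate lines up with the 104 MB svdOut reported above, and an average of roughly 35 non-zeros per row is well under the ~133 break-even point, which is why the dense output ends up roughly twice the size of the sparse input here.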


Re: SVD Expectations

Posted by Akshay Bhat <ak...@gmail.com>.
Even though the SVD is supposed to reduce dimensionality, it does not mean
that your results will have a smaller size [in terms of memory], since U, S
and V are dense matrices, unless you are using very few eigenvectors.  Your
input matrix is sparse; had it been represented as a dense matrix it would
have a far larger size.




-- 
Akshay Uday Bhat.
Graduate Student, Computer Science, Cornell University
Website: http://www.akshaybhat.com

Re: SVD Expectations

Posted by Grant Ingersoll <gs...@apache.org>.
Should be noted that cranking the rank down to 20 produces a significantly smaller result.
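
For scale, the dense-output arithmetic discussed elsewhere in this thread quantifies that drop: the eigenvector output grows linearly with the rank, so going from 200 to 20 should shrink it by about a factor of ten. A minimal sketch, reusing --numCols 65458 from the original command (the class name is illustrative only):

    // Approximate dense eigenvector output size at rank 200 vs. rank 20,
    // assuming 8 bytes per entry and 65458 columns as in the original job.
    public class RankSizeComparison {
      public static void main(String[] args) {
        long cols = 65458;
        double mb = 1024.0 * 1024.0;
        System.out.printf("rank 200: ~%.0f MB   rank 20: ~%.0f MB%n",
            200 * cols * 8 / mb, 20 * cols * 8 / mb);
      }
    }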



--------------------------
Grant Ingersoll
http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8