Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2010/08/29 22:38:40 UTC
SVD Expectations
I'm running SVD as:
./mahout svd --input /tmp/solr-clust-n2/part-out.vec --tempDir /tmp/solr-clust-n2/svdTemp --output /tmp/solr-clust-n2/svdOut --rank 200 --numCols 65458 --numRows 130103
./mahout cleansvd --eigenInput /tmp/solr-clust-n2/svdOut --corpusInput /tmp/solr-clust-n2/part-out.vec --output /tmp/solr-clust-n2/svdFinal --maxError 0.1 --minEigenvalue 10.0
part-out.vec is 52 MB. The output from SVD (svdOut) is 104 MB and largestCleanEigens is 88 MB. For some reason, this really doesn't feel right.
Is there a guide on interpreting the output of SVD anywhere? Intuitively, I believe the output should be a lot smaller? I mean that's the point, right?
I can share the vector if you want.
-Grant
--------------------------
Grant Ingersoll
http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
Re: SVD Expectations
Posted by Grant Ingersoll <gs...@apache.org>.
Thanks to all for the explanations.
Re: SVD Expectations
Posted by Ted Dunning <te...@gmail.com>.
Like Jake said.
Re: SVD Expectations
Posted by Ted Dunning <te...@gmail.com>.
In particular, our sparse representation requires an int (4 bytes) and a
double (8 bytes), 12 bytes in all, to store one non-zero entry, while a dense
row requires only 8 bytes per entry. At rank 200, your original data would
therefore require less storage than the dense output if it has fewer than
200 * 8 / 12 = 133 non-zero entries per row on average. Depending on the
data set, this could be very likely or totally implausible.
SVD is still useful in these cases because it can provide useful smoothing.
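The break-even arithmetic above can be sketched in a few lines of Python. This is only an illustration of the 200 * 8 / 12 calculation; the function name and constants are made up for the example and are not part of Mahout.

```python
# Per-entry storage costs quoted in the message above.
SPARSE_BYTES_PER_ENTRY = 4 + 8  # int index + double value
DENSE_BYTES_PER_ENTRY = 8       # double value only, no index needed


def break_even_nonzeros(rank):
    """Average non-zeros per row at which a sparse input row costs the
    same as a dense row of length `rank` (hypothetical helper)."""
    return rank * DENSE_BYTES_PER_ENTRY / SPARSE_BYTES_PER_ENTRY


# Below this many non-zeros per row, the sparse input is the smaller one.
print(break_even_nonzeros(200))
print(break_even_nonzeros(20))
```

With rank 200 the threshold is about 133 non-zeros per row, and with rank 20 it drops to about 13, so a lower rank makes it much easier for the dense output to come out smaller than the input.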
Re: SVD Expectations
Posted by Jake Mannix <ja...@gmail.com>.
Grant, Akshay has it right:
If your N input vectors have an average of "d" nonzero entries each, then
the size of your input is N * d * 12 bytes (in our case, with int keys and
double values). The output is the left singular vectors, which are k *dense*
vectors of size M, where M is your row size (for text: the size of your
dictionary), so the output takes k * M * 8 bytes (dense means you don't need
to store the keys). If you want to project the original inputs onto the
latent factor vectors, the size of that projection will be k * N * 8 bytes.
So, comparing the input to the projected output, it's N * d vs. N * k.
These could be of the same order of size, unless k (the reduced rank) is
small, or d (the document size, roughly) is large (more than a couple
hundred or a thousand unique terms per document).
In short: SVD should not be thought of as "compression" in most cases.
Reduced dimensionality means a smaller basis you can use, but it's dense
now, so documents don't necessarily get "reduced". In fact, projecting
individual terms onto the SVD basis *inflates* them from size O(1) to size
O(k).
-jake
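As a rough check, the size formulas above can be applied to the numbers from the original commands (--numRows 130103, --numCols 65458, --rank 200). This is a sketch under stated assumptions: the helper names are hypothetical, 1 MB is taken as 10^6 bytes, and file-format overhead is ignored, so the results are estimates rather than exact file sizes.

```python
SPARSE_ENTRY_BYTES = 12  # int key + double value
DENSE_ENTRY_BYTES = 8    # double value only


def basis_bytes(rank, num_terms):
    """k dense singular vectors of length M: k * M * 8 bytes."""
    return rank * num_terms * DENSE_ENTRY_BYTES


def projected_bytes(rank, num_docs):
    """N documents projected onto k factors: k * N * 8 bytes."""
    return rank * num_docs * DENSE_ENTRY_BYTES


def implied_avg_nonzeros(input_bytes, num_docs):
    """Solve N * d * 12 = input_bytes for d."""
    return input_bytes / (num_docs * SPARSE_ENTRY_BYTES)


N, M = 130103, 65458  # --numRows and --numCols from the thread

print(basis_bytes(200, M))               # 104732800 bytes
print(basis_bytes(20, M))                # 10473280 bytes
print(projected_bytes(200, N))           # 208164800 bytes
print(implied_avg_nonzeros(52e6, N))     # estimated d for a 52 MB input
```

Under these assumptions the rank-200 basis comes to about 104.7 MB, in line with the 104 MB reported for svdOut, and dropping to rank 20 shrinks it tenfold. The 52 MB input implies only a few dozen non-zeros per document, well under the rank, which is why the dense output is the larger of the two.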
Re: SVD Expectations
Posted by Akshay Bhat <ak...@gmail.com>.
Even though SVD is supposed to reduce dimensionality, that does not mean
your results will be smaller [in terms of memory], since U, S and V are
dense matrices, unless you keep only a few eigenvectors. Your input matrix
is sparse; had it been represented as a dense matrix, it would have been
far larger.
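To put a number on "far larger", here is a small sketch of how big the input would be as a dense matrix of doubles, using the dimensions from the original commands. The function name is hypothetical, and the 52 MB figure is treated as 52 * 10^6 bytes for the comparison.

```python
def dense_matrix_bytes(num_rows, num_cols, bytes_per_value=8):
    """Storage for a dense matrix with one double per cell."""
    return num_rows * num_cols * bytes_per_value


dense = dense_matrix_bytes(130103, 65458)
sparse_reported = 52e6  # size of part-out.vec as reported in the thread

print(dense)                    # 68130257392 bytes, roughly 68 GB
print(dense / sparse_reported)  # how many times larger than the sparse file
```

So the sparse file is more than a thousand times smaller than a dense copy of the same matrix, while the rank-200 SVD output is far smaller than that dense copy even though it is larger than the sparse file.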
--
Akshay Uday Bhat.
Graduate Student, Computer Science, Cornell University
Website: http://www.akshaybhat.com
Re: SVD Expectations
Posted by Grant Ingersoll <gs...@apache.org>.
It should be noted that cranking the rank down to 20 produces a significantly smaller result.