Posted to user@mahout.apache.org by Eshwaran Vijaya Kumar <ev...@mozilla.com> on 2011/06/04 01:48:48 UTC

Computing SVD Of "Large Sparse Data"

Hello all,  
  We are trying to build a clustering system that will have an SVD component. I believe Mahout has two SVD solvers: DistributedLanczosSolver and SSVD. Could someone give me some tips on which would be the better choice, given that the data will be roughly 100 million rows, each with roughly 50K dimensions (100 million x 50,000)? We will be working with text data, so the matrix should be relatively sparse to begin with.

Thanks
Eshwaran

Re: Computing SVD Of "Large Sparse Data"

Posted by Hector Yee <he...@gmail.com>.
If it's very sparse, you can try
https://issues.apache.org/jira/browse/MAHOUT-703

Instead of minimizing reconstruction error, it tries to enforce that the
words in your document rank higher than words that are not present in it.
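
To make that concrete, here is a toy sketch of that style of pairwise
ranking update (all names hypothetical; this is not the MAHOUT-703 code):
sample a word absent from the document and apply a hinge update whenever it
scores within a margin of a word that is present.

/** Toy pairwise ranking update: push words present in a document to
 *  score above sampled absent words by a margin. Hypothetical sketch. */
public class RankingSketch {
  static final int DIM = 20;          // latent dimension
  static final double MARGIN = 1.0;   // hinge margin
  static final double RATE = 0.01;    // learning rate

  // dot-product score of a word embedding against a document embedding
  static double score(double[] doc, double[] word) {
    double s = 0;
    for (int i = 0; i < DIM; i++) s += doc[i] * word[i];
    return s;
  }

  /** One SGD step on (document, present word, sampled absent word). */
  static void update(double[] doc, double[] present, double[] absent) {
    double loss = MARGIN - score(doc, present) + score(doc, absent);
    if (loss > 0) {                   // violation: absent word ranks too high
      for (int i = 0; i < DIM; i++) {
        present[i] += RATE * doc[i];  // pull the present word up
        absent[i]  -= RATE * doc[i];  // push the absent word down
      }
    }
  }
}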

Example of some results from this approach:

https://docs.google.com/present/edit?id=0AQC247eq7Jp5ZGZ6NXpyOWhfMjlmM2pzdjRkZw&authkey=CNj2h98P&hl=en_US


-- 
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)

Re: Computing SVD Of "Large Sparse Data"

Posted by Jake Mannix <ja...@gmail.com>.
The overall number of columns matters to Lanczos - too many and you run out
of RAM.
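
Back-of-envelope (my own arithmetic, not Mahout code): each retained basis
vector is dense, one double per column, so the basis alone costs roughly
rank x numCols x 8 bytes.

/** Rough memory estimate for the dense Lanczos basis. */
public class LanczosMemory {
  public static void main(String[] args) {
    long numCols = 50_000;  // dimensionality from the original question
    long rank = 100;        // desired number of singular vectors
    long bytes = rank * numCols * 8;
    System.out.printf("basis ~ %.1f MB%n", bytes / 1e6);
    // 100 x 50,000 x 8 bytes is about 40 MB: comfortable. At 5M columns
    // the same basis would be about 4 GB, which is where RAM runs out.
  }
}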

  -jake


Re: Computing SVD Of "Large Sparse Data"

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
What you probably really need to worry about is not the number of
dimensions but the average number of non-zero elements per row
(density). How dense is the data?
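
Rough arithmetic (hypothetical numbers; assuming about 100 non-zeros per
row for text): each multiply by A or A^T touches every stored non-zero
once, so per-iteration cost tracks total nnz rather than rows x columns.

/** Why density, not dimension, drives the cost of each iteration. */
public class DensityEstimate {
  public static void main(String[] args) {
    long rows = 100_000_000L;  // 100M rows from the question
    long avgNnzPerRow = 100;   // assumed density for text data
    long nnz = rows * avgNnzPerRow;
    // roughly 2 flops per stored non-zero per matrix-vector product
    System.out.printf("~%.1e flops per A*v product%n", 2.0 * nnz);
    // A dense pass would touch rows * 50,000 entries instead,
    // about 500x more work at this density.
  }
}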



On Fri, Jun 3, 2011 at 4:48 PM, Eshwaran Vijaya Kumar
<ev...@mozilla.com> wrote:
> Hello all,
>  We are trying to build a clustering system which will have an SVD component. I believe Mahout has two SVD solvers: DistributedLanczosSolver and SSVD. Could someone give me some tips on which would be a better choice of a solver given that the size of the data will be roughly 100 million rows with each row having roughly 50 K dimensions (100 million X 50000 ). We will be working with text data so the resultant matrix should be relatively sparse to begin with.
>
> Thanks
> Eshwaran

Re: Computing SVD Of "Large Sparse Data"

Posted by Ted Dunning <te...@gmail.com>.
I would encourage you to take a stab at a patch on this.  You aren't
the only person to have expressed interest in scaling PCA, but you
aren't a member of a large horde, either.


Re: Computing SVD Of "Large Sparse Data"

Posted by Eshwaran Vijaya Kumar <ev...@mozilla.com>.
Thanks Ted. That is good news.


Re: Computing SVD Of "Large Sparse Data"

Posted by Ted Dunning <te...@gmail.com>.
I think that incorporating mean subtraction into the SSVD code should
be relatively straightforward.  The trick is that you have to project
the original matrix and the mean separately and then combine the
results.  There are later steps that require similar mods, but since
it is all based on a simple product that can always be factored, it
should be straightforward.
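
Concretely, the product factors as (A - 1 m^T) Omega = A Omega - 1 (m^T Omega),
so the dense centered matrix never needs to exist. A minimal toy sketch of
that step (hypothetical names and plain arrays; this is not the actual SSVD
code):

/** Sketch of projecting a mean-centered matrix without densifying it:
 *  (A - 1 m^T) Omega = A Omega - 1 (m^T Omega).
 *  A is kept as sparse rows; only the small k-wide results are dense. */
public class CenteredProjection {
  // idx[r]/val[r]: column indices and values of row r of sparse A;
  // m: column-mean vector; omega: the random n x k projection matrix.
  static double[][] project(int[][] idx, double[][] val,
                            double[] m, double[][] omega) {
    int n = idx.length, k = omega[0].length;
    double[] mtOmega = new double[k];          // m^T Omega, computed once
    for (int c = 0; c < m.length; c++)
      for (int j = 0; j < k; j++)
        mtOmega[j] += m[c] * omega[c][j];
    double[][] y = new double[n][k];
    for (int r = 0; r < n; r++) {
      for (int t = 0; t < idx[r].length; t++)  // sparse A Omega
        for (int j = 0; j < k; j++)
          y[r][j] += val[r][t] * omega[idx[r][t]][j];
      for (int j = 0; j < k; j++)              // subtract 1 (m^T Omega)
        y[r][j] -= mtOmega[j];
    }
    return y;
  }
}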


Re: Computing SVD Of "Large Sparse Data"

Posted by Eshwaran Vijaya Kumar <ev...@mozilla.com>.
One of the algorithms we would like to try out is PCA, where it seems best to avoid explicitly subtracting the mean matrix (m m^T) from the data matrix, say A, so as not to destroy sparsity (refer: http://search.lucidimagination.com/search/document/cd4c36c2f27080d/regarding_pca_implementation#2eae2e2861213ae0 ). From what I understand of the Lanczos algorithm, it shouldn't be too hard to modify the solver code so that I can pass A and m m^T separately, without combining them into a single matrix, and then do repeated multiplications. Unfortunately, I have not yet had time to look at SSVD, so it would be extremely helpful if someone who has looked at the problem more closely could comment on how to make these (potential?) modifications to the SSVD code and avoid having to deal with dense matrices.
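
Roughly, what I have in mind is an implicit operator along the lines of
this untested sketch (hypothetical names, plain arrays): apply
(A^T A - c m m^T) v as A^T(A v) - c (m^T v) m, so A stays sparse throughout.

/** Untested sketch: multiply by the centered Gram matrix
 *  (A^T A - c * m m^T) implicitly; the solver only ever needs
 *  this matrix-vector product, and A stays sparse. */
public class ImplicitCenteredGram {
  // idx[r]/val[r] hold the column indices and values of row r of A;
  // m is the column-mean vector, c the scaling constant, v the input.
  static double[] apply(int[][] idx, double[][] val,
                        double[] m, double c, double[] v) {
    double[] av = new double[idx.length];      // av = A v
    for (int r = 0; r < idx.length; r++)
      for (int t = 0; t < idx[r].length; t++)
        av[r] += val[r][t] * v[idx[r][t]];
    double[] out = new double[m.length];       // out = A^T (A v)
    for (int r = 0; r < idx.length; r++)
      for (int t = 0; t < idx[r].length; t++)
        out[idx[r][t]] += val[r][t] * av[r];
    double mv = 0;                             // mv = m^T v (a scalar)
    for (int j = 0; j < m.length; j++) mv += m[j] * v[j];
    for (int j = 0; j < m.length; j++) out[j] -= c * mv * m[j];
    return out;
  }
}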


Thanks in advance
Eshwaran





Re: Computing SVD Of "Large Sparse Data"

Posted by Ted Dunning <te...@gmail.com>.
I would push for SSVD as well if you want a real SVD.

Also, I don't think that you lose information about which vectors are which
(or, as Jake put it, "what they mean").  The stochastic decomposition gives a
very accurate estimate of the top-k singular vectors.  It does this by using
the random projection to project the top singular vectors into a subspace
and then correcting the results back into the original space.  This is not
the same as simply doing the decomposition on the random projection and then
using that decomposition.
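
For reference, the projection-and-correction sequence is, in the usual
stochastic SVD notation (with $\Omega$ a random $n \times (k+p)$ matrix):

$$ Y = A\Omega, \qquad Y = QR, \qquad B = Q^{\top}A, \qquad B = \hat{U}\Sigma V^{\top}, \qquad U \approx Q\hat{U} $$

The step $B = Q^{\top}A$ and the final product $Q\hat{U}$ are the correction
back into the original space; decomposing $A\Omega$ by itself and stopping
there would skip it.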


Re: Computing SVD Of "Large Sparse Data"

Posted by Eshwaran Vijaya Kumar <ev...@mozilla.com>.
Hi Jake,
  Thank you for your reply. Good to know that we can use Lanczos. I will have to look into the SSVD algorithm more closely to figure out whether the information loss is worth the gain in speed (and computational efficiency). I guess we will have to run more tests to see which works best and decide which path to take.


Esh



Re: Computing SVD Of "Large Sparse Data"

Posted by Jake Mannix <ja...@gmail.com>.
With 50k columns, you're well within the "sweet spot" for traditional SVD
via Lanczos, so give it a try.

SSVD will probably run faster, but you lose some information on what the
singular vectors "mean".  If you don't need this information, SSVD may be
better for you.

What would be awesome for *us* is if you tried both and told us what you
found, in terms of performance and relevance.  :)
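
If you do, one simple relevance check (a hypothetical sketch, not anything
in Mahout) is the agreement between matched singular vectors from the two
runs; they should match up to sign, so |cosine| near 1.0 means the solvers
agree.

/** Compare matched singular vectors from two solvers by |cosine|. */
public class CompareBases {
  static double absCosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return Math.abs(dot) / Math.sqrt(na * nb);
  }
}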

  -jake
