Posted to user@mahout.apache.org by Akshay Bhat <ak...@gmail.com> on 2010/09/05 02:08:22 UTC

Regarding the scalability of SVD code in Mahout

Hello,
Has anyone attempted SVD of a really large matrix (~40 million rows and
columns, to be specific) using Mahout?
I am planning to perform SVD with Mahout on the Twitter follower network (it
contains information about ~35 million users following ~45 million users,
http://an.kaist.ac.kr/traces/WWW2010.html ), and I should have access to the
Cornell Hadoop cluster (55 quad-core nodes with 16-18 GB RAM per node). Can
anyone estimate how long the job will run?
Also, is it possible to perform regularized SVD, or will I need to add that
functionality by modifying the code?
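For context, my plan is to write each user's follow list as one row of a
SequenceFile of (IntWritable, VectorWritable) pairs, which I understand is the
row-matrix format Mahout's distributed SVD works on. Something like the toy
sketch below (untested; the class names are from the current Mahout trunk and
the path is just a placeholder):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.VectorWritable;

    /** Toy sketch: one sparse row per follower, a 1.0 in each followed column. */
    public class FollowerMatrixWriter {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("follower-matrix/part-00000"); // placeholder path
        int numCols = 47000000;                            // ~47M followed users

        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, IntWritable.class, VectorWritable.class);
        try {
          // Real rows would come from the edge list; two toy rows for illustration.
          int[][] follows = { {3, 17, 42}, {1, 42} };
          for (int row = 0; row < follows.length; row++) {
            RandomAccessSparseVector v = new RandomAccessSparseVector(numCols);
            for (int col : follows[row]) {
              v.set(col, 1.0); // user 'row' follows user 'col'
            }
            writer.append(new IntWritable(row), new VectorWritable(v));
          }
        } finally {
          writer.close();
        }
      }
    }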
Thank you


-- 
Akshay Uday Bhat.
Graduate Student, Computer Science, Cornell University
Website: http://www.akshaybhat.com

Re: Regarding the scalability of SVD code in Mahout

Posted by Jake Mannix <ja...@gmail.com>.
I think that the data set Akshay is referring to actually has about 1.5B
nonzero entries in total.

The primary issue with running the distributed SVD code on this data set is
the numColumns issue.  Required memory on the driving box (not the Hadoop
nodes) will be (as mentioned in a recent dev@ thread) roughly

   numFactors * numCols * 16 bytes.

That would be a lot of GB on the server node, at present.

There is a simple tweak which could bring that factor of 16 down to an 8,
but beyond that, more significant refactoring is probably needed.
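To put rough numbers on that formula (a back-of-the-envelope sketch only; the
47M columns and 50-200 factors are the figures quoted elsewhere in this
thread):

    public class DriverMemoryEstimate {
      public static void main(String[] args) {
        long numCols = 47000000L;        // ~47M columns (followed users)
        int[] ranks = {50, 200};         // desired number of singular vectors
        for (int rank : ranks) {
          double gbNow   = rank * (double) numCols * 16 / (1L << 30); // ~16 bytes per entry today
          double gbTweak = rank * (double) numCols *  8 / (1L << 30); // ~8 bytes after the tweak
          System.out.printf("rank %d: ~%.0f GB now, ~%.0f GB with the 8-byte tweak%n",
              rank, gbNow, gbTweak);
        }
      }
    }

So 50 factors already means roughly 35 GB on the driver, and 200 factors
roughly 140 GB, which is why the driving box, not the Hadoop nodes, is the
constraint here.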

But try it, and see! :)

  -jake


On Tue, Sep 7, 2010 at 6:50 PM, Ted Dunning <te...@gmail.com> wrote:

> Just to cross-check, is it true that your data has 35 x 100 million
> non-zeros in it?
>
> On Tue, Sep 7, 2010 at 6:16 PM, Akshay Bhat <ak...@gmail.com> wrote:
>
> > > - the total number of non-zero elements.  This drives the scan time
> > > and, to some extent, the cost of the multiplies.
> > >
> > The total number of non-zero elements is small, since most Twitter users
> > follow around 100 other users on average.
> >
> > ...
> > > - the number of rows in the original matrix.  This is a secondary
> > > factor that can drive some intermediate products in the random
> > > projection.
> > >
> > The number of rows is around 35 million.
>

Re: Regarding the scalability of SVD code in Mahout

Posted by Ted Dunning <te...@gmail.com>.
Just to cross-check, is it true that your data has 35 x 100 million
non-zeros in it?

On Tue, Sep 7, 2010 at 6:16 PM, Akshay Bhat <ak...@gmail.com> wrote:

> > - the total number of non-zero elements.  This drives the scan time
> > and, to some extent, the cost of the multiplies.
> >
> The total number of non-zero elements is small, since most Twitter users
> follow around 100 other users on average.
>
> ...
> > - the number of rows in the original matrix.  This is a secondary factor
> > that can drive some intermediate products in the random projection.
> >
> The number of rows is around 35 million.

Re: Regarding the scalability of SVD code in Mahout

Posted by Akshay Bhat <ak...@gmail.com>.
Thanks Ted,

On Sun, Sep 5, 2010 at 2:05 AM, Ted Dunning <te...@gmail.com> wrote:

> I don't think anybody has done anything on quite that scale, though Jake
> may
> have come relatively close.
>
> There are several scaling limits.  These include:
>
> - the total number of non-zero elements.  This drives the scan time and,
> to some extent, the cost of the multiplies.
>
The total number of non-zero elements is small, since most Twitter users
follow around 100 other users on average.


> - the total number of singular vectors desired.  This directly drives the
> number of iterations in the Hebbian approach and drives the size of
> intermediate products in the random projection techniques.  It also causes
> product scaling with the next factor.
>
I plan to compute around 50-200 singular vectors.

>
> - the number of columns in the original matrix.  This, multiplied by the
> number of singular vectors, drives the memory cost of some approaches in
> the final step or in the SVD step for the random projection.
>
The number of columns in the matrix is ~47 million.


> - the number of rows in the original matrix.  This is a secondary factor
> that can drive some intermediate products in the random projection.
>
The number of rows is around 35 million.


> Which of these will hang you up in your problem is an open question.  There
> is always the factor I haven't thought about yet.
>
> Jake, do you have any thoughts on this?
>
>
I believe the Twitter data set would be a good stress test for the SVD
algorithm. I should hopefully get access to the cluster by next week.

On Sat, Sep 4, 2010 at 5:08 PM, Akshay Bhat <ak...@gmail.com> wrote:
>
> > Hello,
> > Has anyone attempted SVD of a really large matrix (~40 million rows and
> > columns, to be specific) using Mahout?
> > I am planning to perform SVD with Mahout on the Twitter follower network
> > (it contains information about ~35 million users following ~45 million
> > users, http://an.kaist.ac.kr/traces/WWW2010.html ), and I should have
> > access to the Cornell Hadoop cluster (55 quad-core nodes with 16-18 GB
> > RAM per node). Can anyone estimate how long the job will run?
> > Also, is it possible to perform regularized SVD, or will I need to add
> > that functionality by modifying the code?
> > Thank you
> >
> >
> > --
> > Akshay Uday Bhat.
> > Graduate Student, Computer Science, Cornell University
> > Website: http://www.akshaybhat.com
> >
>

Thanks

-- 
Akshay Uday Bhat.
Graduate Student, Computer Science, Cornell University
Website: http://www.akshaybhat.com

Re: Regarding the scalability of SVD code in Mahout

Posted by Ted Dunning <te...@gmail.com>.
I don't think anybody has done anything on quite that scale, though Jake may
have come relatively close.

There are several scaling limits.  These include:

- the total number of non-zero elements.  This drives the scan time and, to
some extent, the cost of the multiplies.

- the total number of singular vectors desired.  This directly drives the
number of iterations in the Hebbian approach and drives the size of
intermediate products in the random projection techniques.  It also causes
product scaling with the next factor.

- the number of columns in the original matrix.  This, multiplied by the
number of singular vectors, drives the memory cost of some approaches in the
final step or in the SVD step for the random projection.

- the number of rows in the original matrix.  This is a secondary factor
that can drive some intermediate products in the random projection.

Which of these will hang you up in your problem is an open question.  There
is always the factor I haven't thought about yet.
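To make those limits concrete, here is a rough sizing sketch plugging in the
dimensions from the original post and the ~1.5B edge count quoted elsewhere in
this thread; the per-entry byte counts and the choice of 100 singular vectors
are illustrative assumptions, not measurements:

    /** Back-of-the-envelope sizes for the four scaling limits above. */
    public class SvdScaleEstimate {
      public static void main(String[] args) {
        double GB   = 1L << 30;
        long   rows = 35000000L;      // ~35M follower rows
        long   cols = 45000000L;      // ~45M followed-user columns
        long   nnz  = 1500000000L;    // follow edges; order of magnitude only
        int    k    = 100;            // illustrative number of singular vectors

        // Non-zeros: every entry is read on each pass; assume ~12 bytes per
        // stored (index, value) pair, which bounds the scan volume per iteration.
        System.out.printf("scan volume per pass     ~ %.0f GB%n", nnz * 12 / GB);

        // Columns x singular vectors: the dense factor that has to fit in memory
        // in the final step (assume ~16 bytes per entry).
        System.out.printf("numCols * k in memory    ~ %.0f GB%n", k * cols * 16 / GB);

        // Rows x singular vectors: intermediate products in the random
        // projection, spread across the cluster (assume ~8 bytes per entry).
        System.out.printf("numRows * k intermediate ~ %.0f GB%n", k * rows * 8 / GB);
      }
    }

Of these, the middle term is the one that must fit on a single driving
machine rather than being spread across the cluster.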

Jake, do you have any thoughts on this?

On Sat, Sep 4, 2010 at 5:08 PM, Akshay Bhat <ak...@gmail.com> wrote:

> Hello,
> Has anyone attempted SVD of a really large matrix (~40 million rows and
> columns, to be specific) using Mahout?
> I am planning to perform SVD with Mahout on the Twitter follower network
> (it contains information about ~35 million users following ~45 million
> users, http://an.kaist.ac.kr/traces/WWW2010.html ), and I should have
> access to the Cornell Hadoop cluster (55 quad-core nodes with 16-18 GB
> RAM per node). Can anyone estimate how long the job will run?
> Also, is it possible to perform regularized SVD, or will I need to add
> that functionality by modifying the code?
> Thank you
>
>
> --
> Akshay Uday Bhat.
> Graduate Student, Computer Science, Cornell University
> Website: http://www.akshaybhat.com
>