Posted to user@spark.apache.org by Glitch <at...@datacratic.com> on 2014/09/18 22:02:51 UTC

SVD on larger than taller matrix

I have a sparse matrix of about 2 million+ rows and 3 million+ columns in svm
format*. As I understand it, running SVD on such a matrix
shouldn't be a problem since version 1.1.

I'm using 10 worker nodes on EC2, each with 30G of RAM (r3.xlarge). I was
able to compute the SVD for 20 singular values, but it fails with a Java
heap space error for 200 singular values. I'm currently trying 100.

So my question is this: what kind of cluster do you need to perform this
task?
As I do not have any measurable experience with Spark, I can't say whether
this is normal: my test for 100 singular values has been running for over an hour.

I'm using this dataset
http://archive.ics.uci.edu/ml/datasets/URL+Reputation

I'm using the spark-shell with --executor-memory 15G --driver-memory 15G


And the few lines of code are:

import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.util.MLUtils
val data = MLUtils.loadLibSVMFile(sc, "all.svm", 3231961)
val features = data.map(line => line.features)
val mat = new RowMatrix(features)
val svd = mat.computeSVD(200, computeU = true)


* svm format: <label> <column number>:<value>
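
For illustration, a row in this format might look like this (label and indices made up):

1 4:0.123 17:1.0 3231960:0.5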





Re: SVD on larger than taller matrix

Posted by Li Pu <lp...@twitter.com.INVALID>.
The main bottleneck of the current SVD implementation is the memory of the
driver node. It requires at least 5*n*k doubles in driver memory, because
all right singular vectors are stored on the driver and some working memory
is needed on top of that. So it is bounded by the smaller dimension of your
matrix and by k. For the worker nodes, the memory requirement should be much
smaller, as long as you can distribute your sparse matrix across the workers'
memory. If possible, ask for more memory for the driver node while keeping
the worker nodes' memory small.
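
As a rough sanity check (my own back-of-the-envelope estimate, assuming 8
bytes per double and ignoring JVM object overhead), 5*n*k for this matrix
and k = 200 already comes to roughly 24 GB on the driver:

// Back-of-the-envelope driver-memory estimate for 5*n*k doubles.
// Assumes 8 bytes per double; ignores JVM object overhead.
val n = 3231961L  // number of columns (features) in the dataset
val k = 200L      // requested number of singular values
val bytes = 5L * n * k * 8L
println(f"~${bytes / math.pow(1024, 3)}%.1f GB")  // roughly 24 GB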

Meanwhile we are working on removing this limitation by implementing
distributed QR and Lanczos in Spark.



-- 
Li
@vrilleup

Re: SVD on larger than taller matrix

Posted by Xiangrui Meng <me...@gmail.com>.
Did you cache `features`? Without caching it is slow because we need
O(k) iterations. The storage requirement on the driver is about 2 * n
* k doubles = 2 * 3 million * 200 * 8 bytes ~= 9GB, not counting any overhead.
Computing U is also an expensive task in your case. We should use a
randomized SVD implementation for your data, but this is not available
yet. I would recommend setting driver-memory to 25g, caching `features`,
and using a smaller k. -Xiangrui
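
A minimal sketch of how the original snippet could be adjusted along these
lines; the k = 50 and the memory flags below are illustrative choices, not
tested values:

// Launch the shell with more driver memory, e.g.:
//   spark-shell --driver-memory 25g --executor-memory 15g

import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "all.svm", 3231961)

// Cache the feature vectors so the O(k) passes over the data don't
// re-read and re-parse the libsvm file on every iteration.
val features = data.map(_.features).cache()

val mat = new RowMatrix(features)

// Ask for fewer singular values, and skip U unless you really need it,
// since computing U is expensive for a matrix of this size.
val svd = mat.computeSVD(50, computeU = false)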
