Posted to dev@systemml.apache.org by Janardhan Pulivarthi <ja...@gmail.com> on 2017/07/20 23:02:48 UTC

svd( ) implementation

Hi Nakul,

Can you help me understand how GPU implementations help scale it? Does the
user always have GPUs available when using this function, or is it an
optional feature?

The problem with implementing the scalable svd() in DML is due to the Spark
backend issues; otherwise there is no problem scaling without a local svd()
function.

Maybe we could try another algorithm for the scalable svd(), if the
present algorithm is misleading, as Imran Younus pointed out.

Thank you,
Janardhan

Re: svd( ) implementation

Posted by Imran Younus <im...@gmail.com>.
Hi Janardhan,

By tall-skinny matrix I mean that the number of columns should be
reasonably small (it doesn't matter how many rows there are). When you do a QR
decomposition, R is a square (upper triangular) matrix whose dimensions
equal the number of columns of matrix A. But then R has to fit in the
driver's memory, and that's why the number of columns of A cannot be very
large (a 10k x 10k dense matrix of doubles is about 0.8 GB, so close to 1 GB).
Once you have R, a local SVD implementation is needed to compute the SVD of R.
Now, this is not a very general method, but I think it is good enough for most cases.
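
To make the shapes concrete, here is a minimal numpy sketch of the QR-then-SVD
route described above (an illustration of the math only, with made-up sizes;
this is not SystemML code):

import numpy as np

# Illustrative sizes: many rows, few columns (tall-skinny).
A = np.random.rand(100000, 50)

# In SystemML the QR step would be distributed; here a local
# stand-in is used purely to show the shapes involved.
Q, R = np.linalg.qr(A)            # Q: 100000 x 50, R: 50 x 50

# R is tiny, so a local SVD on the driver suffices.
U_r, s, Vt = np.linalg.svd(R)

# A = Q R = (Q U_r) diag(s) Vt, so the left singular vectors of A
# are Q @ U_r; singular values and right vectors come from svd(R).
U = Q @ U_r

print(np.linalg.norm(A - U @ np.diag(s) @ Vt))   # ~0, sanity check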

imran

On Fri, Jul 28, 2017 at 10:37 PM, Janardhan Pulivarthi <
janardhan.pulivarthi@gmail.com> wrote:

> Thanks, Imran. As per the paper, we first perform a QR decomposition of the
> input matrix (A), from which we obtain R. And then we compute svd(R)
> using the builtin local function (??). I'll try this.
>
> Tall-skinny matrix: so, do we have a problem with square matrices? Or do we
> have to partition the matrix into tall-skinny matrices if we have a square
> one?
>
> Thanks,
>
> Janardhan
>
> On Fri, Jul 28, 2017 at 11:52 PM, Imran Younus <im...@gmail.com>
> wrote:
>
> > Just to clarify one thing. For the QR-based method, you can assume that the R
> > matrix is small enough to fit in driver memory and then perform SVD on the
> > driver. That means your actual matrix has to be a tall-skinny matrix.
> >
> > imran
> >
> > On Fri, Jul 28, 2017 at 11:15 AM, Imran Younus <im...@gmail.com>
> > wrote:
> >
> > > Janardhan,
> > >
> > > The papers you're referring to may not be relevant. The first paper, as
> > > far as I can tell, is about updating an existing svd decomposition as new
> > > data comes in. The 3rd paper in this list is the one I used, but that
> > > method is not good.
> > >
> > > There is also a method that uses QR decomposition and then calculates
> SVD
> > > from R matrix. Please have a look at equation 1.3 in this paper:
> > >
> > > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.127.115&rank=1
> > >
> > > I think this is worth trying out. The distributed QR is already
> > > implemented in SystemML, so it may be quick to try out.
> > >
> > > imran
> > >
> > >
> > >
> > > On Fri, Jul 28, 2017 at 10:10 AM, Janardhan Pulivarthi <
> > > janardhan.pulivarthi@gmail.com> wrote:
> > >
> > >> Hi Nakul & all the committers,
> > >>
> > >> So far I am halfway through the literature. But for now, a couple of
> > >> things to mention: in SVD there are three stages
> > >>   1. Bidiagonal reduction step
> > >>   2. Computation of the singular values
> > >>   3. Computation of the singular vectors
> > >>
> > >> Of these three, the *bidiagonal reduction* step is very expensive, so
> > >> our focus is on this (when considering GPUs, for cases where handling
> > >> it on the CPU is infeasible).
> > >>
> > >> About literature:
> > >>
> > >>    - I took some time to go through "A Stable and Fast Algorithm for
> > >>    Updating the Singular Value Decomposition" by Gu & Eisenstat, to
> > >>    understand the numerical stability and round-off errors when we
> > >>    partition the matrix in this distributed algorithm. The authors assure
> > >>    that each computed component will be of high absolute accuracy, and
> > >>    that the properties of the resultant matrix do not conflict with those
> > >>    of the parent matrix. [pdf
> > >>    <http://www.cs.yale.edu/publications/techreports/tr966.pdf>]
> > >>
> > >>
> > >>    - "High performance bidiagonal reduction using the tile algorithms
> on
> > >>    homogeneous multicore clusters ", by "Ltaief et. al", this paper
> has
> > >>    focused on the first stage mainly and has discussed a good about
> tile
> > >>    algorithms and their runtime implementations.(although off-topic, I
> > >> read
> > >>    this just to understand.) [pdf
> > >>    <http://www.netlib.org/lapack/lawnspdf/lawn247.pdf>]
> > >>
> > >>
> > >>    -  "A distributed and incremental svd algorithm for agglomerative
> > data
> > >>    analysis on large networks", by "Iwen & Ong", *Please go through*
> the
> > >>    (a). TABLE 1, TABLE 2 . (b). APPENDIX A. RAW DATA FROM NUMERICAL
> > >>    EXPERIMENTS. [pdf <https://arxiv.org/pdf/1601.07010.pdf>]
> > >>
> > >> Thanks,
> > >>
> > >> Janardhan
> > >>
> > >> On Wed, Jul 26, 2017 at 12:29 AM, Nakul Jindal <na...@gmail.com>
> > wrote:
> > >>
> > >> > Hi Janardhan,
> > >> >
> > >> > The images you've used as attachments haven't reached my inbox.
> > >> > Could you please send them to me directly, rather than through the
> dev
> > >> > mailing list.
> > >> > (Or upload them to an image hosting site like imgur and paste the links
> > in
> > >> the
> > >> > email)
> > >> >
> > >> > I would like to point out that my knowledge of machine learning is
> > >> limited.
> > >> > Still, how would you want to test the algorithm?
> > >> >
> > >> >
> > >> > Sparse matrices in SystemML (in Spark Execution Mode)
> > >> > Sparse matrix support in SystemML is implemented at a block level. Each
> > >> > (potentially giant) matrix is stored as blocks (in Spark execution mode).
> > >> > The matrix itself doesn't store knowledge of whether it is sparse or not.
> > >> > Each block does. Each block of this giant matrix can be stored in dense or
> > >> > sparse format, depending on how many non-zeroes are in that block.
> > >> > The sparsity of a block is recomputed every so often, and the block
> > >> > representation can be converted from dense to sparse and vice-versa.
> > >> > When operations are performed on blocks, internally, a sparse-aware
> > >> > operation may be performed, depending on how the blocks are stored.
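
As a rough, hypothetical sketch of this block-level idea (not SystemML's actual
internals; the threshold and helper name below are invented for illustration):

import numpy as np
from scipy.sparse import csr_matrix

SPARSITY_THRESHOLD = 0.4   # invented cut-off, for illustration only

def repack_block(block):
    """Store a single block densely or sparsely based on its non-zero ratio."""
    nnz_ratio = np.count_nonzero(block) / block.size
    return csr_matrix(block) if nnz_ratio < SPARSITY_THRESHOLD else np.asarray(block)

# A mostly-empty block ends up in a sparse format, a full block stays dense.
mostly_empty = np.zeros((1000, 1000))
mostly_empty[0, :10] = 1.0
print(type(repack_block(mostly_empty)))           # scipy sparse matrix
print(type(repack_block(np.ones((1000, 1000)))))  # numpy ndarray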
> > >> >
> > >> > (One of the other contributors can clarify, if I've missed something
> > or
> > >> > have said something wrong).
> > >> >
> > >> > Given this context, can you please think about whether missing
> sparse
> > >> > matrix support is still a problem?
> > >> >
> > >> >
> > >> > -Nakul
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Tue, Jul 25, 2017 at 11:14 AM, Janardhan Pulivarthi <
> > >> > janardhan.pulivarthi@gmail.com> wrote:
> > >> >
> > >> > > Hi Nakul,
> > >> > >
> > >> > > Thanks for explaining the pros and cons of the two approaches to me.
> > >> > > For now, I have gone through the paper carefully over a couple of
> > >> > > days and found the following interesting things.
> > >> > >
> > >> > > 1. This is the algorithm we implemented.
> > >> > >
> > >> > > ​
> > >> > > 2. In the above algorithm the input matrix A is approximated by
> > >> > > another matrix B, with the relation between them having an error of
> > >> > > chi(p, i) [ as shown in (3) ], which the author argues will be within
> > >> > > an acceptable limit. So, we can go with this algorithm.
> > >> > >
> > >> > >
> > >> > > But, one bad thing is that the author is not sure whether the
> > >> > > algorithm supports sparse matrices or not. So, we may need to test
> > >> > > it here.
> > >> > >
> > >> > > For the time being we need to test the present algorithm
> implemented
> > >> by
> > >> > > Imran Younus again. Can you help me with the testing?
> > >> > >
> > >> > > Thanks,
> > >> > > Janardhan
> > >> > > ​
> > >> > >
> > >> > > On Fri, Jul 21, 2017 at 6:07 AM, Nakul Jindal <na...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > >> Hi Janardhan,
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> How will GPU implementation help scale distributed SVD:
> > >> > >>
> > >> > >> Imran implemented an algorithm he found out about in the paper "A
> > >> > >> Distributed and Incremental SVD Algorithm for Agglomerative Data
> > >> > Analysis
> > >> > >> on Large Networks" (
> > >> > >> https://github.com/apache/systemml/pull/273/files#diff-488f0
> > >> > >> 6e290f7a54db2e125f7bc608971R27
> > >> > >> ).
> > >> > >> The idea there was to build up a distributed SVD using
> invocations
> > of
> > >> > svd
> > >> > >> on your local machine. He tried to achieve the multi-level
> > >> parallelism
> > >> > >> through the parfor construct.
> > >> > >> Each local invocation of svd was done using the Apache Commons
> Math
> > >> > >> library.
> > >> > >> If each invocation of this local svd can instead be done on a
> GPU,
> > >> the
> > >> > >> overall wall time for the distributed version would be decreased.
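
For intuition, here is a rough numpy sketch of building a large SVD out of
independent local SVD calls on row blocks (a simplified take on the incremental
idea discussed in this thread, with a sequential loop standing in for the
parfor/GPU parallelism; it is not the actual DML implementation):

import numpy as np

def blockwise_svd(A, num_blocks=4):
    """Approximate SVD of A built from independent local SVDs of row blocks.

    Each block's SVD could run in parallel (e.g. one per worker or GPU);
    here they run sequentially for clarity.
    """
    blocks = np.array_split(A, num_blocks, axis=0)

    # Local step: SVD of each row block; keep only sigma * V^T,
    # which is all that is needed to recover A's right factor.
    partials = []
    for block in blocks:
        _, s, Vt = np.linalg.svd(block, full_matrices=False)
        partials.append(np.diag(s) @ Vt)

    # Merge step: a single small SVD of the stacked partial factors
    # yields the singular values and right singular vectors of A.
    stacked = np.vstack(partials)
    _, s, Vt = np.linalg.svd(stacked, full_matrices=False)

    # Left singular vectors follow from U = A V diag(1/s).
    U = A @ Vt.T / s
    return U, s, Vt

A = np.random.rand(2000, 50)
U, s, Vt = blockwise_svd(A)
print(np.linalg.norm(A - U * s @ Vt))   # small reconstruction error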
> > >> > >>
> > >> > >> Users may not always have a GPU. In that case, the svd falls back
> > to
> > >> the
> > >> > >> Apache Commons Math implementation. But if they do and if we have
> a
> > >> "svd"
> > >> > >> builtin function, then it would be easier to take advantage of
> the
> > >> GPU.
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> Problem with scalable svd in DML is due to Spark backend issues,
> > >> > >> otherwise there is no problem scaling w/o a local svd():
> > >> > >>
> > >> > >> There may be Spark backend issues, and more may come to light as
> > >> > >> more workloads are executed on SystemML.
> > >> > >> For any given operation - we can implement it as a DML bodied
> > >> function
> > >> > or
> > >> > >> a
> > >> > >> builtin function.
> > >> > >> For DML Bodied functions:
> > >> > >> Pros:
> > >> > >> - The SystemML optimizer can be applied to it
> > >> > >> - Distribution of SVD is then taken care of by SystemML. It will
> > >> > generate
> > >> > >> and run the spark primitives needed.
> > >> > >>
> > >> > >> Cons:
> > >> > >> - Implementing SVD, whether in DML or C, is a fair amount of work
> > >> > >> - There would not be a straightforward call to the svd gpu
> library.
> > >> In
> > >> > >> fact, each of the linear algebra primitives would be accelerated
> on
> > >> the
> > >> > >> GPU, but not the entire operation itself. This would involve many
> > >> more
> > >> > JNI
> > >> > >> calls.
> > >> > >>
> > >> > >> For builtin functions:
> > >> > >> Pros:
> > >> > >> - Use of GPU libraries (cuSolver) and CPU libraries (Apache
> Commons
> > >> > Math)
> > >> > >> can be made; these are already optimized (in the case of the GPU)
> > >> > >> - If a better SVD implementation is available via a library, that
> > can
> > >> > >> easily be plugged in.
> > >> > >>
> > >> > >> Cons:
> > >> > >> - Would have to come up with an algorithm to implement
> distributed
> > >> SVD
> > >> > >> with
> > >> > >> spark primitives
> > >> > >>
> > >> > >> Pick your battle.
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> Maybe we could try another algorithm for scalable svd() :
> > >> > >>
> > >> > >> Sure. But before you do that, it may be worth our while to
> > understand
> > >> > what
> > >> > >> is exactly misleading about the paper that Imran talks about.
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> -Nakul
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> On Thu, Jul 20, 2017 at 4:02 PM, Janardhan Pulivarthi <
> > >> > >> janardhan.pulivarthi@gmail.com> wrote:
> > >> > >>
> > >> > >> > Hi Nakul,
> > >> > >> >
> > >> > >> > Can you help me understand how gpu implementations help scale
> it.
> > >> > >> Whether
> > >> > >> > the user always have GPUs in use when using this fn or is it an
> > >> > optional
> > >> > >> > feature.
> > >> > >>
> > >> > >>
> > >> > >> > The problem with implementing the scalable svd() in dml is due
> to
> > >> the
> > >> > >> spark
> > >> > >> > backend issues, otherwise there is no problem scaling w/o a
> local
> > >> > svd()
> > >> > >> > function.
> > >> > >> >
> > >> > >> > May be we could try another algorithm for the scalable svd( ),
> if
> > >> the
> > >> > >> > present algorithm is misleading as Imran Younus pointed out.
> > >> > >> >
> > >> > >> > Thank you,
> > >> > >> > Janardhan
> > >> > >> >
> > >> > >>
> > >> > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: svd( ) implementation

Posted by Janardhan Pulivarthi <ja...@gmail.com>.
Thanks, Imran. As per the paper, we first perform a QR decomposition of the
input matrix (A), from which we obtain R. And then we compute svd(R)
using the builtin local function (??). I'll try this.

Tall-skinny matrix: so, do we have a problem with square matrices? Or do we
have to partition the matrix into tall-skinny matrices if we have a square
one?

Thanks,

Janardhan

On Fri, Jul 28, 2017 at 11:52 PM, Imran Younus <im...@gmail.com>
wrote:

> Just to clarify one thing. For QR based, method, you can assume that R
> matrix is small enough to fit on driver memory and them perform SVD on the
> driver. That means your actual matrix has to tall-skinny matrix.
>
> imran
>
> On Fri, Jul 28, 2017 at 11:15 AM, Imran Younus <im...@gmail.com>
> wrote:
>
> > Janardhan,
> >
> > The papers you're referring may not be relevant. The first paper, as far
> > as I can tell, is about updating an existing svd decomposition as new
> data
> > comes in. The 3rd paper in this list is the one I used, but that method
> is
> > not good.
> >
> > There is also a method that uses QR decomposition and then calculates SVD
> > from R matrix. Please have a look at equation 1.3 in this paper:
> >
> > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.127.115&rank=1
> >
> > I think this is worth trying out. The distributed QR is already
> > implemented in SystemlML, so it may quick to try out.
> >
> > imran
> >
> >
> >
> > On Fri, Jul 28, 2017 at 10:10 AM, Janardhan Pulivarthi <
> > janardhan.pulivarthi@gmail.com> wrote:
> >
> >> Hi Nakul & all the committers,
> >>
> >> Till now I am half way through the literature. But, for now a couple of
> >> things to mention, in SVD there are three stages
> >>   1. Bidiagonal reduction step
> >>   2. Computation of the singular values
> >>   3. Computation of the singular vectors
> >>
> >> of these three, The* Bidiagonal reduction* step is very expensive, so is
> >> our focus on this( when considering GPU, at times where handling with
> CPU
> >> is infeasible).
> >>
> >> About literature:
> >>
> >>    - I took some time to go through " A Stable and Fast Algorithm for
> >>    Updating the Singular Value Decomposition" by "Gu & Stanley", to
> >> understand
> >>    the numerical stability and round-off errors when we are partitioning
> >> the
> >>    matrix in this distributed algorithm. The author has assured that
> each
> >>    component computed will be of high absolute accuracy. And also, the
> >>    properties that the resultant matrix support do not have any
> conflicts
> >> with
> >>    parent matrix. [pdf
> >>    <http://www.cs.yale.edu/publications/techreports/tr966.pdf>]
> >>
> >>
> >>    - "High performance bidiagonal reduction using the tile algorithms on
> >>    homogeneous multicore clusters ", by "Ltaief et. al", this paper has
> >>    focused on the first stage mainly and has discussed a good about tile
> >>    algorithms and their runtime implementations.(although off-topic, I
> >> read
> >>    this just to understand.) [pdf
> >>    <http://www.netlib.org/lapack/lawnspdf/lawn247.pdf>]
> >>
> >>
> >>    -  "A distributed and incremental svd algorithm for agglomerative
> data
> >>    analysis on large networks", by "Iwen & Ong", *Please go through* the
> >>    (a). TABLE 1, TABLE 2 . (b). APPENDIX A. RAW DATA FROM NUMERICAL
> >>    EXPERIMENTS. [pdf <https://arxiv.org/pdf/1601.07010.pdf>]
> >>
> >> Thanks,
> >>
> >> Janardhan
> >>
> >> On Wed, Jul 26, 2017 at 12:29 AM, Nakul Jindal <na...@gmail.com>
> wrote:
> >>
> >> > Hi Janardhan,
> >> >
> >> > The images you've used as attachments haven't reached my inbox.
> >> > Could you please send them to me directly, rather than through the dev
> >> > mailing list.
> >> > (Or upload it to a image hosting site like imgur and paste the links
> in
> >> the
> >> > email)
> >> >
> >> > I would like to point out that my knowledge of machine learning is
> >> limited.
> >> > Still, how would you want to test the algorithm?
> >> >
> >> >
> >> > Sparse matrices in SystemML (in Spark Execution Mode)
> >> > Sparse matrix support in SystemML is implemented at a block level.
> Each
> >> > (potentially giant) matrix is stored as blocks (in Spark execution
> >> mode).
> >> > The matrix itself doesn't store knowledge of whether it is sparse or
> >> not.
> >> > Each block does. Each block of this giant matrix can be stored in
> dense
> >> or
> >> > spare format, depending on how many non-zeroes are in that block.
> >> > The sparsity of a block is recomputed every so often, and the block
> >> > representation can be converted from dense to sparse and vice-versa.
> >> > When operations are performed on blocks, internally, a sparse aware
> >> > operation maybe performed, depending on how the blocks are stored.
> >> >
> >> > (One of the other contributors can clarify, if I've missed something
> or
> >> > have said something wrong).
> >> >
> >> > Given this context, can you please think about whether missing sparse
> >> > matrix support is still a problem?
> >> >
> >> >
> >> > -Nakul
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Tue, Jul 25, 2017 at 11:14 AM, Janardhan Pulivarthi <
> >> > janardhan.pulivarthi@gmail.com> wrote:
> >> >
> >> > > Hi Nakul,
> >> > >
> >> > > Thanks for explaining me about pros and cons of the two approaches.
> >> For
> >> > > now, I have gone through the paper carefully over a couple of days
> and
> >> > > found the following interesting things.
> >> > >
> >> > > 1. This is the algorithm we implemented.
> >> > >
> >> > > ​
> >> > > 2. In the above algorithm the input matrix A is approximated to
> >> another
> >> > > matrix B with the following relation with the error of chi(p, i) [
> as
> >> > shown
> >> > > in (3) ] which the author argues will be in an acceptable limit. So,
> >> we
> >> > can
> >> > > go with this algorithm.
> >> > >
> >> > >
> >> > > But, one bad thing is that author is not sure about whether the
> >> algorithm
> >> > > supports the sparse matrices or not. So, we may need to test it
> here.
> >> > >
> >> > > For the time being we need to test the present algorithm implemented
> >> by
> >> > > Imran Younus again. Can you help me with the testing?
> >> > >
> >> > > Thanks,
> >> > > Janardhan
> >> > > ​
> >> > >
> >> > > On Fri, Jul 21, 2017 at 6:07 AM, Nakul Jindal <na...@gmail.com>
> >> wrote:
> >> > >
> >> > >> Hi Janardhan,
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >> How will GPU implementation help scale distributed SVD:
> >> > >>
> >> > >> Imran implemented an algorithm he found out about in the paper "A
> >> > >> Distributed and Incremental SVD Algorithm for Agglomerative Data
> >> > Analysis
> >> > >> on Large Networks" (
> >> > >> https://github.com/apache/systemml/pull/273/files#diff-488f0
> >> > >> 6e290f7a54db2e125f7bc608971R27
> >> > >> ).
> >> > >> The idea there was to build up a distributed SVD using invocations
> of
> >> > svd
> >> > >> on your local machine. He tried to achieve the multi-level
> >> parallelism
> >> > >> through the parfor construct.
> >> > >> Each local invocation of svd was done using the Apache Commons Math
> >> > >> library.
> >> > >> If each invocation of this local svd can instead be done on a GPU,
> >> the
> >> > >> overall wall time for the distributed version would be decreased.
> >> > >>
> >> > >> Users may not always have a GPU. In that case, the svd falls back
> to
> >> the
> >> > >> Apache Comons Math implementation. But if they do and if we have a
> >> "svd"
> >> > >> builtin function, then it would be easier to take advantage of the
> >> GPU.
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >> Problem with scalable svd in dml is due to spark backed issues,
> >> > otherwise
> >> > >> there is not problem scaling w/o a local svd():
> >> > >>
> >> > >> There maybe spark backend issues and more may come to light and
> more
> >> > >> workloads are executed on SystemML.
> >> > >> For any given operation - we can implement it as a DML bodied
> >> function
> >> > or
> >> > >> a
> >> > >> builtin function.
> >> > >> For DML Bodied functions:
> >> > >> Pros:
> >> > >> - The SystemML optimizer can be applied to it
> >> > >> - Distribution of SVD is then taken care of by SystemML. It will
> >> > generate
> >> > >> and run the spark primitives needed.
> >> > >>
> >> > >> Cons:
> >> > >> - Implementing SVD, whether in DML or C, is a fair amount of work
> >> > >> - There would not be a straightforward call to the svd gpu library.
> >> In
> >> > >> fact, each of the linear algebra primitives would be accelerated on
> >> the
> >> > >> GPU, but not the entire operation itself. This would involve many
> >> more
> >> > JNI
> >> > >> calls.
> >> > >>
> >> > >> For builtin functions:
> >> > >> Pros:
> >> > >> - Use of GPU libraries (cuSolver) and CPU libraries (Apache Commons
> >> > Math)
> >> > >> can be made, these are already optimized (in case of the GPU)
> >> > >> - If a better SVD implementation is available via a library, that
> can
> >> > >> easily be plugged in.
> >> > >>
> >> > >> Cons:
> >> > >> - Would have to come up with an algorithm to implement distributed
> >> SVD
> >> > >> with
> >> > >> spark primitives
> >> > >>
> >> > >> Pick your battle.
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >> Maybe we could try another algorithm for scalable svd() :
> >> > >>
> >> > >> Sure. But before you do that, it may be worth our while to
> understand
> >> > what
> >> > >> is exactly misleading about the paper that Imran talks about.
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >> -Nakul
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >> On Thu, Jul 20, 2017 at 4:02 PM, Janardhan Pulivarthi <
> >> > >> janardhan.pulivarthi@gmail.com> wrote:
> >> > >>
> >> > >> > Hi Nakul,
> >> > >> >
> >> > >> > Can you help me understand how gpu implementations help scale it.
> >> > >> Whether
> >> > >> > the user always have GPUs in use when using this fn or is it an
> >> > optional
> >> > >> > feature.
> >> > >>
> >> > >>
> >> > >> > The problem with implementing the scalable svd() in dml is due to
> >> the
> >> > >> spark
> >> > >> > backend issues, otherwise there is no problem scaling w/o a local
> >> > svd()
> >> > >> > function.
> >> > >> >
> >> > >> > May be we could try another algorithm for the scalable svd( ), if
> >> the
> >> > >> > present algorithm is misleading as Imran Younus pointed out.
> >> > >> >
> >> > >> > Thank you,
> >> > >> > Janardhan
> >> > >> >
> >> > >>
> >> > >
> >> > >
> >> >
> >>
> >
> >
>

RE: svd( ) implementation

Posted by "Boucher, Jimmy" <Ji...@liberty.co.za>.
unsubscribe

-----Original Message-----
From: Imran Younus [mailto:imranyounus@gmail.com]
Sent: Friday, July 28, 2017 2:22 PM
To: dev@systemml.apache.org
Subject: Re: svd( ) implementation

Just to clarify one thing. For QR based, method, you can assume that R matrix is small enough to fit on driver memory and them perform SVD on the driver. That means your actual matrix has to tall-skinny matrix.

imran

On Fri, Jul 28, 2017 at 11:15 AM, Imran Younus <im...@gmail.com>
wrote:

> Janardhan,
>
> The papers you're referring may not be relevant. The first paper, as
> far as I can tell, is about updating an existing svd decomposition as
> new data comes in. The 3rd paper in this list is the one I used, but
> that method is not good.
>
> There is also a method that uses QR decomposition and then calculates
> SVD from R matrix. Please have a look at equation 1.3 in this paper:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.127.115&rank=1
>
> I think this is worth trying out. The distributed QR is already
> implemented in SystemlML, so it may quick to try out.
>
> imran
>
>
>
> On Fri, Jul 28, 2017 at 10:10 AM, Janardhan Pulivarthi <
> janardhan.pulivarthi@gmail.com> wrote:
>
>> Hi Nakul & all the committers,
>>
>> Till now I am half way through the literature. But, for now a couple
>> of things to mention, in SVD there are three stages
>> 1. Bidiagonal reduction step
>> 2. Computation of the singular values
>> 3. Computation of the singular vectors
>>
>> of these three, The* Bidiagonal reduction* step is very expensive, so
>> is our focus on this( when considering GPU, at times where handling
>> with CPU is infeasible).
>>
>> About literature:
>>
>> - I took some time to go through " A Stable and Fast Algorithm for
>> Updating the Singular Value Decomposition" by "Gu & Stanley", to
>> understand
>> the numerical stability and round-off errors when we are
>> partitioning the
>> matrix in this distributed algorithm. The author has assured that each
>> component computed will be of high absolute accuracy. And also, the
>> properties that the resultant matrix support do not have any
>> conflicts with
>> parent matrix. [pdf
>> <http://www.cs.yale.edu/publications/techreports/tr966.pdf>]
>>
>>
>> - "High performance bidiagonal reduction using the tile algorithms on
>> homogeneous multicore clusters ", by "Ltaief et. al", this paper has
>> focused on the first stage mainly and has discussed a good about tile
>> algorithms and their runtime implementations.(although off-topic,
>> I read
>> this just to understand.) [pdf
>> <http://www.netlib.org/lapack/lawnspdf/lawn247.pdf>]
>>
>>
>> - "A distributed and incremental svd algorithm for agglomerative data
>> analysis on large networks", by "Iwen & Ong", *Please go through* the
>> (a). TABLE 1, TABLE 2 . (b). APPENDIX A. RAW DATA FROM NUMERICAL
>> EXPERIMENTS. [pdf <https://arxiv.org/pdf/1601.07010.pdf>]
>>
>> Thanks,
>>
>> Janardhan
>>
>> On Wed, Jul 26, 2017 at 12:29 AM, Nakul Jindal <na...@gmail.com> wrote:
>>
>> > Hi Janardhan,
>> >
>> > The images you've used as attachments haven't reached my inbox.
>> > Could you please send them to me directly, rather than through the
>> > dev mailing list.
>> > (Or upload it to a image hosting site like imgur and paste the
>> > links in
>> the
>> > email)
>> >
>> > I would like to point out that my knowledge of machine learning is
>> limited.
>> > Still, how would you want to test the algorithm?
>> >
>> >
>> > Sparse matrices in SystemML (in Spark Execution Mode) Sparse matrix
>> > support in SystemML is implemented at a block level. Each
>> > (potentially giant) matrix is stored as blocks (in Spark execution
>> mode).
>> > The matrix itself doesn't store knowledge of whether it is sparse
>> > or
>> not.
>> > Each block does. Each block of this giant matrix can be stored in
>> > dense
>> or
>> > spare format, depending on how many non-zeroes are in that block.
>> > The sparsity of a block is recomputed every so often, and the block
>> > representation can be converted from dense to sparse and vice-versa.
>> > When operations are performed on blocks, internally, a sparse aware
>> > operation maybe performed, depending on how the blocks are stored.
>> >
>> > (One of the other contributors can clarify, if I've missed
>> > something or have said something wrong).
>> >
>> > Given this context, can you please think about whether missing
>> > sparse matrix support is still a problem?
>> >
>> >
>> > -Nakul
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Tue, Jul 25, 2017 at 11:14 AM, Janardhan Pulivarthi <
>> > janardhan.pulivarthi@gmail.com> wrote:
>> >
>> > > Hi Nakul,
>> > >
>> > > Thanks for explaining me about pros and cons of the two approaches.
>> For
>> > > now, I have gone through the paper carefully over a couple of
>> > > days and found the following interesting things.
>> > >
>> > > 1. This is the algorithm we implemented.
>> > >
>> > > ​
>> > > 2. In the above algorithm the input matrix A is approximated to
>> another
>> > > matrix B with the following relation with the error of chi(p, i)
>> > > [ as
>> > shown
>> > > in (3) ] which the author argues will be in an acceptable limit.
>> > > So,
>> we
>> > can
>> > > go with this algorithm.
>> > >
>> > >
>> > > But, one bad thing is that author is not sure about whether the
>> algorithm
>> > > supports the sparse matrices or not. So, we may need to test it here.
>> > >
>> > > For the time being we need to test the present algorithm
>> > > implemented
>> by
>> > > Imran Younus again. Can you help me with the testing?
>> > >
>> > > Thanks,
>> > > Janardhan
>> > > ​
>> > >
>> > > On Fri, Jul 21, 2017 at 6:07 AM, Nakul Jindal <na...@gmail.com>
>> wrote:
>> > >
>> > >> Hi Janardhan,
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> How will GPU implementation help scale distributed SVD:
>> > >>
>> > >> Imran implemented an algorithm he found out about in the paper
>> > >> "A Distributed and Incremental SVD Algorithm for Agglomerative
>> > >> Data
>> > Analysis
>> > >> on Large Networks" (
>> > >> https://github.com/apache/systemml/pull/273/files#diff-488f0
>> > >> 6e290f7a54db2e125f7bc608971R27
>> > >> ).
>> > >> The idea there was to build up a distributed SVD using
>> > >> invocations of
>> > svd
>> > >> on your local machine. He tried to achieve the multi-level
>> parallelism
>> > >> through the parfor construct.
>> > >> Each local invocation of svd was done using the Apache Commons
>> > >> Math library.
>> > >> If each invocation of this local svd can instead be done on a
>> > >> GPU,
>> the
>> > >> overall wall time for the distributed version would be decreased.
>> > >>
>> > >> Users may not always have a GPU. In that case, the svd falls
>> > >> back to
>> the
>> > >> Apache Comons Math implementation. But if they do and if we have
>> > >> a
>> "svd"
>> > >> builtin function, then it would be easier to take advantage of
>> > >> the
>> GPU.
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> Problem with scalable svd in dml is due to spark backed issues,
>> > otherwise
>> > >> there is not problem scaling w/o a local svd():
>> > >>
>> > >> There maybe spark backend issues and more may come to light and
>> > >> more workloads are executed on SystemML.
>> > >> For any given operation - we can implement it as a DML bodied
>> function
>> > or
>> > >> a
>> > >> builtin function.
>> > >> For DML Bodied functions:
>> > >> Pros:
>> > >> - The SystemML optimizer can be applied to it
>> > >> - Distribution of SVD is then taken care of by SystemML. It will
>> > generate
>> > >> and run the spark primitives needed.
>> > >>
>> > >> Cons:
>> > >> - Implementing SVD, whether in DML or C, is a fair amount of
>> > >> work
>> > >> - There would not be a straightforward call to the svd gpu library.
>> In
>> > >> fact, each of the linear algebra primitives would be accelerated
>> > >> on
>> the
>> > >> GPU, but not the entire operation itself. This would involve
>> > >> many
>> more
>> > JNI
>> > >> calls.
>> > >>
>> > >> For builtin functions:
>> > >> Pros:
>> > >> - Use of GPU libraries (cuSolver) and CPU libraries (Apache
>> > >> Commons
>> > Math)
>> > >> can be made, these are already optimized (in case of the GPU)
>> > >> - If a better SVD implementation is available via a library,
>> > >> that can easily be plugged in.
>> > >>
>> > >> Cons:
>> > >> - Would have to come up with an algorithm to implement
>> > >> distributed
>> SVD
>> > >> with
>> > >> spark primitives
>> > >>
>> > >> Pick your battle.
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> Maybe we could try another algorithm for scalable svd() :
>> > >>
>> > >> Sure. But before you do that, it may be worth our while to
>> > >> understand
>> > what
>> > >> is exactly misleading about the paper that Imran talks about.
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> -Nakul
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> On Thu, Jul 20, 2017 at 4:02 PM, Janardhan Pulivarthi <
>> > >> janardhan.pulivarthi@gmail.com> wrote:
>> > >>
>> > >> > Hi Nakul,
>> > >> >
>> > >> > Can you help me understand how gpu implementations help scale it.
>> > >> Whether
>> > >> > the user always have GPUs in use when using this fn or is it
>> > >> > an
>> > optional
>> > >> > feature.
>> > >>
>> > >>
>> > >> > The problem with implementing the scalable svd() in dml is due
>> > >> > to
>> the
>> > >> spark
>> > >> > backend issues, otherwise there is no problem scaling w/o a
>> > >> > local
>> > svd()
>> > >> > function.
>> > >> >
>> > >> > May be we could try another algorithm for the scalable svd( ),
>> > >> > if
>> the
>> > >> > present algorithm is misleading as Imran Younus pointed out.
>> > >> >
>> > >> > Thank you,
>> > >> > Janardhan
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>


RE: svd( ) implementation

Posted by Matthew Plourde <mp...@cadent.tv>.
unsubscribe

-----Original Message-----
From: Imran Younus [mailto:imranyounus@gmail.com] 
Sent: Friday, July 28, 2017 2:22 PM
To: dev@systemml.apache.org
Subject: Re: svd( ) implementation

Just to clarify one thing. For QR based, method, you can assume that R matrix is small enough to fit on driver memory and them perform SVD on the driver. That means your actual matrix has to tall-skinny matrix.

imran

On Fri, Jul 28, 2017 at 11:15 AM, Imran Younus <im...@gmail.com>
wrote:

> Janardhan,
>
> The papers you're referring may not be relevant. The first paper, as 
> far as I can tell, is about updating an existing svd decomposition as 
> new data comes in. The 3rd paper in this list is the one I used, but 
> that method is not good.
>
> There is also a method that uses QR decomposition and then calculates 
> SVD from R matrix. Please have a look at equation 1.3 in this paper:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.127.115&rank=1
>
> I think this is worth trying out. The distributed QR is already 
> implemented in SystemlML, so it may quick to try out.
>
> imran
>
>
>
> On Fri, Jul 28, 2017 at 10:10 AM, Janardhan Pulivarthi < 
> janardhan.pulivarthi@gmail.com> wrote:
>
>> Hi Nakul & all the committers,
>>
>> Till now I am half way through the literature. But, for now a couple 
>> of things to mention, in SVD there are three stages
>>   1. Bidiagonal reduction step
>>   2. Computation of the singular values
>>   3. Computation of the singular vectors
>>
>> of these three, The* Bidiagonal reduction* step is very expensive, so 
>> is our focus on this( when considering GPU, at times where handling 
>> with CPU is infeasible).
>>
>> About literature:
>>
>>    - I took some time to go through " A Stable and Fast Algorithm for
>>    Updating the Singular Value Decomposition" by "Gu & Stanley", to 
>> understand
>>    the numerical stability and round-off errors when we are 
>> partitioning the
>>    matrix in this distributed algorithm. The author has assured that each
>>    component computed will be of high absolute accuracy. And also, the
>>    properties that the resultant matrix support do not have any 
>> conflicts with
>>    parent matrix. [pdf
>>    <http://www.cs.yale.edu/publications/techreports/tr966.pdf>]
>>
>>
>>    - "High performance bidiagonal reduction using the tile algorithms on
>>    homogeneous multicore clusters ", by "Ltaief et. al", this paper has
>>    focused on the first stage mainly and has discussed a good about tile
>>    algorithms and their runtime implementations.(although off-topic, 
>> I read
>>    this just to understand.) [pdf
>>    <http://www.netlib.org/lapack/lawnspdf/lawn247.pdf>]
>>
>>
>>    -  "A distributed and incremental svd algorithm for agglomerative data
>>    analysis on large networks", by "Iwen & Ong", *Please go through* the
>>    (a). TABLE 1, TABLE 2 . (b). APPENDIX A. RAW DATA FROM NUMERICAL
>>    EXPERIMENTS. [pdf <https://arxiv.org/pdf/1601.07010.pdf>]
>>
>> Thanks,
>>
>> Janardhan
>>
>> On Wed, Jul 26, 2017 at 12:29 AM, Nakul Jindal <na...@gmail.com> wrote:
>>
>> > Hi Janardhan,
>> >
>> > The images you've used as attachments haven't reached my inbox.
>> > Could you please send them to me directly, rather than through the 
>> > dev mailing list.
>> > (Or upload it to a image hosting site like imgur and paste the 
>> > links in
>> the
>> > email)
>> >
>> > I would like to point out that my knowledge of machine learning is
>> limited.
>> > Still, how would you want to test the algorithm?
>> >
>> >
>> > Sparse matrices in SystemML (in Spark Execution Mode) Sparse matrix 
>> > support in SystemML is implemented at a block level. Each 
>> > (potentially giant) matrix is stored as blocks (in Spark execution
>> mode).
>> > The matrix itself doesn't store knowledge of whether it is sparse 
>> > or
>> not.
>> > Each block does. Each block of this giant matrix can be stored in 
>> > dense
>> or
>> > spare format, depending on how many non-zeroes are in that block.
>> > The sparsity of a block is recomputed every so often, and the block 
>> > representation can be converted from dense to sparse and vice-versa.
>> > When operations are performed on blocks, internally, a sparse aware 
>> > operation maybe performed, depending on how the blocks are stored.
>> >
>> > (One of the other contributors can clarify, if I've missed 
>> > something or have said something wrong).
>> >
>> > Given this context, can you please think about whether missing 
>> > sparse matrix support is still a problem?
>> >
>> >
>> > -Nakul
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Tue, Jul 25, 2017 at 11:14 AM, Janardhan Pulivarthi < 
>> > janardhan.pulivarthi@gmail.com> wrote:
>> >
>> > > Hi Nakul,
>> > >
>> > > Thanks for explaining me about pros and cons of the two approaches.
>> For
>> > > now, I have gone through the paper carefully over a couple of 
>> > > days and found the following interesting things.
>> > >
>> > > 1. This is the algorithm we implemented.
>> > >
>> > > ​
>> > > 2. In the above algorithm the input matrix A is approximated to
>> another
>> > > matrix B with the following relation with the error of chi(p, i) 
>> > > [ as
>> > shown
>> > > in (3) ] which the author argues will be in an acceptable limit. 
>> > > So,
>> we
>> > can
>> > > go with this algorithm.
>> > >
>> > >
>> > > But, one bad thing is that author is not sure about whether the
>> algorithm
>> > > supports the sparse matrices or not. So, we may need to test it here.
>> > >
>> > > For the time being we need to test the present algorithm 
>> > > implemented
>> by
>> > > Imran Younus again. Can you help me with the testing?
>> > >
>> > > Thanks,
>> > > Janardhan
>> > > ​
>> > >
>> > > On Fri, Jul 21, 2017 at 6:07 AM, Nakul Jindal <na...@gmail.com>
>> wrote:
>> > >
>> > >> Hi Janardhan,
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> How will GPU implementation help scale distributed SVD:
>> > >>
>> > >> Imran implemented an algorithm he found out about in the paper 
>> > >> "A Distributed and Incremental SVD Algorithm for Agglomerative 
>> > >> Data
>> > Analysis
>> > >> on Large Networks" (
>> > >> https://github.com/apache/systemml/pull/273/files#diff-488f0
>> > >> 6e290f7a54db2e125f7bc608971R27
>> > >> ).
>> > >> The idea there was to build up a distributed SVD using 
>> > >> invocations of
>> > svd
>> > >> on your local machine. He tried to achieve the multi-level
>> parallelism
>> > >> through the parfor construct.
>> > >> Each local invocation of svd was done using the Apache Commons 
>> > >> Math library.
>> > >> If each invocation of this local svd can instead be done on a 
>> > >> GPU,
>> the
>> > >> overall wall time for the distributed version would be decreased.
>> > >>
>> > >> Users may not always have a GPU. In that case, the svd falls 
>> > >> back to
>> the
>> > >> Apache Comons Math implementation. But if they do and if we have 
>> > >> a
>> "svd"
>> > >> builtin function, then it would be easier to take advantage of 
>> > >> the
>> GPU.
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> Problem with scalable svd in dml is due to spark backed issues,
>> > otherwise
>> > >> there is not problem scaling w/o a local svd():
>> > >>
>> > >> There maybe spark backend issues and more may come to light and 
>> > >> more workloads are executed on SystemML.
>> > >> For any given operation - we can implement it as a DML bodied
>> function
>> > or
>> > >> a
>> > >> builtin function.
>> > >> For DML Bodied functions:
>> > >> Pros:
>> > >> - The SystemML optimizer can be applied to it
>> > >> - Distribution of SVD is then taken care of by SystemML. It will
>> > generate
>> > >> and run the spark primitives needed.
>> > >>
>> > >> Cons:
>> > >> - Implementing SVD, whether in DML or C, is a fair amount of 
>> > >> work
>> > >> - There would not be a straightforward call to the svd gpu library.
>> In
>> > >> fact, each of the linear algebra primitives would be accelerated 
>> > >> on
>> the
>> > >> GPU, but not the entire operation itself. This would involve 
>> > >> many
>> more
>> > JNI
>> > >> calls.
>> > >>
>> > >> For builtin functions:
>> > >> Pros:
>> > >> - Use of GPU libraries (cuSolver) and CPU libraries (Apache 
>> > >> Commons
>> > Math)
>> > >> can be made, these are already optimized (in case of the GPU)
>> > >> - If a better SVD implementation is available via a library, 
>> > >> that can easily be plugged in.
>> > >>
>> > >> Cons:
>> > >> - Would have to come up with an algorithm to implement 
>> > >> distributed
>> SVD
>> > >> with
>> > >> spark primitives
>> > >>
>> > >> Pick your battle.
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> Maybe we could try another algorithm for scalable svd() :
>> > >>
>> > >> Sure. But before you do that, it may be worth our while to 
>> > >> understand
>> > what
>> > >> is exactly misleading about the paper that Imran talks about.
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> -Nakul
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> On Thu, Jul 20, 2017 at 4:02 PM, Janardhan Pulivarthi < 
>> > >> janardhan.pulivarthi@gmail.com> wrote:
>> > >>
>> > >> > Hi Nakul,
>> > >> >
>> > >> > Can you help me understand how gpu implementations help scale it.
>> > >> Whether
>> > >> > the user always have GPUs in use when using this fn or is it 
>> > >> > an
>> > optional
>> > >> > feature.
>> > >>
>> > >>
>> > >> > The problem with implementing the scalable svd() in dml is due 
>> > >> > to
>> the
>> > >> spark
>> > >> > backend issues, otherwise there is no problem scaling w/o a 
>> > >> > local
>> > svd()
>> > >> > function.
>> > >> >
>> > >> > May be we could try another algorithm for the scalable svd( ), 
>> > >> > if
>> the
>> > >> > present algorithm is misleading as Imran Younus pointed out.
>> > >> >
>> > >> > Thank you,
>> > >> > Janardhan
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>

Re: svd( ) implementation

Posted by Imran Younus <im...@gmail.com>.
Just to clarify one thing. For the QR-based method, you can assume that the R
matrix is small enough to fit in driver memory and then perform SVD on the
driver. That means your actual matrix has to be a tall-skinny matrix.

imran

On Fri, Jul 28, 2017 at 11:15 AM, Imran Younus <im...@gmail.com>
wrote:

> Janardhan,
>
> The papers you're referring may not be relevant. The first paper, as far
> as I can tell, is about updating an existing svd decomposition as new data
> comes in. The 3rd paper in this list is the one I used, but that method is
> not good.
>
> There is also a method that uses QR decomposition and then calculates SVD
> from R matrix. Please have a look at equation 1.3 in this paper:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.127.115&rank=1
>
> I think this is worth trying out. The distributed QR is already
> implemented in SystemlML, so it may quick to try out.
>
> imran
>
>
>
> On Fri, Jul 28, 2017 at 10:10 AM, Janardhan Pulivarthi <
> janardhan.pulivarthi@gmail.com> wrote:
>
>> Hi Nakul & all the committers,
>>
>> Till now I am half way through the literature. But, for now a couple of
>> things to mention, in SVD there are three stages
>>   1. Bidiagonal reduction step
>>   2. Computation of the singular values
>>   3. Computation of the singular vectors
>>
>> of these three, The* Bidiagonal reduction* step is very expensive, so is
>> our focus on this( when considering GPU, at times where handling with CPU
>> is infeasible).
>>
>> About literature:
>>
>>    - I took some time to go through " A Stable and Fast Algorithm for
>>    Updating the Singular Value Decomposition" by "Gu & Stanley", to
>> understand
>>    the numerical stability and round-off errors when we are partitioning
>> the
>>    matrix in this distributed algorithm. The author has assured that each
>>    component computed will be of high absolute accuracy. And also, the
>>    properties that the resultant matrix support do not have any conflicts
>> with
>>    parent matrix. [pdf
>>    <http://www.cs.yale.edu/publications/techreports/tr966.pdf>]
>>
>>
>>    - "High performance bidiagonal reduction using the tile algorithms on
>>    homogeneous multicore clusters ", by "Ltaief et. al", this paper has
>>    focused on the first stage mainly and has discussed a good about tile
>>    algorithms and their runtime implementations.(although off-topic, I
>> read
>>    this just to understand.) [pdf
>>    <http://www.netlib.org/lapack/lawnspdf/lawn247.pdf>]
>>
>>
>>    -  "A distributed and incremental svd algorithm for agglomerative data
>>    analysis on large networks", by "Iwen & Ong", *Please go through* the
>>    (a). TABLE 1, TABLE 2 . (b). APPENDIX A. RAW DATA FROM NUMERICAL
>>    EXPERIMENTS. [pdf <https://arxiv.org/pdf/1601.07010.pdf>]
>>
>> Thanks,
>>
>> Janardhan
>>
>> On Wed, Jul 26, 2017 at 12:29 AM, Nakul Jindal <na...@gmail.com> wrote:
>>
>> > Hi Janardhan,
>> >
>> > The images you've used as attachments haven't reached my inbox.
>> > Could you please send them to me directly, rather than through the dev
>> > mailing list.
>> > (Or upload it to a image hosting site like imgur and paste the links in
>> the
>> > email)
>> >
>> > I would like to point out that my knowledge of machine learning is
>> limited.
>> > Still, how would you want to test the algorithm?
>> >
>> >
>> > Sparse matrices in SystemML (in Spark Execution Mode)
>> > Sparse matrix support in SystemML is implemented at a block level. Each
>> > (potentially giant) matrix is stored as blocks (in Spark execution
>> mode).
>> > The matrix itself doesn't store knowledge of whether it is sparse or
>> not.
>> > Each block does. Each block of this giant matrix can be stored in dense
>> or
>> > spare format, depending on how many non-zeroes are in that block.
>> > The sparsity of a block is recomputed every so often, and the block
>> > representation can be converted from dense to sparse and vice-versa.
>> > When operations are performed on blocks, internally, a sparse aware
>> > operation maybe performed, depending on how the blocks are stored.
>> >
>> > (One of the other contributors can clarify, if I've missed something or
>> > have said something wrong).
>> >
>> > Given this context, can you please think about whether missing sparse
>> > matrix support is still a problem?
>> >
>> >
>> > -Nakul
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Tue, Jul 25, 2017 at 11:14 AM, Janardhan Pulivarthi <
>> > janardhan.pulivarthi@gmail.com> wrote:
>> >
>> > > Hi Nakul,
>> > >
>> > > Thanks for explaining me about pros and cons of the two approaches.
>> For
>> > > now, I have gone through the paper carefully over a couple of days and
>> > > found the following interesting things.
>> > >
>> > > 1. This is the algorithm we implemented.
>> > >
>> > > ​
>> > > 2. In the above algorithm the input matrix A is approximated to
>> another
>> > > matrix B with the following relation with the error of chi(p, i) [ as
>> > shown
>> > > in (3) ] which the author argues will be in an acceptable limit. So,
>> we
>> > can
>> > > go with this algorithm.
>> > >
>> > >
>> > > But, one bad thing is that author is not sure about whether the
>> algorithm
>> > > supports the sparse matrices or not. So, we may need to test it here.
>> > >
>> > > For the time being we need to test the present algorithm implemented
>> by
>> > > Imran Younus again. Can you help me with the testing?
>> > >
>> > > Thanks,
>> > > Janardhan
>> > > ​
>> > >
>> > > On Fri, Jul 21, 2017 at 6:07 AM, Nakul Jindal <na...@gmail.com>
>> wrote:
>> > >
>> > >> Hi Janardhan,
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> How will GPU implementation help scale distributed SVD:
>> > >>
>> > >> Imran implemented an algorithm he found out about in the paper "A
>> > >> Distributed and Incremental SVD Algorithm for Agglomerative Data
>> > Analysis
>> > >> on Large Networks" (
>> > >> https://github.com/apache/systemml/pull/273/files#diff-488f0
>> > >> 6e290f7a54db2e125f7bc608971R27
>> > >> ).
>> > >> The idea there was to build up a distributed SVD using invocations of
>> > svd
>> > >> on your local machine. He tried to achieve the multi-level
>> parallelism
>> > >> through the parfor construct.
>> > >> Each local invocation of svd was done using the Apache Commons Math
>> > >> library.
>> > >> If each invocation of this local svd can instead be done on a GPU,
>> the
>> > >> overall wall time for the distributed version would be decreased.
>> > >>
>> > >> Users may not always have a GPU. In that case, the svd falls back to
>> the
>> > >> Apache Comons Math implementation. But if they do and if we have a
>> "svd"
>> > >> builtin function, then it would be easier to take advantage of the
>> GPU.
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> Problem with scalable svd in dml is due to spark backed issues,
>> > otherwise
>> > >> there is not problem scaling w/o a local svd():
>> > >>
>> > >> There maybe spark backend issues and more may come to light and more
>> > >> workloads are executed on SystemML.
>> > >> For any given operation - we can implement it as a DML bodied
>> function
>> > or
>> > >> a
>> > >> builtin function.
>> > >> For DML Bodied functions:
>> > >> Pros:
>> > >> - The SystemML optimizer can be applied to it
>> > >> - Distribution of SVD is then taken care of by SystemML. It will
>> > generate
>> > >> and run the spark primitives needed.
>> > >>
>> > >> Cons:
>> > >> - Implementing SVD, whether in DML or C, is a fair amount of work
>> > >> - There would not be a straightforward call to the svd gpu library.
>> In
>> > >> fact, each of the linear algebra primitives would be accelerated on
>> the
>> > >> GPU, but not the entire operation itself. This would involve many
>> more
>> > JNI
>> > >> calls.
>> > >>
>> > >> For builtin functions:
>> > >> Pros:
>> > >> - Use of GPU libraries (cuSolver) and CPU libraries (Apache Commons
>> > Math)
>> > >> can be made, these are already optimized (in case of the GPU)
>> > >> - If a better SVD implementation is available via a library, that can
>> > >> easily be plugged in.
>> > >>
>> > >> Cons:
>> > >> - Would have to come up with an algorithm to implement distributed
>> SVD
>> > >> with
>> > >> spark primitives
>> > >>
>> > >> Pick your battle.
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> Maybe we could try another algorithm for scalable svd() :
>> > >>
>> > >> Sure. But before you do that, it may be worth our while to understand
>> > what
>> > >> is exactly misleading about the paper that Imran talks about.
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> -Nakul
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >>
>> > >> On Thu, Jul 20, 2017 at 4:02 PM, Janardhan Pulivarthi <
>> > >> janardhan.pulivarthi@gmail.com> wrote:
>> > >>
>> > >> > Hi Nakul,
>> > >> >
>> > >> > Can you help me understand how gpu implementations help scale it.
>> > >> Whether
>> > >> > the user always have GPUs in use when using this fn or is it an
>> > optional
>> > >> > feature.
>> > >>
>> > >>
>> > >> > The problem with implementing the scalable svd() in dml is due to
>> the
>> > >> spark
>> > >> > backend issues, otherwise there is no problem scaling w/o a local
>> > svd()
>> > >> > function.
>> > >> >
>> > >> > May be we could try another algorithm for the scalable svd( ), if
>> the
>> > >> > present algorithm is misleading as Imran Younus pointed out.
>> > >> >
>> > >> > Thank you,
>> > >> > Janardhan
>> > >> >
>> > >>
>> > >
>> > >
>> >
>>
>
>

Re: svd( ) implementation

Posted by Imran Younus <im...@gmail.com>.
Janardhan,

The papers you're referring to may not be relevant. The first paper, as far as
I can tell, is about updating an existing svd decomposition as new data
comes in. The 3rd paper in this list is the one I used, but that method is
not good.

There is also a method that uses QR decomposition and then calculates the SVD
from the R matrix. Please have a look at equation 1.3 in this paper:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.127.115&rank=1

I think this is worth trying out. The distributed QR is already implemented
in SystemML, so it may be quick to try out.
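
For reference, the identity behind this QR-based route (my own summary of the
idea, not a quote from the linked paper):

\[
A = QR, \qquad R = U_R \Sigma V^{\top} \;\Longrightarrow\; A = (Q U_R)\,\Sigma\,V^{\top}
\]

so the singular values and right singular vectors of A come straight from the
small local SVD of R, and the left singular vectors are U = Q U_R.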

imran



On Fri, Jul 28, 2017 at 10:10 AM, Janardhan Pulivarthi <
janardhan.pulivarthi@gmail.com> wrote:

> Hi Nakul & all the committers,
>
> Till now I am half way through the literature. But, for now a couple of
> things to mention, in SVD there are three stages
>   1. Bidiagonal reduction step
>   2. Computation of the singular values
>   3. Computation of the singular vectors
>
> of these three, The* Bidiagonal reduction* step is very expensive, so is
> our focus on this( when considering GPU, at times where handling with CPU
> is infeasible).
>
> About literature:
>
>    - I took some time to go through " A Stable and Fast Algorithm for
>    Updating the Singular Value Decomposition" by "Gu & Stanley", to
> understand
>    the numerical stability and round-off errors when we are partitioning
> the
>    matrix in this distributed algorithm. The author has assured that each
>    component computed will be of high absolute accuracy. And also, the
>    properties that the resultant matrix support do not have any conflicts
> with
>    parent matrix. [pdf
>    <http://www.cs.yale.edu/publications/techreports/tr966.pdf>]
>
>
>    - "High performance bidiagonal reduction using the tile algorithms on
>    homogeneous multicore clusters ", by "Ltaief et. al", this paper has
>    focused on the first stage mainly and has discussed a good about tile
>    algorithms and their runtime implementations.(although off-topic, I read
>    this just to understand.) [pdf
>    <http://www.netlib.org/lapack/lawnspdf/lawn247.pdf>]
>
>
>    -  "A distributed and incremental svd algorithm for agglomerative data
>    analysis on large networks", by "Iwen & Ong", *Please go through* the
>    (a). TABLE 1, TABLE 2 . (b). APPENDIX A. RAW DATA FROM NUMERICAL
>    EXPERIMENTS. [pdf <https://arxiv.org/pdf/1601.07010.pdf>]
>
> Thanks,
>
> Janardhan
>
> On Wed, Jul 26, 2017 at 12:29 AM, Nakul Jindal <na...@gmail.com> wrote:
>
> > Hi Janardhan,
> >
> > The images you've used as attachments haven't reached my inbox.
> > Could you please send them to me directly, rather than through the dev
> > mailing list.
> > (Or upload it to a image hosting site like imgur and paste the links in
> the
> > email)
> >
> > I would like to point out that my knowledge of machine learning is
> limited.
> > Still, how would you want to test the algorithm?
> >
> >
> > Sparse matrices in SystemML (in Spark Execution Mode)
> > Sparse matrix support in SystemML is implemented at a block level. Each
> > (potentially giant) matrix is stored as blocks (in Spark execution mode).
> > The matrix itself doesn't store knowledge of whether it is sparse or not.
> > Each block does. Each block of this giant matrix can be stored in dense
> or
> > spare format, depending on how many non-zeroes are in that block.
> > The sparsity of a block is recomputed every so often, and the block
> > representation can be converted from dense to sparse and vice-versa.
> > When operations are performed on blocks, internally, a sparse aware
> > operation maybe performed, depending on how the blocks are stored.
> >
> > (One of the other contributors can clarify, if I've missed something or
> > have said something wrong).
> >
> > Given this context, can you please think about whether missing sparse
> > matrix support is still a problem?
> >
> >
> > -Nakul
> >
> >
> >
> >
> >
> >
> >
> > On Tue, Jul 25, 2017 at 11:14 AM, Janardhan Pulivarthi <
> > janardhan.pulivarthi@gmail.com> wrote:
> >
> > > Hi Nakul,
> > >
> > > Thanks for explaining me about pros and cons of the two approaches. For
> > > now, I have gone through the paper carefully over a couple of days and
> > > found the following interesting things.
> > >
> > > 1. This is the algorithm we implemented.
> > >
> > > ​
> > > 2. In the above algorithm the input matrix A is approximated to another
> > > matrix B with the following relation with the error of chi(p, i) [ as
> > shown
> > > in (3) ] which the author argues will be in an acceptable limit. So, we
> > can
> > > go with this algorithm.
> > >
> > >
> > > But, one bad thing is that author is not sure about whether the
> algorithm
> > > supports the sparse matrices or not. So, we may need to test it here.
> > >
> > > For the time being we need to test the present algorithm implemented by
> > > Imran Younus again. Can you help me with the testing?
> > >
> > > Thanks,
> > > Janardhan
> > > ​
> > >
> > > On Fri, Jul 21, 2017 at 6:07 AM, Nakul Jindal <na...@gmail.com>
> wrote:
> > >
> > >> Hi Janardhan,
> > >>
> > >>
> > >>
> > >>
> > >> How will GPU implementation help scale distributed SVD:
> > >>
> > >> Imran implemented an algorithm he found out about in the paper "A
> > >> Distributed and Incremental SVD Algorithm for Agglomerative Data
> > Analysis
> > >> on Large Networks" (
> > >> https://github.com/apache/systemml/pull/273/files#diff-488f0
> > >> 6e290f7a54db2e125f7bc608971R27
> > >> ).
> > >> The idea there was to build up a distributed SVD using invocations of
> > svd
> > >> on your local machine. He tried to achieve the multi-level parallelism
> > >> through the parfor construct.
> > >> Each local invocation of svd was done using the Apache Commons Math
> > >> library.
> > >> If each invocation of this local svd can instead be done on a GPU, the
> > >> overall wall time for the distributed version would be decreased.
> > >>
> > >> Users may not always have a GPU. In that case, the svd falls back to
> the
> > >> Apache Comons Math implementation. But if they do and if we have a
> "svd"
> > >> builtin function, then it would be easier to take advantage of the
> GPU.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> Problem with scalable svd in dml is due to spark backed issues,
> > otherwise
> > >> there is not problem scaling w/o a local svd():
> > >>
> > >> There maybe spark backend issues and more may come to light and more
> > >> workloads are executed on SystemML.
> > >> For any given operation - we can implement it as a DML bodied function
> > or
> > >> a
> > >> builtin function.
> > >> For DML Bodied functions:
> > >> Pros:
> > >> - The SystemML optimizer can be applied to it
> > >> - Distribution of SVD is then taken care of by SystemML. It will
> > generate
> > >> and run the spark primitives needed.
> > >>
> > >> Cons:
> > >> - Implementing SVD, whether in DML or C, is a fair amount of work
> > >> - There would not be a straightforward call to the svd gpu library. In
> > >> fact, each of the linear algebra primitives would be accelerated on
> the
> > >> GPU, but not the entire operation itself. This would involve many more
> > JNI
> > >> calls.
> > >>
> > >> For builtin functions:
> > >> Pros:
> > >> - Use of GPU libraries (cuSolver) and CPU libraries (Apache Commons
> > Math)
> > >> can be made, these are already optimized (in case of the GPU)
> > >> - If a better SVD implementation is available via a library, that can
> > >> easily be plugged in.
> > >>
> > >> Cons:
> > >> - Would have to come up with an algorithm to implement distributed SVD
> > >> with
> > >> spark primitives
> > >>
> > >> Pick your battle.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> Maybe we could try another algorithm for scalable svd() :
> > >>
> > >> Sure. But before you do that, it may be worth our while to understand
> > what
> > >> is exactly misleading about the paper that Imran talks about.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> -Nakul
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Thu, Jul 20, 2017 at 4:02 PM, Janardhan Pulivarthi <
> > >> janardhan.pulivarthi@gmail.com> wrote:
> > >>
> > >> > Hi Nakul,
> > >> >
> > >> > Can you help me understand how gpu implementations help scale it.
> > >> Whether
> > >> > the user always have GPUs in use when using this fn or is it an
> > optional
> > >> > feature.
> > >>
> > >>
> > >> > The problem with implementing the scalable svd() in dml is due to
> the
> > >> spark
> > >> > backend issues, otherwise there is no problem scaling w/o a local
> > svd()
> > >> > function.
> > >> >
> > >> > May be we could try another algorithm for the scalable svd( ), if
> the
> > >> > present algorithm is misleading as Imran Younus pointed out.
> > >> >
> > >> > Thank you,
> > >> > Janardhan
> > >> >
> > >>
> > >
> > >
> >
>

Re: svd( ) implementation

Posted by Janardhan Pulivarthi <ja...@gmail.com>.
Hi Nakul & all the committers,

So far I am halfway through the literature. For now, a couple of things to
mention: an SVD computation has three stages
  1. Bidiagonal reduction step
  2. Computation of the singular values
  3. Computation of the singular vectors

Of these three, the *Bidiagonal reduction* step is very expensive, so our
focus is on this step (when considering the GPU, for cases where handling it
on the CPU is infeasible).
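
(For reference, and as I understand it, the standard two-phase scheme behind
these stages is

  A = U_1 * B * V_1^T         -- stage 1: reduce A to a bidiagonal matrix B
  B = U_2 * S * V_2^T         -- stages 2 and 3: SVD of the bidiagonal B
  => A = (U_1*U_2) * S * (V_1*V_2)^T

so speeding up stage 1 addresses the bulk of the cost.)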

About literature:

   - I took some time to go through "A Stable and Fast Algorithm for
   Updating the Singular Value Decomposition" by Gu & Eisenstat, to understand
   the numerical stability and round-off errors when we partition the matrix
   in this distributed algorithm. The authors assure that each computed
   component is of high absolute accuracy, and that the properties of the
   resultant matrix do not conflict with those of the parent matrix. [pdf
   <http://www.cs.yale.edu/publications/techreports/tr966.pdf>]


   - "High performance bidiagonal reduction using the tile algorithms on
   homogeneous multicore clusters ", by "Ltaief et. al", this paper has
   focused on the first stage mainly and has discussed a good about tile
   algorithms and their runtime implementations.(although off-topic, I read
   this just to understand.) [pdf
   <http://www.netlib.org/lapack/lawnspdf/lawn247.pdf>]


   -  "A distributed and incremental svd algorithm for agglomerative data
   analysis on large networks", by "Iwen & Ong", *Please go through* the
   (a). TABLE 1, TABLE 2 . (b). APPENDIX A. RAW DATA FROM NUMERICAL
   EXPERIMENTS. [pdf <https://arxiv.org/pdf/1601.07010.pdf>]

Thanks,

Janardhan

On Wed, Jul 26, 2017 at 12:29 AM, Nakul Jindal <na...@gmail.com> wrote:

> Hi Janardhan,
>
> The images you've used as attachments haven't reached my inbox.
> Could you please send them to me directly, rather than through the dev
> mailing list.
> (Or upload it to a image hosting site like imgur and paste the links in the
> email)
>
> I would like to point out that my knowledge of machine learning is limited.
> Still, how would you want to test the algorithm?
>
>
> Sparse matrices in SystemML (in Spark Execution Mode)
> Sparse matrix support in SystemML is implemented at a block level. Each
> (potentially giant) matrix is stored as blocks (in Spark execution mode).
> The matrix itself doesn't store knowledge of whether it is sparse or not.
> Each block does. Each block of this giant matrix can be stored in dense or
> spare format, depending on how many non-zeroes are in that block.
> The sparsity of a block is recomputed every so often, and the block
> representation can be converted from dense to sparse and vice-versa.
> When operations are performed on blocks, internally, a sparse aware
> operation maybe performed, depending on how the blocks are stored.
>
> (One of the other contributors can clarify, if I've missed something or
> have said something wrong).
>
> Given this context, can you please think about whether missing sparse
> matrix support is still a problem?
>
>
> -Nakul
>
>
>
>
>
>
>
> On Tue, Jul 25, 2017 at 11:14 AM, Janardhan Pulivarthi <
> janardhan.pulivarthi@gmail.com> wrote:
>
> > Hi Nakul,
> >
> > Thanks for explaining me about pros and cons of the two approaches. For
> > now, I have gone through the paper carefully over a couple of days and
> > found the following interesting things.
> >
> > 1. This is the algorithm we implemented.
> >
> > ​
> > 2. In the above algorithm the input matrix A is approximated to another
> > matrix B with the following relation with the error of chi(p, i) [ as
> shown
> > in (3) ] which the author argues will be in an acceptable limit. So, we
> can
> > go with this algorithm.
> >
> >
> > But, one bad thing is that author is not sure about whether the algorithm
> > supports the sparse matrices or not. So, we may need to test it here.
> >
> > For the time being we need to test the present algorithm implemented by
> > Imran Younus again. Can you help me with the testing?
> >
> > Thanks,
> > Janardhan
> > ​
> >
> > On Fri, Jul 21, 2017 at 6:07 AM, Nakul Jindal <na...@gmail.com> wrote:
> >
> >> Hi Janardhan,
> >>
> >>
> >>
> >>
> >> How will GPU implementation help scale distributed SVD:
> >>
> >> Imran implemented an algorithm he found out about in the paper "A
> >> Distributed and Incremental SVD Algorithm for Agglomerative Data
> Analysis
> >> on Large Networks" (
> >> https://github.com/apache/systemml/pull/273/files#diff-488f0
> >> 6e290f7a54db2e125f7bc608971R27
> >> ).
> >> The idea there was to build up a distributed SVD using invocations of
> svd
> >> on your local machine. He tried to achieve the multi-level parallelism
> >> through the parfor construct.
> >> Each local invocation of svd was done using the Apache Commons Math
> >> library.
> >> If each invocation of this local svd can instead be done on a GPU, the
> >> overall wall time for the distributed version would be decreased.
> >>
> >> Users may not always have a GPU. In that case, the svd falls back to the
> >> Apache Comons Math implementation. But if they do and if we have a "svd"
> >> builtin function, then it would be easier to take advantage of the GPU.
> >>
> >>
> >>
> >>
> >>
> >>
> >> Problem with scalable svd in dml is due to spark backed issues,
> otherwise
> >> there is not problem scaling w/o a local svd():
> >>
> >> There maybe spark backend issues and more may come to light and more
> >> workloads are executed on SystemML.
> >> For any given operation - we can implement it as a DML bodied function
> or
> >> a
> >> builtin function.
> >> For DML Bodied functions:
> >> Pros:
> >> - The SystemML optimizer can be applied to it
> >> - Distribution of SVD is then taken care of by SystemML. It will
> generate
> >> and run the spark primitives needed.
> >>
> >> Cons:
> >> - Implementing SVD, whether in DML or C, is a fair amount of work
> >> - There would not be a straightforward call to the svd gpu library. In
> >> fact, each of the linear algebra primitives would be accelerated on the
> >> GPU, but not the entire operation itself. This would involve many more
> JNI
> >> calls.
> >>
> >> For builtin functions:
> >> Pros:
> >> - Use of GPU libraries (cuSolver) and CPU libraries (Apache Commons
> Math)
> >> can be made, these are already optimized (in case of the GPU)
> >> - If a better SVD implementation is available via a library, that can
> >> easily be plugged in.
> >>
> >> Cons:
> >> - Would have to come up with an algorithm to implement distributed SVD
> >> with
> >> spark primitives
> >>
> >> Pick your battle.
> >>
> >>
> >>
> >>
> >>
> >>
> >> Maybe we could try another algorithm for scalable svd() :
> >>
> >> Sure. But before you do that, it may be worth our while to understand
> what
> >> is exactly misleading about the paper that Imran talks about.
> >>
> >>
> >>
> >>
> >>
> >> -Nakul
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On Thu, Jul 20, 2017 at 4:02 PM, Janardhan Pulivarthi <
> >> janardhan.pulivarthi@gmail.com> wrote:
> >>
> >> > Hi Nakul,
> >> >
> >> > Can you help me understand how gpu implementations help scale it.
> >> Whether
> >> > the user always have GPUs in use when using this fn or is it an
> optional
> >> > feature.
> >>
> >>
> >> > The problem with implementing the scalable svd() in dml is due to the
> >> spark
> >> > backend issues, otherwise there is no problem scaling w/o a local
> svd()
> >> > function.
> >> >
> >> > May be we could try another algorithm for the scalable svd( ), if the
> >> > present algorithm is misleading as Imran Younus pointed out.
> >> >
> >> > Thank you,
> >> > Janardhan
> >> >
> >>
> >
> >
>

Re: svd( ) implementation

Posted by Nakul Jindal <na...@gmail.com>.
Hi Janardhan,

The images you've used as attachments haven't reached my inbox.
Could you please send them to me directly, rather than through the dev
mailing list?
(Or upload them to an image hosting site like imgur and paste the links in
the email.)

I would like to point out that my knowledge of machine learning is limited.
Still, how would you want to test the algorithm?


Sparse matrices in SystemML (in Spark Execution Mode)
Sparse matrix support in SystemML is implemented at a block level. Each
(potentially giant) matrix is stored as blocks (in Spark execution mode).
The matrix itself doesn't store knowledge of whether it is sparse or not.
Each block does. Each block of this giant matrix can be stored in dense or
sparse format, depending on how many non-zeroes are in that block.
The sparsity of a block is recomputed every so often, and the block
representation can be converted from dense to sparse and vice-versa.
When operations are performed on blocks, internally, a sparse-aware
operation may be performed, depending on how the blocks are stored.

(One of the other contributors can clarify, if I've missed something or
have said something wrong).
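
To make the per-block decision concrete, here is a tiny, simplified sketch in
Java (the threshold value and the names are made up for illustration; the
real MatrixBlock logic in SystemML is more involved):

public class BlockStorageSketch {
  // assumed threshold, for illustration only
  static final double SPARSITY_TURN_POINT = 0.4;

  // decide whether a block with the given non-zero count should be stored sparse
  static boolean evalSparseFormat(long nonZeros, long rows, long cols) {
    double sparsity = (double) nonZeros / (rows * cols);
    return sparsity < SPARSITY_TURN_POINT;
  }

  public static void main(String[] args) {
    System.out.println(evalSparseFormat(500, 1000, 1000));     // true  -> sparse block
    System.out.println(evalSparseFormat(900000, 1000, 1000));  // false -> dense block
  }
}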

Given this context, can you please think about whether missing sparse
matrix support is still a problem?


-Nakul







On Tue, Jul 25, 2017 at 11:14 AM, Janardhan Pulivarthi <
janardhan.pulivarthi@gmail.com> wrote:

> Hi Nakul,
>
> Thanks for explaining me about pros and cons of the two approaches. For
> now, I have gone through the paper carefully over a couple of days and
> found the following interesting things.
>
> 1. This is the algorithm we implemented.
>
> ​
> 2. In the above algorithm the input matrix A is approximated to another
> matrix B with the following relation with the error of chi(p, i) [ as shown
> in (3) ] which the author argues will be in an acceptable limit. So, we can
> go with this algorithm.
>
>
> But, one bad thing is that author is not sure about whether the algorithm
> supports the sparse matrices or not. So, we may need to test it here.
>
> For the time being we need to test the present algorithm implemented by
> Imran Younus again. Can you help me with the testing?
>
> Thanks,
> Janardhan
> ​
>
> On Fri, Jul 21, 2017 at 6:07 AM, Nakul Jindal <na...@gmail.com> wrote:
>
>> Hi Janardhan,
>>
>>
>>
>>
>> How will GPU implementation help scale distributed SVD:
>>
>> Imran implemented an algorithm he found out about in the paper "A
>> Distributed and Incremental SVD Algorithm for Agglomerative Data Analysis
>> on Large Networks" (
>> https://github.com/apache/systemml/pull/273/files#diff-488f0
>> 6e290f7a54db2e125f7bc608971R27
>> ).
>> The idea there was to build up a distributed SVD using invocations of svd
>> on your local machine. He tried to achieve the multi-level parallelism
>> through the parfor construct.
>> Each local invocation of svd was done using the Apache Commons Math
>> library.
>> If each invocation of this local svd can instead be done on a GPU, the
>> overall wall time for the distributed version would be decreased.
>>
>> Users may not always have a GPU. In that case, the svd falls back to the
>> Apache Comons Math implementation. But if they do and if we have a "svd"
>> builtin function, then it would be easier to take advantage of the GPU.
>>
>>
>>
>>
>>
>>
>> Problem with scalable svd in dml is due to spark backed issues, otherwise
>> there is not problem scaling w/o a local svd():
>>
>> There maybe spark backend issues and more may come to light and more
>> workloads are executed on SystemML.
>> For any given operation - we can implement it as a DML bodied function or
>> a
>> builtin function.
>> For DML Bodied functions:
>> Pros:
>> - The SystemML optimizer can be applied to it
>> - Distribution of SVD is then taken care of by SystemML. It will generate
>> and run the spark primitives needed.
>>
>> Cons:
>> - Implementing SVD, whether in DML or C, is a fair amount of work
>> - There would not be a straightforward call to the svd gpu library. In
>> fact, each of the linear algebra primitives would be accelerated on the
>> GPU, but not the entire operation itself. This would involve many more JNI
>> calls.
>>
>> For builtin functions:
>> Pros:
>> - Use of GPU libraries (cuSolver) and CPU libraries (Apache Commons Math)
>> can be made, these are already optimized (in case of the GPU)
>> - If a better SVD implementation is available via a library, that can
>> easily be plugged in.
>>
>> Cons:
>> - Would have to come up with an algorithm to implement distributed SVD
>> with
>> spark primitives
>>
>> Pick your battle.
>>
>>
>>
>>
>>
>>
>> Maybe we could try another algorithm for scalable svd() :
>>
>> Sure. But before you do that, it may be worth our while to understand what
>> is exactly misleading about the paper that Imran talks about.
>>
>>
>>
>>
>>
>> -Nakul
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Jul 20, 2017 at 4:02 PM, Janardhan Pulivarthi <
>> janardhan.pulivarthi@gmail.com> wrote:
>>
>> > Hi Nakul,
>> >
>> > Can you help me understand how gpu implementations help scale it.
>> Whether
>> > the user always have GPUs in use when using this fn or is it an optional
>> > feature.
>>
>>
>> > The problem with implementing the scalable svd() in dml is due to the
>> spark
>> > backend issues, otherwise there is no problem scaling w/o a local svd()
>> > function.
>> >
>> > May be we could try another algorithm for the scalable svd( ), if the
>> > present algorithm is misleading as Imran Younus pointed out.
>> >
>> > Thank you,
>> > Janardhan
>> >
>>
>
>

Re: svd( ) implementation

Posted by Janardhan Pulivarthi <ja...@gmail.com>.
Hi Nakul,

Thanks for explaining the pros and cons of the two approaches. I have now
gone through the paper carefully over a couple of days and found the
following interesting things.

1. This is the algorithm we implemented. [image attachment not shown]

2. In the above algorithm, the input matrix A is approximated by another
matrix B with the following relation, with an error of chi(p, i) [as shown
in (3)], which the authors argue will be within an acceptable limit. So we
can go with this algorithm.


But one drawback is that the authors are not sure whether the algorithm
supports sparse matrices or not, so we may need to test that here.

For the time being we need to test the present algorithm implemented by
Imran Younus again. Can you help me with the testing?
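
As a starting point, here is a small, untested sketch (local only, with
commons-math3; the sizes, the 5% sparsity, and the names are arbitrary) of
the kind of check I have in mind - reconstruct A from its SVD and look at the
error, once for a dense random matrix and once for a mostly-zero one, and
later compare the distributed DML output in the same way:

import java.util.Random;
import org.apache.commons.math3.linear.*;

public class SvdTestSketch {
  // Frobenius norm of (U*S*V^T - A); should be close to zero for a correct SVD
  static double reconstructionError(RealMatrix A) {
    SingularValueDecomposition svd = new SingularValueDecomposition(A);
    RealMatrix rec = svd.getU().multiply(svd.getS()).multiply(svd.getVT());
    return rec.subtract(A).getFrobeniusNorm();
  }

  public static void main(String[] args) {
    Random rnd = new Random(7);
    int m = 200, n = 20;
    double[][] dense = new double[m][n], sparse = new double[m][n];
    for (int i = 0; i < m; i++)
      for (int j = 0; j < n; j++) {
        dense[i][j] = rnd.nextDouble();
        sparse[i][j] = (rnd.nextDouble() < 0.05) ? rnd.nextDouble() : 0.0;  // ~5% non-zeros
      }
    System.out.println(reconstructionError(new Array2DRowRealMatrix(dense)));
    System.out.println(reconstructionError(new Array2DRowRealMatrix(sparse)));
  }
}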

Thanks,
Janardhan
​

On Fri, Jul 21, 2017 at 6:07 AM, Nakul Jindal <na...@gmail.com> wrote:

> Hi Janardhan,
>
>
>
>
> How will GPU implementation help scale distributed SVD:
>
> Imran implemented an algorithm he found out about in the paper "A
> Distributed and Incremental SVD Algorithm for Agglomerative Data Analysis
> on Large Networks" (
> https://github.com/apache/systemml/pull/273/files#diff-
> 488f06e290f7a54db2e125f7bc608971R27
> ).
> The idea there was to build up a distributed SVD using invocations of svd
> on your local machine. He tried to achieve the multi-level parallelism
> through the parfor construct.
> Each local invocation of svd was done using the Apache Commons Math
> library.
> If each invocation of this local svd can instead be done on a GPU, the
> overall wall time for the distributed version would be decreased.
>
> Users may not always have a GPU. In that case, the svd falls back to the
> Apache Comons Math implementation. But if they do and if we have a "svd"
> builtin function, then it would be easier to take advantage of the GPU.
>
>
>
>
>
>
> Problem with scalable svd in dml is due to spark backed issues, otherwise
> there is not problem scaling w/o a local svd():
>
> There maybe spark backend issues and more may come to light and more
> workloads are executed on SystemML.
> For any given operation - we can implement it as a DML bodied function or a
> builtin function.
> For DML Bodied functions:
> Pros:
> - The SystemML optimizer can be applied to it
> - Distribution of SVD is then taken care of by SystemML. It will generate
> and run the spark primitives needed.
>
> Cons:
> - Implementing SVD, whether in DML or C, is a fair amount of work
> - There would not be a straightforward call to the svd gpu library. In
> fact, each of the linear algebra primitives would be accelerated on the
> GPU, but not the entire operation itself. This would involve many more JNI
> calls.
>
> For builtin functions:
> Pros:
> - Use of GPU libraries (cuSolver) and CPU libraries (Apache Commons Math)
> can be made, these are already optimized (in case of the GPU)
> - If a better SVD implementation is available via a library, that can
> easily be plugged in.
>
> Cons:
> - Would have to come up with an algorithm to implement distributed SVD with
> spark primitives
>
> Pick your battle.
>
>
>
>
>
>
> Maybe we could try another algorithm for scalable svd() :
>
> Sure. But before you do that, it may be worth our while to understand what
> is exactly misleading about the paper that Imran talks about.
>
>
>
>
>
> -Nakul
>
>
>
>
>
>
>
> On Thu, Jul 20, 2017 at 4:02 PM, Janardhan Pulivarthi <
> janardhan.pulivarthi@gmail.com> wrote:
>
> > Hi Nakul,
> >
> > Can you help me understand how gpu implementations help scale it. Whether
> > the user always have GPUs in use when using this fn or is it an optional
> > feature.
>
>
> > The problem with implementing the scalable svd() in dml is due to the
> spark
> > backend issues, otherwise there is no problem scaling w/o a local svd()
> > function.
> >
> > May be we could try another algorithm for the scalable svd( ), if the
> > present algorithm is misleading as Imran Younus pointed out.
> >
> > Thank you,
> > Janardhan
> >
>

Re: svd( ) implementation

Posted by Nakul Jindal <na...@gmail.com>.
Hi Janardhan,




How will GPU implementation help scale distributed SVD:

Imran implemented an algorithm he found out about in the paper "A
Distributed and Incremental SVD Algorithm for Agglomerative Data Analysis
on Large Networks" (
https://github.com/apache/systemml/pull/273/files#diff-488f06e290f7a54db2e125f7bc608971R27
).
The idea there was to build up a distributed SVD using invocations of svd
on your local machine. He tried to achieve the multi-level parallelism
through the parfor construct.
Each local invocation of svd was done using the Apache Commons Math library.
If each invocation of this local svd can instead be done on a GPU, the
overall wall time for the distributed version would be decreased.

Users may not always have a GPU. In that case, the svd falls back to the
Apache Commons Math implementation. But if they do, and if we have an "svd"
builtin function, then it would be easier to take advantage of the GPU.
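
For concreteness, the local CPU building block is roughly the following (a
minimal sketch, assuming commons-math3; this is the per-invocation call that
a GPU-backed svd builtin could replace):

import org.apache.commons.math3.linear.*;

public class LocalSvdSketch {
  public static void main(String[] args) {
    double[][] block = {{4, 0}, {3, -5}};            // toy 2x2 block
    RealMatrix A = new Array2DRowRealMatrix(block);
    SingularValueDecomposition svd = new SingularValueDecomposition(A);
    RealMatrix U = svd.getU();
    RealMatrix S = svd.getS();
    RealMatrix V = svd.getV();
    // sanity check: A ~= U * S * V^T
    RealMatrix rec = U.multiply(S).multiply(V.transpose());
    System.out.println(rec.subtract(A).getFrobeniusNorm());
  }
}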






Problem with scalable svd in dml is due to spark backend issues, otherwise
there is no problem scaling w/o a local svd():

There may be spark backend issues, and more may come to light as more
workloads are executed on SystemML.
For any given operation, we can implement it as a DML-bodied function or a
builtin function.
For DML-bodied functions:
Pros:
- The SystemML optimizer can be applied to it
- Distribution of SVD is then taken care of by SystemML. It will generate
and run the spark primitives needed.

Cons:
- Implementing SVD, whether in DML or C, is a fair amount of work
- There would not be a straightforward call to the svd gpu library. In
fact, each of the linear algebra primitives would be accelerated on the
GPU, but not the entire operation itself. This would involve many more JNI
calls.

For builtin functions:
Pros:
- Use of GPU libraries (cuSolver) and CPU libraries (Apache Commons Math)
can be made; these are already optimized (in the case of the GPU)
- If a better SVD implementation is available via a library, that can
easily be plugged in.

Cons:
- Would have to come up with an algorithm to implement distributed SVD with
spark primitives

Pick your battle.






Maybe we could try another algorithm for scalable svd() :

Sure. But before you do that, it may be worth our while to understand what
is exactly misleading about the paper that Imran talks about.





-Nakul







On Thu, Jul 20, 2017 at 4:02 PM, Janardhan Pulivarthi <
janardhan.pulivarthi@gmail.com> wrote:

> Hi Nakul,
>
> Can you help me understand how gpu implementations help scale it. Whether
> the user always have GPUs in use when using this fn or is it an optional
> feature.


> The problem with implementing the scalable svd() in dml is due to the spark
> backend issues, otherwise there is no problem scaling w/o a local svd()
> function.
>
> May be we could try another algorithm for the scalable svd( ), if the
> present algorithm is misleading as Imran Younus pointed out.
>
> Thank you,
> Janardhan
>