Posted to dev@spark.apache.org by liaoyuxi <li...@huawei.com> on 2014/11/18 04:24:20 UTC

matrix computation in spark

Hi,
Matrix computation is critical to the efficiency of algorithms such as least squares and the Kalman filter.
For now, the mllib module offers only limited linear algebra on matrices, especially distributed matrices.

We have been working on distributed matrix computation APIs built on the data structures in MLlib.
The main idea is to partition the matrix into sub-blocks, following the strategy in this paper:
http://www.cs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf
In our experiments this partitioning is communication-optimal,
but operations like factorization may not be appropriate to carry out block-wise.
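
To make the idea concrete, here is a minimal sketch of the block-wise multiply (not our actual API; the block keys, the helper names, and the use of Breeze for the local math are simplified for illustration):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import breeze.linalg.DenseMatrix

object BlockedMultiply {
  // Each block carries its (blockRow, blockCol) coordinates as the key.
  type Block = ((Int, Int), DenseMatrix[Double])

  // Join A's blocks on their column index with B's blocks on their row
  // index, multiply the matching pairs locally, then sum the partial
  // products that land on the same output block.
  def multiply(a: RDD[Block], b: RDD[Block]): RDD[Block] = {
    val aByCol = a.map { case ((i, k), m) => (k, (i, m)) }
    val bByRow = b.map { case ((k, j), m) => (k, (j, m)) }
    aByCol.join(bByRow)
      .map { case (_, ((i, ma), (j, mb))) => ((i, j), ma * mb) }
      .reduceByKey(_ + _)
  }
}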

Any suggestions and guidance are welcome.

Thanks,
Yuxi


Re: matrix computation in spark

Posted by liaoyuxi <li...@huawei.com>.
Hi,
I checked the work in ml-matrix. For now, it doesn't include matrix multiplication or LU decomposition. What's your plan? Can we contribute our work to these parts?
Also, the number of row/column blocks is currently decided manually. As we mentioned, the CARMA method in the paper is communication-optimal.
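
For reference, a small sketch of the splitting rule as we understand it from the CARMA paper: at each recursion level, halve whichever of the three dimensions (m, k, n) is currently largest, until the processors are used up. (The function and variable names here are ours, for illustration only.)

// Returns the sequence of dimensions split, e.g. List('m', 'k', 'm').
def carmaSplits(m: Long, k: Long, n: Long, p: Int): List[Char] =
  if (p <= 1) Nil
  else List(('m', m), ('k', k), ('n', n)).maxBy(_._2)._1 match {
    case 'm' => 'm' :: carmaSplits(m / 2, k, n, p / 2)
    case 'k' => 'k' :: carmaSplits(m, k / 2, n, p / 2)
    case 'n' => 'n' :: carmaSplits(m, k, n / 2, p / 2)
  }

For example, multiplying a 100000 x 1000 matrix by a 1000 x 1000 matrix on 8 processors would split the large m dimension three times, rather than using a fixed row/column grid chosen by hand.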

From: Zongheng Yang [mailto:zongheng.y@gmail.com]
Sent: November 18, 2014 11:37
To: liaoyuxi; dev@spark.incubator.apache.org
Cc: Shivaram Venkataraman
Subject: Re: matrix computation in spark

There's been some work at the AMPLab on a distributed matrix library on top of Spark; see [1]. In particular, the repo contains a couple of factorization algorithms.

[1] https://github.com/amplab/ml-matrix

Zongheng

On Mon Nov 17 2014 at 7:34:17 PM liaoyuxi <li...@huawei.com> wrote:
Hi,
Matrix computation is critical to the efficiency of algorithms such as least squares and the Kalman filter.
For now, the mllib module offers only limited linear algebra on matrices, especially distributed matrices.

We have been working on distributed matrix computation APIs built on the data structures in MLlib.
The main idea is to partition the matrix into sub-blocks, following the strategy in this paper:
http://www.cs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf
In our experiments this partitioning is communication-optimal,
but operations like factorization may not be appropriate to carry out block-wise.

Any suggestions and guidance are welcome.

Thanks,
Yuxi

Re: matrix computation in spark

Posted by Reza Zadeh <re...@databricks.com>.
Hi Yuxi,

We are integrating ml-matrix from the AMPLab repo into MLlib, tracked
by this JIRA: https://issues.apache.org/jira/browse/SPARK-3434

We already have matrix multiplication but are missing LU decomposition. Could
you please follow that JIRA? Once the initial design is in, we can sync on
how to contribute LU decomposition.
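
For anyone who wants to experiment in the meantime, here is a plain unpivoted Doolittle LU on a single local block (illustration only, with hypothetical names; a real distributed version needs pivoting and would factor the diagonal block, then update the trailing blocks):

// Factor a square matrix a into unit-lower-triangular l and
// upper-triangular u with a = l * u. No pivoting, so it fails
// when a zero appears on the diagonal.
def luDoolittle(a: Array[Array[Double]]): (Array[Array[Double]], Array[Array[Double]]) = {
  val n = a.length
  val l = Array.tabulate(n, n)((i, j) => if (i == j) 1.0 else 0.0)
  val u = Array.ofDim[Double](n, n)
  for (i <- 0 until n) {
    for (j <- i until n)        // row i of u
      u(i)(j) = a(i)(j) - (0 until i).map(k => l(i)(k) * u(k)(j)).sum
    for (j <- i + 1 until n)    // column i of l, below the diagonal
      l(j)(i) = (a(j)(i) - (0 until i).map(k => l(j)(k) * u(k)(i)).sum) / u(i)(i)
  }
  (l, u)
}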

Let's move the discussion to the JIRA.

Thanks!

On Mon, Nov 17, 2014 at 9:49 PM, 顾荣 <gu...@gmail.com> wrote:

> Hey Yuxi,
>
> We have also implemented a distributed matrix multiplication library at
> PasaLab. The repo is hosted here: https://github.com/PasaLab/marlin . We
> implemented three distributed matrix multiplication algorithms on Spark. As
> we see it, communication-optimal does not always mean optimal overall.
> Thus, besides the CARMA matrix multiplication you mentioned, we also
> implemented block-splitting matrix multiplication and broadcast matrix
> multiplication. They are more efficient than CARMA matrix multiplication in
> some situations, for example when a large matrix multiplies a small matrix.
>
> Actually, we shared this work at the Spark Meetup@Beijing on October
> 26th (
> http://www.meetup.com/spark-user-beijing-Meetup/events/210422112/ ). The
> slides can be downloaded from the archive here:
> http://pan.baidu.com/s/1dDoyHX3#path=%252Fmeetup-3rd
>
> Best,
> Rong
>
> 2014-11-18 13:11 GMT+08:00 顾荣 <gu...@gmail.com>:
>
> > Hey Yuxi,
> >
> > We have also implemented a distributed matrix multiplication library at
> > PasaLab. The repo is hosted here: https://github.com/PasaLab/marlin . We
> > implemented three distributed matrix multiplication algorithms on Spark.
> > As we see it, communication-optimal does not always mean optimal overall.
> > Thus, besides the CARMA matrix multiplication you mentioned, we also
> > implemented block-splitting matrix multiplication and broadcast matrix
> > multiplication. They are more efficient than CARMA matrix multiplication
> > in some situations, for example when a large matrix multiplies a small
> > matrix.
> >
> > Actually, we shared this work at the Spark Meetup@Beijing on October
> > 26th ( http://www.meetup.com/spark-user-beijing-Meetup/events/210422112/
> > ). The slides are also attached to this mail.
> >
> > Best,
> > Rong
> >
> > 2014-11-18 11:36 GMT+08:00 Zongheng Yang <zo...@gmail.com>:
> >
> >> There's been some work at the AMPLab on a distributed matrix library on
> >> top of Spark; see [1]. In particular, the repo contains a couple of
> >> factorization algorithms.
> >>
> >> [1] https://github.com/amplab/ml-matrix
> >>
> >> Zongheng
> >>
> >> On Mon Nov 17 2014 at 7:34:17 PM liaoyuxi <li...@huawei.com> wrote:
> >>
> >> > Hi,
> >> > Matrix computation is critical to the efficiency of algorithms such as
> >> > least squares and the Kalman filter.
> >> > For now, the mllib module offers only limited linear algebra on
> >> > matrices, especially distributed matrices.
> >> >
> >> > We have been working on distributed matrix computation APIs built on
> >> > the data structures in MLlib.
> >> > The main idea is to partition the matrix into sub-blocks, following the
> >> > strategy in this paper:
> >> > http://www.cs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf
> >> > In our experiments this partitioning is communication-optimal,
> >> > but operations like factorization may not be appropriate to carry out
> >> > block-wise.
> >> >
> >> > Any suggestions and guidance are welcome.
> >> >
> >> > Thanks,
> >> > Yuxi
> >> >
> >> >
> >>
> >
> >
> >
> > --
> > ------------------
> > Rong Gu
> > Department of Computer Science and Technology
> > State Key Laboratory for Novel Software Technology
> > Nanjing University
> > Phone: +86 15850682791
> > Email: gurongwalker@gmail.com
> > Homepage: http://pasa-bigdata.nju.edu.cn/people/ronggu/
> >
>
>
>
> --
> ------------------
> Rong Gu
> Department of Computer Science and Technology
> State Key Laboratory for Novel Software Technology
> Nanjing University
> Phone: +86 15850682791
> Email: gurongwalker@gmail.com
> Homepage: http://pasa-bigdata.nju.edu.cn/people/ronggu/
>

Re: matrix computation in spark

Posted by 顾荣 <gu...@gmail.com>.
Hey Yuxi,

We have also implemented a distributed matrix multiplication library at
PasaLab. The repo is hosted here: https://github.com/PasaLab/marlin . We
implemented three distributed matrix multiplication algorithms on Spark. As
we see it, communication-optimal does not always mean optimal overall.
Thus, besides the CARMA matrix multiplication you mentioned, we also
implemented block-splitting matrix multiplication and broadcast matrix
multiplication. They are more efficient than CARMA matrix multiplication in
some situations, for example when a large matrix multiplies a small matrix.
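
To illustrate why the broadcast variant wins in that case: when B is small enough to fit in every executor's memory, Spark can ship it once and each row block of A is multiplied locally, with no shuffle at all. A simplified sketch (not Marlin's actual API; we assume Breeze for the local math and the names are for illustration):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import breeze.linalg.DenseMatrix

// A is stored as row blocks keyed by block-row index; B is broadcast whole.
// The only data movement is the one-time broadcast of B.
def broadcastMultiply(sc: SparkContext,
                      aRowBlocks: RDD[(Int, DenseMatrix[Double])],
                      bSmall: DenseMatrix[Double]): RDD[(Int, DenseMatrix[Double])] = {
  val bBroadcast = sc.broadcast(bSmall)
  aRowBlocks.mapValues(block => block * bBroadcast.value)
}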

Actually, we shared this work at the Spark Meetup@Beijing on October 26th (
http://www.meetup.com/spark-user-beijing-Meetup/events/210422112/ ). The
slides can be downloaded from the archive here:
http://pan.baidu.com/s/1dDoyHX3#path=%252Fmeetup-3rd

Best,
Rong

2014-11-18 13:11 GMT+08:00 顾荣 <gu...@gmail.com>:

> Hey Yuxi,
>
> We have also implemented a distributed matrix multiplication library at
> PasaLab. The repo is hosted here: https://github.com/PasaLab/marlin . We
> implemented three distributed matrix multiplication algorithms on Spark. As
> we see it, communication-optimal does not always mean optimal overall.
> Thus, besides the CARMA matrix multiplication you mentioned, we also
> implemented block-splitting matrix multiplication and broadcast matrix
> multiplication. They are more efficient than CARMA matrix multiplication in
> some situations, for example when a large matrix multiplies a small matrix.
>
> Actually, we shared this work at the Spark Meetup@Beijing on October
> 26th ( http://www.meetup.com/spark-user-beijing-Meetup/events/210422112/
> ). The slides are also attached to this mail.
>
> Best,
> Rong
>
> 2014-11-18 11:36 GMT+08:00 Zongheng Yang <zo...@gmail.com>:
>
>> There's been some work at the AMPLab on a distributed matrix library on
>> top of Spark; see [1]. In particular, the repo contains a couple of
>> factorization algorithms.
>>
>> [1] https://github.com/amplab/ml-matrix
>>
>> Zongheng
>>
>> On Mon Nov 17 2014 at 7:34:17 PM liaoyuxi <li...@huawei.com> wrote:
>>
>> > Hi,
>> > Matrix computation is critical to the efficiency of algorithms such as
>> > least squares and the Kalman filter.
>> > For now, the mllib module offers only limited linear algebra on
>> > matrices, especially distributed matrices.
>> >
>> > We have been working on distributed matrix computation APIs built on the
>> > data structures in MLlib.
>> > The main idea is to partition the matrix into sub-blocks, following the
>> > strategy in this paper:
>> > http://www.cs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf
>> > In our experiments this partitioning is communication-optimal,
>> > but operations like factorization may not be appropriate to carry out
>> > block-wise.
>> >
>> > Any suggestions and guidance are welcome.
>> >
>> > Thanks,
>> > Yuxi
>> >
>> >
>>
>
>
>
> --
> ------------------
> Rong Gu
> Department of Computer Science and Technology
> State Key Laboratory for Novel Software Technology
> Nanjing University
> Phone: +86 15850682791
> Email: gurongwalker@gmail.com
> Homepage: http://pasa-bigdata.nju.edu.cn/people/ronggu/
>



-- 
------------------
Rong Gu
Department of Computer Science and Technology
State Key Laboratory for Novel Software Technology
Nanjing University
Phone: +86 15850682791
Email: gurongwalker@gmail.com
Homepage: http://pasa-bigdata.nju.edu.cn/people/ronggu/

Re: matrix computation in spark

Posted by Zongheng Yang <zo...@gmail.com>.
There's been some work at the AMPLab on a distributed matrix library on top
of Spark; see [1]. In particular, the repo contains a couple of
factorization algorithms.

[1] https://github.com/amplab/ml-matrix

Zongheng

On Mon Nov 17 2014 at 7:34:17 PM liaoyuxi <li...@huawei.com> wrote:

> Hi,
> Matrix computation is critical to the efficiency of algorithms such as
> least squares and the Kalman filter.
> For now, the mllib module offers only limited linear algebra on matrices,
> especially distributed matrices.
>
> We have been working on distributed matrix computation APIs built on the
> data structures in MLlib.
> The main idea is to partition the matrix into sub-blocks, following the
> strategy in this paper:
> http://www.cs.berkeley.edu/~odedsc/papers/bfsdfs-mm-ipdps13.pdf
> In our experiments this partitioning is communication-optimal,
> but operations like factorization may not be appropriate to carry out
> block-wise.
>
> Any suggestions and guidance are welcome.
>
> Thanks,
> Yuxi
>
>