Posted to issues@spark.apache.org by "Matthias Boehm (JIRA)" <ji...@apache.org> on 2016/06/12 00:16:20 UTC

[jira] [Commented] (SPARK-15882) Discuss distributed linear algebra in spark.ml package

    [ https://issues.apache.org/jira/browse/SPARK-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326101#comment-15326101 ] 

Matthias Boehm commented on SPARK-15882:
----------------------------------------

I really like this direction and think it has the potential to become a higher-level API for Spark ML, much as DataFrames and Datasets have become for Spark SQL.

If there is interest, we'd like to help contribute to this feature by porting a subset of the distributed linear algebra operations from SystemML.

General Goals: From my perspective, we should aim for an API that hides the underlying data representation (e.g., RDD vs. Dataset, sparse vs. dense, blocking configurations, block/row/coordinate layouts, partitioning, etc.). Furthermore, it would be great to make it easy to swap out the underlying local matrix library. This approach would let people plug in custom operations (e.g., native BLAS kernels or compressed block operations) while still relying on a common API and a common scheme for distributing blocks.
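
To make this concrete, here is a minimal sketch of what such a representation-hiding, pluggable API could look like; DistributedMatrix and BlockOps are hypothetical names for illustration, not existing Spark or SystemML classes:

{code:scala}
import org.apache.spark.ml.linalg.Matrix

// Pluggable local block operations: implementations could delegate to
// plain JVM code, native BLAS kernels, or compressed block formats.
trait BlockOps extends Serializable {
  def multiply(a: Matrix, b: Matrix): Matrix
  def add(a: Matrix, b: Matrix): Matrix
}

// The public handle: callers never see whether blocks are sparse or dense,
// how the matrix is blocked, or how the blocks are partitioned.
trait DistributedMatrix {
  def numRows: Long
  def numCols: Long
  def multiply(other: DistributedMatrix): DistributedMatrix
  def transpose: DistributedMatrix
}
{code}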

RDDs over Datasets: For the internal implementation, I would favor RDDs over Datasets because (1) RDDs allow for more flexibility (e.g., reduceByKey, combineByKey, partitioning-preserving operations), and (2) encoders don't offer much benefit for blocked representations, as the per-block overhead is typically negligible relative to the block payload.
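
For illustration, here is a rough sketch of a shuffle-based block matrix multiply over a hypothetical blocked layout (blocks keyed by block indices; localMult/localAdd stand in for pluggable local kernels); the per-output-block aggregation via reduceByKey is the kind of primitive argued for above:

{code:scala}
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.rdd.RDD

// Blocks keyed by (rowBlockIndex, colBlockIndex); C(i,j) = sum_k A(i,k) * B(k,j).
def multiply(
    a: RDD[((Int, Int), Matrix)],
    b: RDD[((Int, Int), Matrix)],
    localMult: (Matrix, Matrix) => Matrix,
    localAdd: (Matrix, Matrix) => Matrix): RDD[((Int, Int), Matrix)] = {
  // Re-key both sides by the common dimension k, join to form block pairs,
  // multiply locally, and aggregate partial products per output block.
  val left  = a.map { case ((i, k), block) => (k, (i, block)) }
  val right = b.map { case ((k, j), block) => (k, (j, block)) }
  left.join(right)
    .map { case (_, ((i, ablk), (j, bblk))) => ((i, j), localMult(ablk, bblk)) }
    .reduceByKey(localAdd)
}
{code}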

Basic Operations: Initially, I would start with a small, well-defined set of operations including matrix multiplications, unary and binary operations (e.g., arithmetic/comparison), unary aggregates (e.g., sum/rowSums/colSums, min/max/mean/sd), reorg operations (transpose/diag/reshape/order), and cumulative aggregates (e.g., cumsum).
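
As an example of how these decompose over blocks, here is a rough sketch of sum and colSums for the same hypothetical blocked layout (toArray densifies sparse blocks, so a real implementation would exploit sparsity):

{code:scala}
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.rdd.RDD

// Full aggregate: per-block partial sums, then a single global reduce.
def sum(blocks: RDD[((Int, Int), Matrix)]): Double =
  blocks.map { case (_, block) => block.toArray.sum }.sum()

// Column aggregate: partial column sums per block, combined per
// column-block index j.
def colSums(blocks: RDD[((Int, Int), Matrix)]): RDD[(Int, Array[Double])] =
  blocks
    .map { case ((_, j), block) =>
      val partial = new Array[Double](block.numCols)
      for (c <- 0 until block.numCols; r <- 0 until block.numRows)
        partial(c) += block(r, c)
      (j, partial)
    }
    .reduceByKey { (x, y) => x.indices.foreach(i => x(i) += y(i)); x }
{code}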

Towards Optimization: Internally, we could implement alternative operations but hide them under a common interface. For example, matrix multiplication would be exposed as 'multiply' (consistent with the local linalg API); internally, however, we would select among alternative physical operators (see https://github.com/apache/incubator-systemml/blob/master/docs/devdocs/MatrixMultiplicationOperators.txt), based on a simple rule set or user-provided hints, as done in Spark SQL. Later, we could think about a more sophisticated optimizer, potentially relying on the existing Catalyst infrastructure. What do you think?
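
As a sketch of such a rule set (operator names follow the SystemML document linked above; the signature and the broadcast threshold are illustrative assumptions, not tuned values):

{code:scala}
// Alternative physical operators for distributed matrix multiplication.
sealed trait MatMultOp
case object MapMM extends MatMultOp // broadcast the smaller input, map-side multiply
case object CpMM  extends MatMultOp // join on the common dimension, aggregate partial products
case object RMM   extends MatMultOp // replicate blocks of both inputs

// First cut: broadcast-based multiply whenever one side fits the threshold.
def selectOperator(
    leftSizeBytes: Long,
    rightSizeBytes: Long,
    broadcastThreshold: Long = 512L << 20): MatMultOp =
  if (math.min(leftSizeBytes, rightSizeBytes) <= broadcastThreshold) MapMM
  else CpMM // a fuller rule set would also weigh RMM, sparsity, and user hints
{code}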

> Discuss distributed linear algebra in spark.ml package
> ------------------------------------------------------
>
>                 Key: SPARK-15882
>                 URL: https://issues.apache.org/jira/browse/SPARK-15882
>             Project: Spark
>          Issue Type: Brainstorming
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> This JIRA is for discussing how org.apache.spark.mllib.linalg.distributed.* should be migrated to org.apache.spark.ml.
> Initial questions:
> * Should we use Datasets or RDDs underneath?
> * If Datasets, are there missing features needed for the migration?
> * Do we want to redesign any aspects of the distributed matrices during this move?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org