Posted to issues@systemml.apache.org by "Matthias Boehm (JIRA)" <ji...@apache.org> on 2017/03/09 07:27:38 UTC

[jira] [Resolved] (SYSTEMML-1378) Native dataset support in parfor spark datapartition-execute

     [ https://issues.apache.org/jira/browse/SYSTEMML-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Boehm resolved SYSTEMML-1378.
--------------------------------------
    Resolution: Done

> Native dataset support in parfor spark datapartition-execute
> ------------------------------------------------------------
>
>                 Key: SYSTEMML-1378
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1378
>             Project: SystemML
>          Issue Type: Sub-task
>          Components: APIs, Runtime
>            Reporter: Matthias Boehm
>            Assignee: Matthias Boehm
>             Fix For: SystemML 1.0
>
>
> This task aims for a deeper integration of Spark Datasets into SystemML. Consider the following example scenario, invoked through MLContext with X being a Dataset<Row> (a minimal invocation sketch follows the script):
> {code}
> X = read(...)
> parfor( i in 1:nrow(X) ) {
>     Xi = X[i, ]
>     v[i, 1] = ... some computation over Xi
> }
> {code}
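> For reference, a minimal MLContext invocation of such a script might look as follows. This is a sketch only: the input path, variable names, and the per-row computation are illustrative assumptions, not part of this task.
> {code}
> import org.apache.spark.sql.{Dataset, Row, SparkSession}
> import org.apache.sysml.api.mlcontext.MLContext
> import org.apache.sysml.api.mlcontext.ScriptFactory.dml
> 
> val spark = SparkSession.builder().appName("parfor-dataset").getOrCreate()
> // hypothetical input; X is passed into the script as a Dataset<Row>
> val X: Dataset[Row] = spark.read.parquet("hdfs:/tmp/X.parquet")
> 
> val ml = new MLContext(spark)
> val script = dml(
>   """v = matrix(0, rows=nrow(X), cols=1)
>     |parfor( i in 1:nrow(X) ) {
>     |  Xi = X[i, ]
>     |  v[i, 1] = sum(Xi)   # stand-in for '... some computation over Xi'
>     |}""".stripMargin)
>   .in("X", X)
>   .out("v")
> val v = ml.execute(script).getMatrix("v")
> {code}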
> Currently, we would convert the input dataset to binary block (1st shuffle) at the API level and subsequently pass it into SystemML. For large data, we would then compile a single parfor data-partition-execute job that slices row fragments, collects the row fragments into partitions (2nd shuffle), and finally executes the parfor body per partition.
> Native dataset support would allow us to avoid these two shuffles and compute the entire parfor in a data-local manner. In detail, this involves the following extensions:
> * API level: Keep the lineage of the input dataset, leveraging our existing lineage mechanism in {{MatrixObject}}
> * Parfor datapartition-execute: SYSTEMML-1367 already introduced data-local processing for special cases (if ncol <= blocksize). Given the lineage, we can simply probe the input to datapartition-execute and, for row partitioning, directly use the dataset instead of the reblocked matrix RDD in a data-local manner (see the sketch after this list). This avoids not just the 2nd shuffle but, due to lazy evaluation, also the 1st shuffle whenever no operation other than the parfor accesses X (except a zipWithIndex if no row ids are passed in, as this transformation triggers computation)
> * Cleanup: Prevent cleanup (unpersist) of lineage objects of type dataset as they are passed from outside.
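> To make the intended data-local path concrete, here is a rough sketch of the row-partitioning case. This is not SystemML internals: {{parforBody}}, the dense-row conversion, and the result collection are illustrative assumptions that stand in for the compiled parfor body and the actual block handling.
> {code}
> import org.apache.spark.sql.{Dataset, Row}
> 
> // stand-in for the compiled parfor body executed by the remote worker,
> // e.g., v[i, 1] = sum(Xi)
> def parforBody(rowIx: Long, xi: Array[Double]): (Long, Double) =
>   (rowIx, xi.sum)
> 
> // Data-local execution over the dataset: no reblock of X (1st shuffle) and
> // no re-grouping of row fragments into partitions (2nd shuffle); each
> // dataset partition is processed in place.
> def dataLocalParfor(X: Dataset[Row]): Array[(Long, Double)] = {
>   X.rdd
>     .zipWithIndex()                 // only needed if no row ids are passed in
>     .mapPartitions { it =>
>       it.map { case (row, ix) =>
>         // convert one Row into a dense row vector (assumes double-typed columns)
>         val xi = (0 until row.length).map(j => row.getDouble(j)).toArray
>         parforBody(ix + 1, xi)      // 1-based row index as in DML
>       }
>     }
>     .collect()
> }
> {code}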



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)