Posted to issues@systemml.apache.org by "Matthias Boehm (JIRA)" <ji...@apache.org> on 2017/03/09 03:28:37 UTC

[jira] [Updated] (SYSTEMML-1378) Native dataset support in parfor spark datapartition-execute

     [ https://issues.apache.org/jira/browse/SYSTEMML-1378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Boehm updated SYSTEMML-1378:
-------------------------------------
    Description: 
This task aims for a deeper integration of Spark Datasets into SystemML. Consider the following example scenario, invoked through MLContext with X being a Dataset<Row>:

{code}
X = read(...)
parfor( i in 1:nrow(X) ) {
    Xi = X[i, ]
    v[i, 1] = ... some computation over Xi
}
{code}
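
For reference, such a scenario could be invoked from Scala roughly as follows (a minimal sketch against the MLContext API; the parquet source and the {{sum(Xi)}} body are placeholders for the elided read and per-row computation):

{code}
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.sysml.api.mlcontext.MLContext
import org.apache.sysml.api.mlcontext.ScriptFactory.dml

val spark = SparkSession.builder().getOrCreate()
val ml = new MLContext(spark)

// X is passed in as a Dataset<Row>; the source is elided as in the example
val X: Dataset[Row] = spark.read.parquet("...")

val script = dml(
  """v = matrix(0, rows=nrow(X), cols=1)
    |parfor( i in 1:nrow(X) ) {
    |  Xi = X[i, ]
    |  v[i, 1] = sum(Xi)  # placeholder for the per-row computation
    |}""".stripMargin)
  .in("X", X)
  .out("v")

val v = ml.execute(script).getMatrix("v")
{code}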

Currently, we would convert the input dataset to binary block (1st shuffle) at the API level and subsequently pass it into SystemML. For large data, we would then compile a single parfor datapartition-execute job that slices row fragments, collects these fragments into partitions (2nd shuffle), and finally executes the parfor body per partition.
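
For intuition, the current path corresponds roughly to the following conceptual Scala sketch (heavily simplified; {{rowToDense}} is a hypothetical helper, and SystemML's actual reblock and data-partitioning primitives differ in detail):

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}

// Hypothetical helper: extract a dense double[] from a Row.
def rowToDense(r: Row): Array[Double] =
  (0 until r.length).map(r.getDouble).toArray

def currentTwoShufflePath(df: Dataset[Row], blen: Long): RDD[(Long, Iterable[Array[Double]])] = {
  // 1st shuffle: reblock the input dataset into binary blocks of blen rows each
  val blocks = df.rdd
    .zipWithIndex()                                      // assign row ids (triggers a job)
    .map { case (row, i) => (i / blen, (i % blen, rowToDense(row))) }
    .groupByKey()                                        // shuffle 1: rows -> blocks

  // 2nd shuffle: slice row fragments out of the blocks and collect them
  // into the row partitions processed by the parfor body
  blocks
    .flatMap { case (bi, rows) => rows.map { case (j, vals) => (bi * blen + j, vals) } }
    .groupByKey()                                        // shuffle 2: fragments -> partitions
}
{code}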

Native dataset support would allow us to avoid these two shuffles and compute the entire parfor in a data-local manner. In detail, this involves the following minor extensions:
* API level: Keep the lineage of the input dataset, leveraging our existing lineage mechanism in {{MatrixObject}}.
* Parfor datapartition-execute: SYSTEMML-1367 already introduced data-local processing for special cases (if ncol <= blocksize). Given the lineage, we can simply probe the input to datapartition-execute and, for row partitioning, directly use the dataset instead of the reblocked matrix RDD in a data-local manner (see the sketch after this list). This not only avoids the 2nd shuffle but, due to lazy evaluation, also the 1st shuffle (except for zipWithIndex if no ids are passed in, as this transformation triggers computation).
* Cleanup: Prevent cleanup (unpersist) of RDD objects of type dataset, as they are passed in from outside.
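
A rough sketch of the probe-and-reuse logic described above (names such as {{datasetFromLineage}} and {{execParforBody}} are illustrative stand-ins for SystemML-internal functionality, not actual API):

{code}
import org.apache.spark.sql.{Dataset, Row}

// Illustrative sketch only, not the actual SystemML runtime code.
def dataPartitionExecute(
    datasetFromLineage: Option[Dataset[Row]],           // result of the lineage probe
    shuffleBasedPath: () => Array[(Long, Double)],      // existing reblock + partition path
    execParforBody: Iterator[Row] => Iterator[(Long, Double)]): Array[(Long, Double)] = {

  datasetFromLineage match {
    case Some(ds) =>
      // Row partitioning: execute the parfor body directly per dataset
      // partition, avoiding the partitioning (2nd) shuffle; due to lazy
      // evaluation, the dataset-to-binary-block conversion (1st shuffle)
      // is never triggered either.
      // Note: ds must NOT be unpersisted on cleanup, as it was passed
      // in from outside.
      ds.rdd.mapPartitions(execParforBody).collect()
    case None =>
      shuffleBasedPath()  // no dataset lineage: fall back to the shuffle-based path
  }
}
{code}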



> Native dataset support in parfor spark datapartition-execute
> ------------------------------------------------------------
>
>                 Key: SYSTEMML-1378
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1378
>             Project: SystemML
>          Issue Type: Sub-task
>          Components: APIs, Runtime
>            Reporter: Matthias Boehm
>             Fix For: SystemML 1.0



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)