Posted to issues@systemml.apache.org by "Matthias Boehm (JIRA)" <ji...@apache.org> on 2016/07/11 03:50:11 UTC

[jira] [Closed] (SYSTEMML-560) Distributed frame representation

     [ https://issues.apache.org/jira/browse/SYSTEMML-560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias Boehm closed SYSTEMML-560.
-----------------------------------

> Distributed frame representation
> --------------------------------
>
>                 Key: SYSTEMML-560
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-560
>             Project: SystemML
>          Issue Type: Task
>            Reporter: Matthias Boehm
>            Assignee: Arvind Surve
>
> The major design goals for our distributed binary block frame representation are twofold:
> * Seamless integration: First, we aim for a seamless integration with (1) Spark's DataFrame and DataSet representations, (2) csv text formats, and (3) SystemML's binary block matrix representations.
> * Memory efficiency: Second, we are still interested in a block representation to exploit the column-wise native array storage of FrameBlocks.  
> As a good compromise with regard to both design goals, the initial design proposal is 
> {code}
> FRAME := JavaPairRDD<Long, FrameBlock> 
> {code}
> where the keys represent the row offsets of the frameblock values, each frameblock value covers one or multiple rows and all columns of the frame, and, most importantly, frameblock values do not exhibit a fixed block size. NOTE that in comparison to Spark's data frames, SystemML's frames are row-indexed (not a set of rows) in order to allow well-defined indexing operations over frames (as possible in R).
> This representation would allow a shuffle-free conversion from DataFrames, DataSets, and CSV to SystemML's Frames and vice versa, while still exploiting a block structure whenever possible (moderate numbers of columns). Similarly, binary block matrix to frame conversions can also be done without shuffle in the common case ncol <= blocksize (default 1k). Finally, this representation also seems to be advantageous with regard to the common frame operations of transform, transform apply, indexing, append, and transform decode.
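
The layout described above can be illustrated with a small local sketch in plain Java, without Spark. It is not SystemML's implementation: `SimpleFrameBlock`, `FrameSketch`, `fromPartitions`, and `get` are hypothetical names, a `TreeMap<Long, SimpleFrameBlock>` stands in for the `JavaPairRDD<Long, FrameBlock>`, and the keys are assumed to be the 1-based index of each block's first row. Each input partition becomes exactly one block of whatever size the partition happens to have, so no rows move between partitions. In a distributed setting the offsets would presumably be derived from partition sizes rather than computed in a sequential loop as done here.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a FrameBlock: column-wise storage, one array per column.
class SimpleFrameBlock {
    final String[][] columns; // columns[c][r] = value of column c at local row r

    SimpleFrameBlock(List<String[]> rows, int numCols) {
        columns = new String[numCols][rows.size()];
        for (int r = 0; r < rows.size(); r++)
            for (int c = 0; c < numCols; c++)
                columns[c][r] = rows.get(r)[c];
    }

    int numRows() {
        return columns.length == 0 ? 0 : columns[0].length;
    }
}

class FrameSketch {
    // Local stand-in for JavaPairRDD<Long, FrameBlock>.
    // Assumption: the key is the 1-based global index of the block's first row.
    static TreeMap<Long, SimpleFrameBlock> fromPartitions(
            List<List<String[]>> partitions, int numCols) {
        TreeMap<Long, SimpleFrameBlock> frame = new TreeMap<>();
        long offset = 1;
        for (List<String[]> part : partitions) {
            if (part.isEmpty())
                continue; // empty partitions produce no block
            // One block per partition: variable block size, all columns.
            frame.put(offset, new SimpleFrameBlock(part, numCols));
            offset += part.size();
        }
        return frame;
    }

    // Row-indexed access: find the block whose offset is the largest key <= row.
    static String get(TreeMap<Long, SimpleFrameBlock> frame, long row, int col) {
        Map.Entry<Long, SimpleFrameBlock> e = frame.floorEntry(row);
        if (e == null)
            throw new IndexOutOfBoundsException("row " + row);
        int local = (int) (row - e.getKey());
        if (local >= e.getValue().numRows())
            throw new IndexOutOfBoundsException("row " + row);
        return e.getValue().columns[col][local];
    }
}
```

With two partitions of three rows and one row, `fromPartitions` yields blocks keyed 1 and 4, and `get(frame, 3, 0)` reads the first column of the third row from the first block. Because the row offset is explicit in the key, indexing stays well defined even though block sizes differ.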


