Posted to issues@systemml.apache.org by "Niketan Pansare (JIRA)" <ji...@apache.org> on 2016/05/11 21:09:12 UTC
[jira] [Commented] (SYSTEMML-593) MLContext Redesign

    [ https://issues.apache.org/jira/browse/SYSTEMML-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280832#comment-15280832 ] 

Niketan Pansare commented on SYSTEMML-593:
------------------------------------------

Thanks [~deron] for creating the design document. It improves the usability of MLContext a lot.

I like the common "in" interface, which lets users pass both data and command-line arguments. I also like that the "in" method uses a $ prefix for command-line variables. As a result, in(String, RDD/DataFrame) maps to registerInput, and in(String, boolean/double/float/int/String) maps to command-line arguments. This design also avoids the need to cast boolean/double/float/int values to String.
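
A minimal sketch of this mapping, assuming a hypothetical `ScriptSketch` class (not the actual SystemML API). A `double[][]` stands in for RDD/DataFrame so the example has no Spark dependency, and the rule that scalar arguments must carry the $ prefix is an assumption of the sketch rather than something the design document is known to enforce:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the proposed Script class; not the actual SystemML API.
class ScriptSketch {
    // "$name" bindings, i.e. command-line style arguments, kept in their native types.
    final Map<String, Object> args = new LinkedHashMap<>();
    // Plain-name bindings, i.e. data inputs (the role registerInput plays today).
    final Map<String, double[][]> inputs = new LinkedHashMap<>();

    ScriptSketch in(String name, boolean value) { return bindArg(name, value); }
    ScriptSketch in(String name, int value)     { return bindArg(name, value); }
    ScriptSketch in(String name, double value)  { return bindArg(name, value); } // float widens to double
    ScriptSketch in(String name, String value)  { return bindArg(name, value); }

    // double[][] is a placeholder for RDD/DataFrame in this sketch.
    ScriptSketch in(String name, double[][] data) {
        if (name.startsWith("$")) {
            throw new IllegalArgumentException(
                "Data input '" + name + "' must not use the $ prefix");
        }
        inputs.put(name, data);
        return this;
    }

    private ScriptSketch bindArg(String name, Object value) {
        // Assumption of this sketch: scalar arguments must carry the $ prefix.
        if (!name.startsWith("$")) {
            throw new IllegalArgumentException(
                "Command-line argument '" + name + "' needs a $ prefix");
        }
        args.put(name, value);
        return this;
    }
}
```

With this shape, a call such as `new ScriptSketch().in("$maxIter", 10).in("X", data)` stores the integer as an Integer rather than a String, and the overload chosen by the compiler decides whether a binding is data or a command-line argument.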

I also like the Script abstraction as it avoids overloaded execute methods (for example: PyDML, DML, ...).

A few thoughts/suggestions:
1. The current MLContext lets users pass an RDD/DataFrame to the script using "registerInput". In the proposed design, we pass the RDD/DataFrame through ".in(...)". However, the registerInput method also accepts the format and the meta-data information. In some cases the format is required but the meta-data is optional, and in other cases both are required. We need to add appropriate guards to the new MLContext.
For example: we should not support `script.in("A", sc.textFile("m.csv"))`, because an RDD<String> can hold either "csv" or "text" format. Also, `script.in("A", sc.textFile("m.text"), "text")` should throw an error stating that meta-data is required.
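
One way such guards could look, as a sketch only. `InputGuardSketch`, `Meta`, and `check` are hypothetical names, and treating "csv" as the case where meta-data is optional is an assumption read off the two examples above:

```java
// Hypothetical guard for string-based (RDD<String>-like) inputs; not SystemML code.
class InputGuardSketch {
    // Minimal stand-in for matrix meta-data (dimensions only).
    static final class Meta {
        final long rows;
        final long cols;
        Meta(long rows, long cols) { this.rows = rows; this.cols = cols; }
    }

    // Throws if the format/meta-data combination is not sufficient to read the input.
    static void check(String name, String format, Meta meta) {
        if (format == null) {
            throw new IllegalArgumentException(
                "Input '" + name + "': format is required (\"csv\" or \"text\")");
        }
        switch (format) {
            case "csv":
                return; // assumption: meta-data is optional for csv
            case "text":
                if (meta == null) {
                    throw new IllegalArgumentException(
                        "Input '" + name + "': meta-data is required for the text format");
                }
                return;
            default:
                throw new IllegalArgumentException(
                    "Input '" + name + "': unsupported format '" + format + "'");
        }
    }
}
```

Here `check("A", null, null)` and `check("A", "text", null)` both fail with a descriptive message, while `check("A", "csv", null)` passes.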

2. The DML language semantics should be respected. For example: if the script contains the line `X = read($fileX)`, then providing .in("X", ...) but not .in("$fileX", ...) should throw an error.
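
A rough sketch of such a check, assuming a hypothetical helper `ArgCheckSketch.missingArgs` that scans the script text with a regular expression. A real implementation would presumably work on the parsed script; this version does not account for string literals, comments, or defaults supplied inside the script:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a plain text scan, not a DML parser.
class ArgCheckSketch {
    // Matches $-prefixed identifiers such as $fileX.
    private static final Pattern DOLLAR_VAR =
        Pattern.compile("\\$[A-Za-z_][A-Za-z0-9_]*");

    // Returns every $-variable used in the script that has no binding.
    static Set<String> missingArgs(String script, Set<String> boundNames) {
        Set<String> missing = new LinkedHashSet<>();
        Matcher m = DOLLAR_VAR.matcher(script);
        while (m.find()) {
            if (!boundNames.contains(m.group())) {
                missing.add(m.group());
            }
        }
        return missing;
    }
}
```

For the script `X = read($fileX)` with only "X" bound, the returned set contains "$fileX", which the caller could turn into the error described above.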

3. Please remember that a DataFrame is an unordered collection, whereas the matrix we return is an ordered structure. So please return the DataFrame with an "ID" column, as we do in our current MLOutput class; otherwise we are potentially breaking the contract.

4. Please support the following types of DataFrame:
- With an ID column and one DF column of type double for every matrix column. This is a safe way for the user to pass a DataFrame to SystemML and still be able to do pre-processing.
- Without an ID column, but with one DF column of type double for every matrix column. This is potentially unsafe; the user must ensure that the rows are sorted.
- With an ID column and a DF column of Vector DataType. This is often used in MLPipeline wrappers.
- Without an ID column, but with a DF column of Vector DataType. This is often used in MLPipeline wrappers.
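
To illustrate why the ID column matters in points 3 and 4, here is a small sketch with no Spark dependency. `IdColumnSketch` and `Row` are hypothetical names; the list of rows plays the role of an unordered DataFrame, and the sketch assumes IDs are unique and does not handle gaps:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; a List<Row> stands in for an unordered DataFrame.
class IdColumnSketch {
    static final class Row {
        final long id;          // value of the "ID" column
        final double[] values;  // one double per matrix column
        Row(long id, double... values) { this.id = id; this.values = values; }
    }

    // Rebuilds an ordered matrix from rows that may arrive in any order.
    static double[][] toMatrix(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort((a, b) -> Long.compare(a.id, b.id));
        double[][] matrix = new double[sorted.size()][];
        for (int i = 0; i < sorted.size(); i++) {
            matrix[i] = sorted.get(i).values;
        }
        return matrix;
    }
}
```

Without the `id` field, the only available row order is the arrival order, which is exactly the unsafe case in the second and fourth bullets.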

5. With the exception of DataFrame, all the RDDs that we pass map to the formats we support in read(): RDD<String>/JavaRDD<String>/JavaPairRDD<LongWritable, Text>/... for the csv and text formats, plus RDD<MI, MB>/JavaPairRDD<MI,MB> for binaryblock. For non-read formats, we implement RDDConverterUtils.

Please support all the read formats, either directly or via an abstraction (for example: the proposed BinaryBlockMatrix, which is a wrapper of JavaPairRDD<MI,MB> and MC). In particular, users might prefer to stick with BinaryBlockMatrix if they want to pass it to another DML script, but might want a DataFrame if they want to apply SQL. Why? For extremely wide matrices, DataFrame is an extremely inefficient format.

An alternative suggestion: you could support registering only one type of DataFrame/RDD and provide many constructors/factory methods for it. For example, please see org.apache.sysml.api.MLMatrix (a reference implementation of BinaryBlockMatrix), which is essentially a two-column DataFrame that supports simple matrix algebra. It also fits well into the Spark Datasource API: ml.read(sqlContext, "W_small.mtx", "binary").

[~reinwald] [~mboehm7] [~mwdusenb@us.ibm.com]

> MLContext Redesign
> ------------------
>
>                 Key: SYSTEMML-593
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-593
>             Project: SystemML
>          Issue Type: Improvement
>          Components: APIs
>            Reporter: Deron Eriksson
>            Assignee: Deron Eriksson
>         Attachments: Design Document - MLContext API Redesign.pdf
>
>
> This JIRA proposes a redesign of the Java MLContext API with several goals:
> •	Simplify the user experience
> •	Encapsulate primary entities using object-oriented concepts
> •	Make API extensible for external users
> •	Make API extensible for SystemML developers
> •	Locate all user-interaction classes, interfaces, etc under a single API package
> •	Extensive Javadocs for all classes in the API
> •	Potentially fold JMLC API into MLContext so as to have a single programmatic API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)