Posted to issues@systemml.apache.org by "Deron Eriksson (JIRA)" <ji...@apache.org> on 2017/09/22 18:02:00 UTC

[jira] [Commented] (SYSTEMML-493) Modularize Existing DML Algorithms

    [ https://issues.apache.org/jira/browse/SYSTEMML-493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16176824#comment-16176824 ] 

Deron Eriksson commented on SYSTEMML-493:
-----------------------------------------

PCA functionalized by [PR653|https://github.com/apache/systemml/pull/653].

> Modularize Existing DML Algorithms
> ----------------------------------
>
>                 Key: SYSTEMML-493
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-493
>             Project: SystemML
>          Issue Type: Epic
>          Components: Algorithms
>            Reporter: Mike Dusenberry
>
> Currently, our provided DML algorithms come in the form of single, long scripts that contain the read and write statements, are usually not broken up into modular UDFs, and require the user to supply all arguments via the command line or bash scripts.  As a high-level example:
> {code}
> // read statements, parameter parsing, etc.
> X = read(...)
> hyperparam1 = $1
> anotherHyperparam = $2
> ...
> // core part of the algorithm
> // note: this is not wrapped in a UDF; it is just a continuation of the script
> while(!converged) {
>  // do stuff
> }
> // outputs, test results, stats, etc
> write(...)
> print(...)
> {code}
> The issue here is that many ML algorithms require hyperparameter tuning, and are part of a general data flow (data ingestion, cleaning, splitting, etc.).  Due to this, it would be ideal if our algorithm scripts were modularized so that the core parts of the algorithms were wrapped in UDFs (i.e. {{train(...)}}, {{test(...)}}, etc.).  Then, rather than having to perform these additional steps from a bash script, a user could instead import our algorithm scripts from DML, and make calls to the UDFs as necessary.  As an example of the modification to our scripts:
> {code}
> // read statements, parameter parsing, etc.
> X = read(...)
> hyperparam1 = $1
> anotherHyperparam = $2
> ...
> // core part of the algorithm
> // note: this is wrapped in a UDF, thus allowing the user to import it and supply arguments from another DML script if desired
> train = function (matrix[double] X, double hyperparam1, double hyperparam2) return (matrix[double] model) {
>     while(!converged) {
>      // do stuff
>     }
> }
> // when run as a script, this will invoke the `train(...)` function, thus achieving the same result as the previous script design
> model = train(X, hyperparam1, anotherHyperparam)
> // outputs, test results, stats, etc
> write(...)
> print(...)
> {code}
> By modularizing the core parts of the algorithms into UDFs while still keeping the surrounding read/write statements, our provided scripts can be executed as standalone scripts in the (currently) normal fashion, while also being importable from other DML scripts so that the UDFs can be called directly.  As an example of a custom DML workflow script:
> {code}
> // import
> source("LinearReg.dml") as lr
> // ingest data
> X_dirty = read(...)
> // clean data
> X = ...
> // split
> X_train = ...
> X_val = ...
> X_test = ...
> // hyperparameter tuning
> while(tuning) {
>     hyperparam1 = ...
>     hyperparam2 = ...
>     model = lr::train(X_train, hyperparam1, hyperparam2)
>     error = lr::test(X_val, ...)
>     ...
> }
> // use best hyperparameters
> ...
> // save model
> write(model)
> {code}
> This change could be applied to all of our provided DML algorithms, and many could be broken up into {{train(...)}}, {{test(...)}}, {{stats(...)}}, etc. functions.  The goal here is to promote the use of DML for the entire ML pipeline (i.e. the way Python, R, Scala, etc. are currently being used), rather than encouraging the use of cumbersome bash scripts.
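> As an illustrative sketch of what such a companion UDF might look like (the function name, signature, and error metric here are hypothetical, not a final API), a {{test(...)}} function for a linear model could compute a validation error alongside {{train(...)}} in the same script:
> {code}
> // hypothetical companion UDF: mean-squared error of a linear model on held-out data
> test = function (matrix[double] X, matrix[double] y, matrix[double] model) return (double error) {
>     y_pred = X %*% model
>     error = sum((y - y_pred)^2) / nrow(X)
> }
> {code}
> A workflow script that sourced the algorithm could then call {{lr::test(X_val, y_val, model)}} inside its tuning loop and keep the hyperparameters with the lowest returned error.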



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)