Posted to dev@systemml.apache.org by Nakul Jindal <na...@gmail.com> on 2016/04/12 11:10:53 UTC

Re: DML in Zeppelin

Hi All,

Niketan, this feedback is much appreciated and I will continue to work on
this. In the meantime, some of the other (offline) feedback I got for this
included making DML variables accessible across DML cells. Towards that
end, I've made some improvements to the Zeppelin-DML integration. There is
also a convenient (albeit large, ~2 GB) Docker image to test this out with.

All the information is on the JIRA:
https://issues.apache.org/jira/browse/SYSTEMML-542
It has screenshots, Docker instructions and steps to recreate the dev
environment to play with.

These are the features (thus far):

Launch a standalone DML cell which runs the DML interpreter locally (using
%dml)
- This has rudimentary features and will be developed if there is demand

Launch a DML cell which runs on Spark (using %spark.dml)
- Transfer data between Spark, PySpark, etc. and DML cells (as DataFrames)
      -- Read data in a Spark cell (as a DataFrame) and use it in a DML cell
      -- Write a DML matrix in a DML cell and read it as a DataFrame in a
Spark Cell
      -- This is done using ZeppelinContext (
https://zeppelin.incubator.apache.org/docs/latest/interpreter/spark.html)
- Transfer data between DML cells - scalar types (booleans, strings,
floats, integers) and matrices
      -- Any variable defined in a cell can be used (read from/written to)
in subsequent cells.
      -- This is very similar to how spark cells operate.
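
As a rough illustration of the put/get style of exchange described above,
here is a toy Python model. It is not Zeppelin or SystemML code: ToyContext,
spark_cell and dml_cell are hypothetical names made up for this sketch, and
a plain nested list stands in for a DataFrame or DML matrix. It only shows
the idea of one cell publishing a named value and a later cell reading it.

```python
# Toy model of cross-cell data sharing; NOT the Zeppelin or SystemML API.
# ToyContext, spark_cell and dml_cell are hypothetical names for illustration.


class ToyContext:
    """Stand-in for the shared object that Zeppelin exposes as `z`."""

    def __init__(self):
        self._store = {}

    def put(self, name, value):
        # Publish a value under a name so that later cells can read it.
        self._store[name] = value

    def get(self, name):
        # Look up a value published by an earlier cell.
        return self._store[name]


def spark_cell(z):
    # Plays the role of a Spark cell: produce some data and publish it.
    z.put("X", [[1.0, 2.0], [3.0, 4.0]])


def dml_cell(z):
    # Plays the role of a DML cell: read the matrix, compute, publish a result.
    X = z.get("X")
    z.put("total", sum(sum(row) for row in X))


z = ToyContext()
spark_cell(z)   # "cell 1" writes X
dml_cell(z)     # "cell 2" reads X and writes total
total = z.get("total")
print(total)
```

The cells never reference each other directly; they only agree on the names
("X", "total") stored in the shared context, which is why the order in which
cells are run matters.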


Any feedback is greatly appreciated.

Thanks,
Nakul Jindal



On Tue, Mar 8, 2016 at 10:30 AM, Niketan Pansare <np...@us.ibm.com> wrote:

> Hi Nakul,
>
> This is good work!
>
> My 2 cents, we should add missing features (such as command-line
> arguments), document the API for this POC, come up with examples for
> existing algorithms with open-source datasets and put them in
> https://github.com/apache/incubator-systemml/tree/master/samples/zeppelin-notebooks
>
> This way, people are encouraged to try out (and maybe even modify on the
> fly) the existing DML algorithms with specific datasets. Borrowing an
> example from
> http://scikit-learn.org/stable/tutorial/basic/tutorial.html:
> >>> from sklearn import datasets
> >>> iris = datasets.load_iris()
> >>> digits = datasets.load_digits()
> >>> from sklearn import svm
> >>> clf = svm.SVC(gamma=0.001, C=100.)
> >>> clf.fit(digits.data[:-1], digits.target[:-1])
> >>> clf.predict(digits.data[-1:])
>
> We can then put a link to the given example in
> http://apache.github.io/incubator-systemml/algorithms-classification.html#support-vector-machines
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> From: Nakul Jindal <na...@gmail.com>
> To: dev@systemml.incubator.apache.org
> Date: 03/06/2016 07:22 PM
> Subject: DML in Zeppelin
> ------------------------------
>
>
>
> Hi,
>
> I've put together a proof of concept for having DML be a first class
> citizen in Apache Zeppelin.
>
> Brief intro to Zeppelin -
> Zeppelin is a "notebook" interface to interact with Spark, Cassandra, Hive
> and other projects. It can be thought of as a REPL in a browser.
> Small units of code are put into "cells". These individual cells can then
> be run interactively. There is also support for queuing up and running
> cells in parallel.
> Cells are contained in notebooks. Notebooks can be exported and are
> persistent between sessions.
>
> One can type code in (Scala) Spark in cell 1 and save a DataFrame object.
> One can then type code in PySpark in cell 2 and access the previously saved
> DataFrame.
> This is done by the Zeppelin runtime system by injecting a special variable
> called "z" into the Spark and PySpark environments in Zeppelin. This "z" is
> an object of type ZeppelinContext and makes available a "get" and a "put"
> method.
> DML in Spark mode can now access this feature as well.
>
> In this POC, DML can operate in 2 modes - standalone and spark.
>
> Screenshots of it working:
> http://imgur.com/a/m7ASx
>
> GIF of the screenshots:
> http://i.imgur.com/NttMuKC.gifv
>
> Instructions:
> https://gist.github.com/anonymous/6ab8c569b2360232e252
>
> JIRA:
> https://issues.apache.org/jira/browse/SYSTEMML-542
>
>
> Nakul Jindal
>
>
>
>