Posted to commits@systemml.apache.org by du...@apache.org on 2016/09/12 19:46:36 UTC
incubator-systemml git commit: [DOCS] Adding a Python example using the new MLContext.
Repository: incubator-systemml
Updated Branches:
refs/heads/master f463e5f46 -> adc4a5b6f
[DOCS] Adding a Python example using the new MLContext.
Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/adc4a5b6
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/adc4a5b6
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/adc4a5b6
Branch: refs/heads/master
Commit: adc4a5b6fded15fe7b2dcfb5f5ba6494ee41761d
Parents: f463e5f
Author: Mike Dusenberry <mw...@us.ibm.com>
Authored: Mon Sep 12 12:37:52 2016 -0700
Committer: Mike Dusenberry <mw...@us.ibm.com>
Committed: Mon Sep 12 12:37:52 2016 -0700
----------------------------------------------------------------------
docs/spark-mlcontext-programming-guide.md | 156 +++++++++++++++++++++++++
1 file changed, 156 insertions(+)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/adc4a5b6/docs/spark-mlcontext-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/spark-mlcontext-programming-guide.md b/docs/spark-mlcontext-programming-guide.md
index bed05ad..c7b2bb6 100644
--- a/docs/spark-mlcontext-programming-guide.md
+++ b/docs/spark-mlcontext-programming-guide.md
@@ -1640,6 +1640,160 @@ scala> for (i <- 1 to 5) {
</div>
+---
+
+# Jupyter (PySpark) Notebook Example - Poisson Nonnegative Matrix Factorization
+
+As with the Scala API, SystemML also provides a Python MLContext API. In addition to the
+regular `SystemML.jar` file, you'll need to install the Python API as follows:
+
+ * Latest release:
+ * Python 2:
+
+ ```
+ pip install systemml
+ # Bleeding edge: pip install git+git://github.com/apache/incubator-systemml.git#subdirectory=src/main/python
+ ```
+
+ * Python 3:
+
+ ```
+ pip3 install systemml
+ # Bleeding edge: pip3 install git+git://github.com/apache/incubator-systemml.git#subdirectory=src/main/python
+ ```
+ * Don't forget to download the `SystemML.jar` file, which can be found in the latest
+ release or in a nightly build.
+
+Here, we'll explore the use of SystemML via PySpark in a [Jupyter notebook](http://jupyter.org/).
+This example notebook can be viewed in rendered form
+[on GitHub](https://github.com/apache/incubator-systemml/blob/master/samples/jupyter-notebooks/SystemML-PySpark-Recommendation-Demo.ipynb),
+and can be [downloaded here](https://raw.githubusercontent.com/apache/incubator-systemml/master/samples/jupyter-notebooks/SystemML-PySpark-Recommendation-Demo.ipynb) to a directory of your choice.
+
+From the directory with the downloaded notebook, start Jupyter with PySpark:
+
+ * Python 2:
+
+ ```
+ PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master local[*] --driver-class-path SystemML.jar --jars SystemML.jar
+ ```
+
+ * Python 3:
+
+ ```
+ PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master local[*] --driver-class-path SystemML.jar --jars SystemML.jar
+ ```
+
+This will open Jupyter in a browser:
+
+![Jupyter Notebook](img/spark-mlcontext-programming-guide/jupyter1.png "Jupyter Notebook")
+
+We can then open up the `SystemML-PySpark-Recommendation-Demo` notebook.
+
+## Set up the notebook and download the data
+
+{% highlight python %}
+%load_ext autoreload
+%autoreload 2
+%matplotlib inline
+
+import numpy as np
+import matplotlib.pyplot as plt
+from systemml import MLContext, dml  # pip install systemml
+plt.rcParams['figure.figsize'] = (10, 6)
+{% endhighlight %}
+
+{% highlight python %}
+%%sh
+# Download dataset
+curl -O http://snap.stanford.edu/data/amazon0601.txt.gz
+gunzip amazon0601.txt.gz
+{% endhighlight %}
+
+## Use PySpark to load the data in as a Spark DataFrame
+
+{% highlight python %}
+# Load data
+import pyspark.sql.functions as F
+dataPath = "amazon0601.txt"
+
+X_train = (sc.textFile(dataPath)
+ .filter(lambda l: not l.startswith("#"))
+ .map(lambda l: l.split("\t"))
+ .map(lambda prods: (int(prods[0]), int(prods[1]), 1.0))
+ .toDF(("prod_i", "prod_j", "x_ij"))
+ .filter("prod_i < 500 AND prod_j < 500") # Filter for memory constraints
+ .cache())
+
+max_prod_i = X_train.select(F.max("prod_i")).first()[0]
+max_prod_j = X_train.select(F.max("prod_j")).first()[0]
+numProducts = max(max_prod_i, max_prod_j) + 1 # 0-based indexing
+print("Total number of products: {}".format(numProducts))
+{% endhighlight %}
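The parsing logic in the pipeline above can be illustrated without Spark on a few made-up lines (hypothetical sample data; the real SNAP file uses the same `#`-prefixed comments and tab-separated node IDs):

```python
# Miniature, Spark-free illustration of the same filter/map transformations.
lines = [
    "# FromNodeId\tToNodeId",  # comment line, filtered out
    "0\t1",
    "0\t2",
    "2\t3",
]

# Skip comments, split on tabs, and build (prod_i, prod_j, x_ij) triples.
triples = [
    (int(a), int(b), 1.0)
    for a, b in (l.split("\t") for l in lines if not l.startswith("#"))
]

# Largest ID seen in either column, plus one for 0-based indexing.
num_products = max(max(i, j) for i, j, _ in triples) + 1

print(triples)        # [(0, 1, 1.0), (0, 2, 1.0), (2, 3, 1.0)]
print(num_products)   # 4
```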
+
+## Create a SystemML MLContext object
+
+{% highlight python %}
+# Create SystemML MLContext
+ml = MLContext(sc)
+{% endhighlight %}
+
+## Define a kernel for Poisson nonnegative matrix factorization (PNMF) in DML
+
+{% highlight python %}
+# Define PNMF kernel in SystemML's DSL using the R-like syntax for PNMF
+pnmf = """
+# data & args
+X = X+1 # change product IDs to be 1-based, rather than 0-based
+V = table(X[,1], X[,2])
+size = ifdef($size, -1)
+if(size > -1) {
+ V = V[1:size,1:size]
+}
+
+n = nrow(V)
+m = ncol(V)
+range = 0.01
+W = Rand(rows=n, cols=rank, min=0, max=range, pdf="uniform")
+H = Rand(rows=rank, cols=m, min=0, max=range, pdf="uniform")
+losses = matrix(0, rows=max_iter, cols=1)
+
+# run PNMF
+i = 1
+while(i <= max_iter) {
+ # update params
+ H = (H * (t(W) %*% (V/(W%*%H))))/t(colSums(W))
+ W = (W * ((V/(W%*%H)) %*% t(H)))/t(rowSums(H))
+
+ # compute loss
+ losses[i,] = -1 * (sum(V*log(W%*%H)) - as.scalar(colSums(W)%*%rowSums(H)))
+ i = i + 1;
+}
+"""
+{% endhighlight %}
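For readers more comfortable with NumPy than DML, the multiplicative updates and loss above can be sketched in plain Python. This is a minimal standalone sketch on small random data, not the code the guide runs; the shapes, rank, and iteration count are arbitrary choices for illustration:

```python
import numpy as np

# Small synthetic nonnegative count matrix standing in for V = table(X[,1], X[,2]).
rng = np.random.RandomState(42)
V = rng.poisson(lam=1.0, size=(20, 15)).astype(float)
rank, max_iter = 5, 30

# Uniform random initialization in [0, 0.01), mirroring Rand(..., pdf="uniform").
W = rng.uniform(0, 0.01, size=(V.shape[0], rank))
H = rng.uniform(0, 0.01, size=(rank, V.shape[1]))
losses = []

for _ in range(max_iter):
    # H = (H * (t(W) %*% (V/(W%*%H)))) / t(colSums(W)): divide row k of the
    # numerator by the k-th column sum of W.
    H *= (W.T @ (V / (W @ H))) / W.sum(axis=0)[:, None]
    # W = (W * ((V/(W%*%H)) %*% t(H))) / t(rowSums(H)): divide column k of the
    # numerator by the k-th row sum of H.
    W *= ((V / (W @ H)) @ H.T) / H.sum(axis=1)
    # Negative Poisson log-likelihood (up to a constant):
    # -(sum(V * log(WH)) - colSums(W) %*% rowSums(H)).
    losses.append(-(np.sum(V * np.log(W @ H)) - W.sum(axis=0) @ H.sum(axis=1)))
```

These are the standard multiplicative updates for the Poisson (KL-divergence) NMF objective, so the recorded loss is non-increasing across iterations.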
+
+## Execute the algorithm
+
+{% highlight python %}
+# Run the PNMF script on SystemML with Spark
+script = dml(pnmf).input(X=X_train, max_iter=100, rank=10).output("W", "H", "losses")
+W, H, losses = ml.execute(script).get("W", "H", "losses")
+{% endhighlight %}
+
+## Retrieve the losses during training and plot them
+
+{% highlight python %}
+# Plot training loss over time
+xy = losses.toDF().sort("__INDEX").map(lambda r: (r[0], r[1])).collect()
+x, y = zip(*xy)
+plt.plot(x, y)
+plt.xlabel('Iteration')
+plt.ylabel('Loss')
+plt.title('PNMF Training Loss')
+{% endhighlight %}
+
+![Jupyter Loss Graph](img/spark-mlcontext-programming-guide/jupyter_loss_graph.png "Jupyter Loss Graph")
+
+---
# Spark Shell Example - OLD API
@@ -2683,6 +2837,8 @@ plt.title('PNMF Training Loss')
![Jupyter Loss Graph](img/spark-mlcontext-programming-guide/jupyter_loss_graph.png "Jupyter Loss Graph")
+---
+
# Recommended Spark Configuration Settings
For best performance, we recommend setting the following flags when running SystemML with Spark: