Posted to commits@systemml.apache.org by de...@apache.org on 2017/04/07 18:58:05 UTC

[01/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1116] Make SystemML Python DSL NumPy-friendly

Repository: incubator-systemml
Updated Branches:
  refs/heads/gh-pages [created] b91f9bfec


[SYSTEMML-1116] Make SystemML Python DSL NumPy-friendly

1. Added Python test cases for matrix.
2. Added web documentation for all the Python APIs.
3. Added set_lazy method to enable and disable lazy evaluation.
4. The matrix class itself supports almost all basic linear algebra operators
supported by DML.
5. Updated SystemML.jar to *-incubating.jar
6. Added Maven cleanup logic for Python artifacts.
7. Integrated Python test cases with Maven (see
org.apache.sysml.test.integration.functions.python.PythonTestRunner). This
requires SPARK_HOME to be set.

Closes #290.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/313b1db8
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/313b1db8
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/313b1db8

Branch: refs/heads/gh-pages
Commit: 313b1db8d869ffbff24802afcee1f5287024af2f
Parents: f0bbde3
Author: Niketan Pansare <np...@us.ibm.com>
Authored: Fri Dec 2 16:21:13 2016 -0800
Committer: Niketan Pansare <np...@us.ibm.com>
Committed: Fri Dec 2 16:25:43 2016 -0800

----------------------------------------------------------------------
 _layouts/global.html      |   1 +
 beginners-guide-python.md |  28 +-
 devdocs/python_api.html   |  40 +-
 index.md                  |   2 +
 python-reference.md       | 953 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 986 insertions(+), 38 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/313b1db8/_layouts/global.html
----------------------------------------------------------------------
diff --git a/_layouts/global.html b/_layouts/global.html
index 516c7b4..f7cb969 100644
--- a/_layouts/global.html
+++ b/_layouts/global.html
@@ -57,6 +57,7 @@
                                 <li><a href="dml-language-reference.html">DML Language Reference</a></li>
                                 <li><a href="beginners-guide-to-dml-and-pydml.html">Beginner's Guide to DML and PyDML</a></li>
                                 <li><a href="beginners-guide-python.html">Beginner's Guide for Python users</a></li>
+                                <li><a href="python-reference.html">Reference Guide for Python users</a></li>
                                 <li class="divider"></li>
                                 <li><b>ML Algorithms:</b></li>
                                 <li><a href="algorithms-reference.html">Algorithms Reference</a></li>

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/313b1db8/beginners-guide-python.md
----------------------------------------------------------------------
diff --git a/beginners-guide-python.md b/beginners-guide-python.md
index 8d597bf..d0598aa 100644
--- a/beginners-guide-python.md
+++ b/beginners-guide-python.md
@@ -46,7 +46,7 @@ Before you get started on SystemML, make sure that your environment is set up an
 
 ### Install Java (need Java 8) and Apache Spark
 
-If you already have a Apache Spark installation, you can skip this step.
+If you already have an Apache Spark installation, you can skip this step.
   
 <div class="codetabs">
 <div data-lang="OSX" markdown="1">
@@ -70,19 +70,18 @@ brew install apache-spark16
 
 ### Install SystemML
 
-#### Step 1: Install SystemML Python package 
-
 We are working towards uploading the Python package to PyPI. Until then, please use the following commands:
 
 ```bash
 git clone https://github.com/apache/incubator-systemml.git
 cd incubator-systemml
 mvn post-integration-test -P distribution -DskipTests
-pip install src/main/python/dist/systemml-incubating-0.11.0.dev1.tar.gz
+pip install src/main/python/dist/systemml-incubating-0.12.0.dev1.tar.gz
 ```
 
 The above commands will install the Python package and place the corresponding Java binaries (along with algorithms) into the installed location.
 To find the location of the downloaded Java binaries, use the following command:
+
 ```bash
 python -c 'import imp; import os; print os.path.join(imp.find_module("systemml")[1], "systemml-java")'
 ```
@@ -92,24 +91,16 @@ or download them from [SystemML website](http://systemml.apache.org/download.htm
 or build them from the [source](https://github.com/apache/incubator-systemml).
 
 To uninstall SystemML, please use the following command:
+
 ```bash
 pip uninstall systemml-incubating
 ```
 
 ### Start Pyspark shell
 
-<div class="codetabs">
-<div data-lang="OSX" markdown="1">
-```bash
-pyspark --master local[*]
-```
-</div>
-<div data-lang="Linux" markdown="1">
 ```bash
 pyspark --master local[*]
 ```
-</div>
-</div>
 
 ## Matrix operations
 
@@ -122,7 +113,7 @@ m1 = sml.matrix(np.ones((3,3)) + 2)
 m2 = sml.matrix(np.ones((3,3)) + 3)
 m2 = m1 * (m2 + m1)
 m4 = 1.0 - m2
-m4.sum(axis=1).toNumPyArray()
+m4.sum(axis=1).toNumPy()
 ```
 
 Output:
@@ -156,7 +147,7 @@ X = sml.matrix(X_train)
 y = sml.matrix(y_train)
 A = X.transpose().dot(X)
 b = X.transpose().dot(y)
-beta = sml.solve(A, b).toNumPyArray()
+beta = sml.solve(A, b).toNumPy()
 y_predicted = X_test.dot(beta)
 print('Residual sum of squares: %.2f' % np.mean((y_predicted - y_test) ** 2)) 
 ```
@@ -333,7 +324,7 @@ from sklearn import datasets, neighbors
 from pyspark.sql import DataFrame, SQLContext
 import systemml as sml
 import pandas as pd
-import os
+import os, imp
 sqlCtx = SQLContext(sc)
 digits = datasets.load_digits()
 X_digits = digits.data
@@ -343,7 +334,8 @@ n_samples = len(X_digits)
 X_df = sqlCtx.createDataFrame(pd.DataFrame(X_digits[:.9 * n_samples]))
 y_df = sqlCtx.createDataFrame(pd.DataFrame(y_digits[:.9 * n_samples]))
 ml = sml.MLContext(sc)
-script = os.path.join(os.environ['SYSTEMML_HOME'], 'scripts', 'algorithms', 'MultiLogReg.dml')
-script = sml.dml(script).input(X=X_df, Y_vec=y_df).output("B_out")
+# Get the path of MultiLogReg.dml
+scriptPath = os.path.join(imp.find_module("systemml")[1], 'systemml-java', 'scripts', 'algorithms', 'MultiLogReg.dml')
+script = sml.dml(scriptPath).input(X=X_df, Y_vec=y_df).output("B_out")
 beta = ml.execute(script).get('B_out').toNumPy()
 ```

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/313b1db8/devdocs/python_api.html
----------------------------------------------------------------------
diff --git a/devdocs/python_api.html b/devdocs/python_api.html
index 41a8e3e..93ec624 100644
--- a/devdocs/python_api.html
+++ b/devdocs/python_api.html
@@ -391,7 +391,7 @@ sparsity: Sparsity (between 0.0 and 1.0).</p>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">sml</span><span class="o">.</span><span class="n">setSparkContext</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">systemml</span> <span class="k">import</span> <span class="n">random</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span> <span class="o">=</span> <span class="n">sml</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">))</span>
-<span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span><span class="o">.</span><span class="n">toNumPyArray</span><span class="p">()</span>
+<span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span><span class="o">.</span><span class="n">toNumPy</span><span class="p">()</span>
 <span class="go">array([[ 3.48857226,  6.17261819,  2.51167259],</span>
 <span class="go">       [ 3.60506708, -1.90266305,  3.97601633],</span>
 <span class="go">       [ 3.62245706,  5.9430881 ,  2.53070413]])</span>
@@ -412,7 +412,7 @@ sparsity: Sparsity (between 0.0 and 1.0).</p>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">sml</span><span class="o">.</span><span class="n">setSparkContext</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">systemml</span> <span class="k">import</span> <span class="n">random</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span> <span class="o">=</span> <span class="n">sml</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">))</span>
-<span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span><span class="o">.</span><span class="n">toNumPyArray</span><span class="p">()</span>
+<span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span><span class="o">.</span><span class="n">toNumPy</span><span class="p">()</span>
 <span class="go">array([[ 0.54511396,  0.11937437,  0.72975775],</span>
 <span class="go">       [ 0.14135946,  0.01944448,  0.52544478],</span>
 <span class="go">       [ 0.67582422,  0.87068849,  0.02766852]])</span>
@@ -432,7 +432,7 @@ sparsity: Sparsity (between 0.0 and 1.0).</p>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">sml</span><span class="o">.</span><span class="n">setSparkContext</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">systemml</span> <span class="k">import</span> <span class="n">random</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span> <span class="o">=</span> <span class="n">sml</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">poisson</span><span class="p">(</span><span class="n">lam</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">))</span>
-<span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span><span class="o">.</span><span class="n">toNumPyArray</span><span class="p">()</span>
+<span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span><span class="o">.</span><span class="n">toNumPy</span><span class="p">()</span>
 <span class="go">array([[ 1.,  0.,  2.],</span>
 <span class="go">       [ 1.,  0.,  0.],</span>
 <span class="go">       [ 0.,  0.,  0.]])</span>
@@ -479,7 +479,7 @@ sparsity: Sparsity (between 0.0 and 1.0).</p>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">sml</span><span class="o">.</span><span class="n">setSparkContext</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">systemml</span> <span class="k">import</span> <span class="n">random</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span> <span class="o">=</span> <span class="n">sml</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">))</span>
-<span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span><span class="o">.</span><span class="n">toNumPyArray</span><span class="p">()</span>
+<span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span><span class="o">.</span><span class="n">toNumPy</span><span class="p">()</span>
 <span class="go">array([[ 3.48857226,  6.17261819,  2.51167259],</span>
 <span class="go">       [ 3.60506708, -1.90266305,  3.97601633],</span>
 <span class="go">       [ 3.62245706,  5.9430881 ,  2.53070413]])</span>
@@ -500,7 +500,7 @@ sparsity: Sparsity (between 0.0 and 1.0).</p>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">sml</span><span class="o">.</span><span class="n">setSparkContext</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">systemml</span> <span class="k">import</span> <span class="n">random</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span> <span class="o">=</span> <span class="n">sml</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">))</span>
-<span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span><span class="o">.</span><span class="n">toNumPyArray</span><span class="p">()</span>
+<span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span><span class="o">.</span><span class="n">toNumPy</span><span class="p">()</span>
 <span class="go">array([[ 0.54511396,  0.11937437,  0.72975775],</span>
 <span class="go">       [ 0.14135946,  0.01944448,  0.52544478],</span>
 <span class="go">       [ 0.67582422,  0.87068849,  0.02766852]])</span>
@@ -520,7 +520,7 @@ sparsity: Sparsity (between 0.0 and 1.0).</p>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">sml</span><span class="o">.</span><span class="n">setSparkContext</span><span class="p">(</span><span class="n">sc</span><span class="p">)</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="kn">from</span> <span class="nn">systemml</span> <span class="k">import</span> <span class="n">random</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span> <span class="o">=</span> <span class="n">sml</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">poisson</span><span class="p">(</span><span class="n">lam</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">))</span>
-<span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span><span class="o">.</span><span class="n">toNumPyArray</span><span class="p">()</span>
+<span class="gp">&gt;&gt;&gt; </span><span class="n">m1</span><span class="o">.</span><span class="n">toNumPy</span><span class="p">()</span>
 <span class="go">array([[ 1.,  0.,  2.],</span>
 <span class="go">       [ 1.,  0.,  0.],</span>
 <span class="go">       [ 0.,  0.,  0.]])</span>
@@ -607,7 +607,7 @@ and Pandas DataFrame).</p>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m2</span> <span class="o">=</span> <span class="n">m1</span> <span class="o">*</span> <span class="p">(</span><span class="n">m2</span> <span class="o">+</span> <span class="n">m1</span><span class="p">)</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m4</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">-</span> <span class="n">m2</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m4</span>
-<span class="go"># This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPyArray() or toDataFrame() or toPandas() methods.</span>
+<span class="go"># This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPy() or toDF() or toPandas() methods.</span>
 <span class="go">mVar1 = load(&quot; &quot;, format=&quot;csv&quot;)</span>
 <span class="go">mVar2 = load(&quot; &quot;, format=&quot;csv&quot;)</span>
 <span class="go">mVar3 = mVar2 + mVar1</span>
@@ -616,9 +616,9 @@ and Pandas DataFrame).</p>
 <span class="go">save(mVar5, &quot; &quot;)</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m2</span><span class="o">.</span><span class="n">eval</span><span class="p">()</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m2</span>
-<span class="go"># This matrix (mVar4) is backed by NumPy array. To fetch the NumPy array, invoke toNumPyArray() method.</span>
+<span class="go"># This matrix (mVar4) is backed by NumPy array. To fetch the NumPy array, invoke toNumPy() method.</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m4</span>
-<span class="go"># This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPyArray() or toDataFrame() or toPandas() methods.</span>
+<span class="go"># This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPy() or toDF() or toPandas() methods.</span>
 <span class="go">mVar4 = load(&quot; &quot;, format=&quot;csv&quot;)</span>
 <span class="go">mVar5 = 1.0 - mVar4</span>
 <span class="go">save(mVar5, &quot; &quot;)</span>
@@ -780,14 +780,14 @@ left-indexed-matrix[index] = value</p>
 <dd></dd></dl>
 
 <dl class="method">
-<dt id="systemml.defmatrix.matrix.toDataFrame">
-<code class="descname">toDataFrame</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/systemml/defmatrix.html#matrix.toDataFrame"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#systemml.defmatrix.matrix.toDataFrame" title="Permalink to this definition">¶</a></dt>
+<dt id="systemml.defmatrix.matrix.toDF">
+<code class="descname">toDF</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/systemml/defmatrix.html#matrix.toDF"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#systemml.defmatrix.matrix.toDF" title="Permalink to this definition">¶</a></dt>
 <dd><p>This is a convenience function that calls the global eval method and then converts the matrix object into DataFrame.</p>
 </dd></dl>
 
 <dl class="method">
-<dt id="systemml.defmatrix.matrix.toNumPyArray">
-<code class="descname">toNumPyArray</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/systemml/defmatrix.html#matrix.toNumPyArray"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#systemml.defmatrix.matrix.toNumPyArray" title="Permalink to this definition">¶</a></dt>
+<dt id="systemml.defmatrix.matrix.toNumPy">
+<code class="descname">toNumPy</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/systemml/defmatrix.html#matrix.toNumPy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#systemml.defmatrix.matrix.toNumPy" title="Permalink to this definition">¶</a></dt>
 <dd><p>This is a convenience function that calls the global eval method and then converts the matrix object into NumPy array.</p>
 </dd></dl>
 
@@ -1282,7 +1282,7 @@ and Pandas DataFrame).</p>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m2</span> <span class="o">=</span> <span class="n">m1</span> <span class="o">*</span> <span class="p">(</span><span class="n">m2</span> <span class="o">+</span> <span class="n">m1</span><span class="p">)</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m4</span> <span class="o">=</span> <span class="mf">1.0</span> <span class="o">-</span> <span class="n">m2</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m4</span>
-<span class="go"># This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPyArray() or toDataFrame() or toPandas() methods.</span>
+<span class="go"># This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPy() or toDF() or toPandas() methods.</span>
 <span class="go">mVar1 = load(&quot; &quot;, format=&quot;csv&quot;)</span>
 <span class="go">mVar2 = load(&quot; &quot;, format=&quot;csv&quot;)</span>
 <span class="go">mVar3 = mVar2 + mVar1</span>
@@ -1291,9 +1291,9 @@ and Pandas DataFrame).</p>
 <span class="go">save(mVar5, &quot; &quot;)</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m2</span><span class="o">.</span><span class="n">eval</span><span class="p">()</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m2</span>
-<span class="go"># This matrix (mVar4) is backed by NumPy array. To fetch the NumPy array, invoke toNumPyArray() method.</span>
+<span class="go"># This matrix (mVar4) is backed by NumPy array. To fetch the NumPy array, invoke toNumPy() method.</span>
 <span class="gp">&gt;&gt;&gt; </span><span class="n">m4</span>
-<span class="go"># This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPyArray() or toDataFrame() or toPandas() methods.</span>
+<span class="go"># This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPy() or toDF() or toPandas() methods.</span>
 <span class="go">mVar4 = load(&quot; &quot;, format=&quot;csv&quot;)</span>
 <span class="go">mVar5 = 1.0 - mVar4</span>
 <span class="go">save(mVar5, &quot; &quot;)</span>
@@ -1455,14 +1455,14 @@ left-indexed-matrix[index] = value</p>
 <dd></dd></dl>
 
 <dl class="method">
-<dt id="systemml.matrix.toDataFrame">
-<code class="descname">toDataFrame</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/systemml/defmatrix.html#matrix.toDataFrame"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#systemml.matrix.toDataFrame" title="Permalink to this definition">¶</a></dt>
+<dt id="systemml.matrix.toDF">
+<code class="descname">toDF</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/systemml/defmatrix.html#matrix.toDF"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#systemml.matrix.toDF" title="Permalink to this definition">¶</a></dt>
 <dd><p>This is a convenience function that calls the global eval method and then converts the matrix object into DataFrame.</p>
 </dd></dl>
 
 <dl class="method">
-<dt id="systemml.matrix.toNumPyArray">
-<code class="descname">toNumPyArray</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/systemml/defmatrix.html#matrix.toNumPyArray"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#systemml.matrix.toNumPyArray" title="Permalink to this definition">¶</a></dt>
+<dt id="systemml.matrix.toNumPy">
+<code class="descname">toNumPy</code><span class="sig-paren">(</span><span class="sig-paren">)</span><a class="reference internal" href="_modules/systemml/defmatrix.html#matrix.toNumPy"><span class="viewcode-link">[source]</span></a><a class="headerlink" href="#systemml.matrix.toNumPy" title="Permalink to this definition">¶</a></dt>
 <dd><p>This is a convenience function that calls the global eval method and then converts the matrix object into NumPy array.</p>
 </dd></dl>
 

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/313b1db8/index.md
----------------------------------------------------------------------
diff --git a/index.md b/index.md
index 3fcece6..6b91654 100644
--- a/index.md
+++ b/index.md
@@ -70,6 +70,8 @@ PyDML is a high-level Python-like declarative language for machine learning.
 An introduction to the basics of DML and PyDML.
 * [Beginner's Guide for Python users](beginners-guide-python) -
 Beginner's Guide for Python users.
+* [Reference Guide for Python users](python-reference) -
+Reference Guide for Python users.
 
 ## ML Algorithms
 

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/313b1db8/python-reference.md
----------------------------------------------------------------------
diff --git a/python-reference.md b/python-reference.md
new file mode 100644
index 0000000..3c2bbc3
--- /dev/null
+++ b/python-reference.md
@@ -0,0 +1,953 @@
+---
+layout: global
+title: Reference Guide for Python users
+description: Reference Guide for Python users
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+<br/>
+
+## Introduction
+
+SystemML enables flexible, scalable machine learning. This flexibility is achieved through the specification of a high-level declarative machine learning language that comes in two flavors, 
+one with an R-like syntax (DML) and one with a Python-like syntax (PyDML).
+
+Algorithm scripts written in DML and PyDML can be run on Hadoop, on Spark, or in Standalone mode. 
+No script modifications are required to change between modes. SystemML automatically performs advanced optimizations 
+based on data and cluster characteristics, so the need to manually tweak algorithms is largely reduced or eliminated.
+To understand more about DML and PyDML, we recommend that you read [Beginner's Guide to DML and PyDML](https://apache.github.io/incubator-systemml/beginners-guide-to-dml-and-pydml.html).
+
+For the convenience of Python users, SystemML exposes several language-level APIs that allow Python users to use SystemML
+and its algorithms without needing to know DML or PyDML. We explain these APIs in the sections below.
+
+## matrix API
+
+The matrix class allows users to perform linear algebra operations in SystemML using a NumPy-like interface.
+This class supports several arithmetic operators (such as +, -, *, /, and ^) and also supports most of NumPy's universal functions (i.e. ufuncs).
+
+The current version of NumPy explicitly disables overriding ufuncs, but this should be enabled in the next release.
+Until then, to test the ufunc support shown below, please use:
+
+```bash
+git clone https://github.com/niketanpansare/numpy.git
+cd numpy
+python setup.py install
+```
+
+This will enable NumPy's functions to invoke the matrix class:
+
+```python
+import systemml as sml
+import numpy as np
+m1 = sml.matrix(np.ones((3,3)) + 2)
+m2 = sml.matrix(np.ones((3,3)) + 3)
+np.add(m1, m2)
+``` 
+
+The matrix class does not support the following ufuncs:
+
+- Complex number related ufunc (for example: `conj`)
+- Hyperbolic/inverse-hyperbolic functions (for example: sinh, arcsinh, cosh, ...)
+- Bitwise operators
+- Xor operator
+- Infinite/NaN-checking (for example: isreal, iscomplex, isfinite, isinf, isnan)
+- Other ufuncs: copysign, nextafter, modf, frexp, trunc.
+
+This class also supports several input/output formats such as NumPy arrays, Pandas DataFrame, SciPy sparse matrix and PySpark DataFrame.
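+
+As a minimal sketch of these converters (the variable names are illustrative), a matrix can be constructed from a Pandas DataFrame or a SciPy sparse matrix and converted back after evaluation:
+
+```python
+import numpy as np
+import pandas as pd
+from scipy.sparse import csr_matrix
+import systemml as sml
+
+# Construct matrices from different supported input formats
+m_from_df = sml.matrix(pd.DataFrame(np.ones((3, 3))))
+m_from_sparse = sml.matrix(csr_matrix(np.eye(3)))
+
+# Convert the (lazily evaluated) result back to a Pandas DataFrame
+(m_from_df + m_from_sparse).toPandas()
+```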
+
+By default, the operations are evaluated lazily to avoid conversion overhead and also to maximize optimization scope.
+To disable lazy evaluation, please use the `set_lazy` method:
+
+```python
+>>> import systemml as sml
+>>> import numpy as np
+>>> m1 = sml.matrix(np.ones((3,3)) + 2)
+
+Welcome to Apache SystemML!
+
+>>> m2 = sml.matrix(np.ones((3,3)) + 3)
+>>> np.add(m1, m2) + m1
+# This matrix (mVar4) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPy() or toDF() or toPandas() methods.
+mVar2 = load(" ", format="csv")
+mVar1 = load(" ", format="csv")
+mVar3 = mVar1 + mVar2
+mVar4 = mVar3 + mVar1
+save(mVar4, " ")
+
+
+>>> sml.set_lazy(False)
+>>> m1 = sml.matrix(np.ones((3,3)) + 2)
+>>> m2 = sml.matrix(np.ones((3,3)) + 3)
+>>> np.add(m1, m2) + m1
+# This matrix (mVar8) is backed by NumPy array. To fetch the NumPy array, invoke toNumPy() method.
+``` 
+
+### Usage
+
+```python
+import systemml as sml
+import numpy as np
+m1 = sml.matrix(np.ones((3,3)) + 2)
+m2 = sml.matrix(np.ones((3,3)) + 3)
+m2 = m1 * (m2 + m1)
+m4 = 1.0 - m2
+m4.sum(axis=1).toNumPy()
+```
+
+Output:
+
+```bash
+array([[-60.],
+       [-60.],
+       [-60.]])
+```
+
+
+### Reference Documentation
+
+ *class*`systemml.defmatrix.matrix`(*data*, *op=None*)
+:   Bases: `object`
+
+    matrix class is a Python wrapper that implements basic matrix
+    operators and matrix functions, as well as converters to common Python
+    types (for example: NumPy arrays, PySpark DataFrame and Pandas
+    DataFrame).
+
+    The operators supported are:
+
+    1.  Arithmetic operators: +, -, *, /, //, %, \** as well as dot
+        (i.e. matrix multiplication)
+    2.  Indexing in the matrix
+    3.  Relational/Boolean operators: \<, \<=, \>, \>=, ==, !=, &, \|
+
+    In addition, the following functions are supported for matrix:
+
+    1.  transpose
+    2.  Aggregation functions: sum, mean, var, sd, max, min, argmin,
+        argmax, cumsum
+    3.  Global statistical built-in functions: exp, log, abs, sqrt,
+        round, floor, ceil, sin, cos, tan, asin, acos, atan, sign, solve
+
+    For all the above functions, we always return a two-dimensional matrix; this is especially relevant for aggregation functions with an axis argument.
+    For example: assuming m1 is a matrix of shape (3, n), NumPy returns a 1d vector of dimension (3,) for the operation m1.sum(axis=1),
+    whereas SystemML returns a 2d matrix of dimension (3, 1).
+    
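+    A minimal sketch contrasting the two conventions (assuming sml and np are imported as in the examples below):
+
+        >>> m1 = sml.matrix(np.ones((3,3)))
+        >>> m1.sum(axis=1).toNumPy().shape   # SystemML: 2d matrix of shape (3, 1)
+        (3, 1)
+        >>> np.ones((3,3)).sum(axis=1).shape  # NumPy: 1d vector of shape (3,)
+        (3,)
+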
+    Note: an evaluated matrix contains a data field computed by eval
+    method as DataFrame or NumPy array.
+
+        >>> import systemml as sml
+        >>> import numpy as np
+        >>> sml.setSparkContext(sc)
+
+    Welcome to Apache SystemML!
+
+        >>> m1 = sml.matrix(np.ones((3,3)) + 2)
+        >>> m2 = sml.matrix(np.ones((3,3)) + 3)
+        >>> m2 = m1 * (m2 + m1)
+        >>> m4 = 1.0 - m2
+        >>> m4
+        # This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPy() or toDF() or toPandas() methods.
+        mVar1 = load(" ", format="csv")
+        mVar2 = load(" ", format="csv")
+        mVar3 = mVar2 + mVar1
+        mVar4 = mVar1 * mVar3
+        mVar5 = 1.0 - mVar4
+        save(mVar5, " ")
+        >>> m2.eval()
+        >>> m2
+        # This matrix (mVar4) is backed by NumPy array. To fetch the NumPy array, invoke toNumPy() method.
+        >>> m4
+        # This matrix (mVar5) is backed by below given PyDML script (which is not yet evaluated). To fetch the data of this matrix, invoke toNumPy() or toDF() or toPandas() methods.
+        mVar4 = load(" ", format="csv")
+        mVar5 = 1.0 - mVar4
+        save(mVar5, " ")
+        >>> m4.sum(axis=1).toNumPy()
+        array([[-60.],
+               [-60.],
+               [-60.]])
+
+    Design Decisions:
+
+    1.  Until the eval() method is invoked, we create an AST (not exposed to
+        the user) that consists of unevaluated operations and the data
+        required by those operations. As an analogy, a Spark user can
+        treat the eval() method as similar to calling RDD.persist() followed by
+        RDD.count().
+    2.  The AST consists of two kinds of nodes: either of type matrix or
+        of type DMLOp. Both these classes expose a \_visit method that
+        helps in traversing the AST in a DFS manner.
+    3.  A matrix object can either be evaluated or not. If evaluated,
+        the attribute 'data' is set to one of the supported types (for
+        example: NumPy array or DataFrame) and the attribute 'op' is set
+        to None. If not evaluated, the attribute 'op' refers to one of
+        the intermediate nodes of the AST and is of type DMLOp, and the
+        attribute 'data' is set to None.
+
+    4.  DMLOp has an attribute 'inputs' which contains a list of matrix
+        objects or DMLOps.
+
+    5.  To simplify the traversal, every matrix object is considered
+        immutable and any matrix operation creates a new matrix object.
+        As an example: m1 = sml.matrix(np.ones((3,3))) creates a matrix
+        object backed by 'data=(np.ones((3,3))'. m1 = m1 \* 2 will
+        create a new matrix object which is now backed by 'op=DMLOp( ...
+        )' whose input is the earlier created matrix object.
+
+    6.  Left indexing (implemented in the \_\_setitem\_\_ method) is a
+        special case, where Python expects the existing object to be
+        mutated. To ensure this property, we make a deep copy of the
+        existing object and point any references to the left-indexed
+        matrix to the newly created object. Then the left-indexed matrix
+        is set to be backed by a DMLOp consisting of the following PyDML
+        (see the sketch after this list):
+        left-indexed-matrix = new-deep-copied-matrix
+        left-indexed-matrix[index] = value
+
+    7.  Please use m.print\_ast() and/or type m for debugging. Here is a
+        sample session:
+
+            >>> npm = np.ones((3,3))
+            >>> m1 = sml.matrix(npm + 3)
+            >>> m2 = sml.matrix(npm + 5)
+            >>> m3 = m1 + m2
+            >>> m3
+            mVar2 = load(" ", format="csv")
+            mVar1 = load(" ", format="csv")
+            mVar3 = mVar1 + mVar2
+            save(mVar3, " ")
+            >>> m3.print_ast()
+            - [mVar3] (op).
+              - [mVar1] (data).
+              - [mVar2] (data).    
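+
+    As a minimal sketch of the left-indexing behavior described above (assuming NumPy-style [row, col] indexing; sml and np imported as before):
+
+        >>> m1 = sml.matrix(np.ones((3,3)))
+        >>> m1[0, 1] = 5.0   # left indexing: m1 is updated via a deep copy under the hood
+        >>> m1.toNumPy()     # cell (0, 1) is now 5.0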
+
+ `abs`()
+:   
+
+ `acos`()
+:   
+
+ `arccos`()
+:   
+
+ `arcsin`()
+:   
+
+ `arctan`()
+:   
+
+ `argmax`(*axis=None*)
+:   Returns the indices of the maximum values along an axis.
+
+    axis : int, optional (only axis=1, i.e. rowIndexMax is supported
+    in this version)
+
+ `argmin`(*axis=None*)
+:   Returns the indices of the minimum values along an axis.
+
+    axis : int, optional (only axis=1, i.e. rowIndexMin is supported
+    in this version)
+
+ `asfptype`()
+:   
+
+ `asin`()
+:   
+
+ `astype`(*t*)
+:   
+
+ `atan`()
+:   
+
+ `ceil`()
+:   
+
+ `cos`()
+:   
+
+ `cumsum`(*axis=None*)
+:   Returns the cumulative sum of the elements along an axis.
+
+    axis : int, optional (only axis=0, i.e. cumsum along the rows is
+    supported in this version)
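+
+    A minimal sketch (assuming sml and np are imported as in the earlier examples):
+
+        >>> m1 = sml.matrix(np.ones((3,3)))
+        >>> m1.cumsum(axis=0).toNumPy()   # cumulative sum down the rows, returned as a 2d (3, 3) matrix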
+
+ `deg2rad`()
+:   Convert angles from degrees to radians.
+
+ `dot`(*other*)[](#systemml.defmatrix.matrix.dot "Permalink to this definition")
+:   Numpy way of performing matrix multiplication
+
+ `eval`(*outputDF=False*)[](#systemml.defmatrix.matrix.eval "Permalink to this definition")
+:   This is a convenience function that calls the global eval method
+
+ `exp`()[](#systemml.defmatrix.matrix.exp "Permalink to this definition")
+:   
+
+ `exp2`()[](#systemml.defmatrix.matrix.exp2 "Permalink to this definition")
+:   
+
+ `expm1`()[](#systemml.defmatrix.matrix.expm1 "Permalink to this definition")
+:   
+
+ `floor`()[](#systemml.defmatrix.matrix.floor "Permalink to this definition")
+:   
+
+ `get_shape`()[](#systemml.defmatrix.matrix.get_shape "Permalink to this definition")
+:   
+
+ `ldexp`(*other*)[](#systemml.defmatrix.matrix.ldexp "Permalink to this definition")
+:   
+
+ `log`(*y=None*)[](#systemml.defmatrix.matrix.log "Permalink to this definition")
+:   
+
+ `log10`()[](#systemml.defmatrix.matrix.log10 "Permalink to this definition")
+:   
+
+ `log1p`()[](#systemml.defmatrix.matrix.log1p "Permalink to this definition")
+:   
+
+ `log2`()[](#systemml.defmatrix.matrix.log2 "Permalink to this definition")
+:   
+
+ `logaddexp`(*other*)[](#systemml.defmatrix.matrix.logaddexp "Permalink to this definition")
+:   
+
+ `logaddexp2`(*other*)[](#systemml.defmatrix.matrix.logaddexp2 "Permalink to this definition")
+:   
+
+ `logical_not`()[](#systemml.defmatrix.matrix.logical_not "Permalink to this definition")
+:   
+
+ `max`(*other=None*, *axis=None*)[](#systemml.defmatrix.matrix.max "Permalink to this definition")
+:   Compute the maximum value along the specified axis
+
+    other: matrix or numpy array (& other supported types) or scalar
+    axis : int, optional
+
+ `mean`(*axis=None*)[](#systemml.defmatrix.matrix.mean "Permalink to this definition")
+:   Compute the arithmetic mean along the specified axis
+
+    axis : int, optional
+
+ `min`(*other=None*, *axis=None*)[](#systemml.defmatrix.matrix.min "Permalink to this definition")
+:   Compute the minimum value along the specified axis
+
+    other: matrix or numpy array (& other supported types) or scalar
+    axis : int, optional
+
+ `mod`(*other*)[](#systemml.defmatrix.matrix.mod "Permalink to this definition")
+:   
+
+ `ndim`*= 2*[](#systemml.defmatrix.matrix.ndim "Permalink to this definition")
+:   
+
+ `negative`()[](#systemml.defmatrix.matrix.negative "Permalink to this definition")
+:   
+
+ `ones_like`()[](#systemml.defmatrix.matrix.ones_like "Permalink to this definition")
+:   
+
+ `print_ast`()[](#systemml.defmatrix.matrix.print_ast "Permalink to this definition")
+:   Please use m.print\_ast() and/or type m for debugging. Here is a
+    sample session:
+
+        >>> npm = np.ones((3,3))
+        >>> m1 = sml.matrix(npm + 3)
+        >>> m2 = sml.matrix(npm + 5)
+        >>> m3 = m1 + m2
+        >>> m3
+        mVar2 = load(" ", format="csv")
+        mVar1 = load(" ", format="csv")
+        mVar3 = mVar1 + mVar2
+        save(mVar3, " ")
+        >>> m3.print_ast()
+        - [mVar3] (op).
+          - [mVar1] (data).
+          - [mVar2] (data).
+
+ `rad2deg`()[](#systemml.defmatrix.matrix.rad2deg "Permalink to this definition")
+:   Convert angles from radians to degrees.
+
+ `reciprocal`()[](#systemml.defmatrix.matrix.reciprocal "Permalink to this definition")
+:   
+
+ `remainder`(*other*)[](#systemml.defmatrix.matrix.remainder "Permalink to this definition")
+:   
+
+ `round`()[](#systemml.defmatrix.matrix.round "Permalink to this definition")
+:   
+
+ `script`*= None*[](#systemml.defmatrix.matrix.script "Permalink to this definition")
+:   
+
+ `sd`(*axis=None*)[](#systemml.defmatrix.matrix.sd "Permalink to this definition")
+:   Compute the standard deviation along the specified axis
+
+    axis : int, optional
+
+ `set_shape`(*shape*)[](#systemml.defmatrix.matrix.set_shape "Permalink to this definition")
+:   
+
+ `shape`[](#systemml.defmatrix.matrix.shape "Permalink to this definition")
+:   
+
+ `sign`()[](#systemml.defmatrix.matrix.sign "Permalink to this definition")
+:   
+
+ `sin`()[](#systemml.defmatrix.matrix.sin "Permalink to this definition")
+:   
+
+ `sqrt`()[](#systemml.defmatrix.matrix.sqrt "Permalink to this definition")
+:   
+
+ `square`()[](#systemml.defmatrix.matrix.square "Permalink to this definition")
+:   
+
+ `sum`(*axis=None*)[](#systemml.defmatrix.matrix.sum "Permalink to this definition")
+:   Compute the sum along the specified axis. 
+
+    axis : int, optional
+
+ `systemmlVarID`*= 0*[](#systemml.defmatrix.matrix.systemmlVarID "Permalink to this definition")
+:   
+
+ `tan`()[](#systemml.defmatrix.matrix.tan "Permalink to this definition")
+:   
+
+ `toDF`()[](#systemml.defmatrix.matrix.toDF "Permalink to this definition")
+:   This is a convenience function that calls the global eval method
+    and then converts the matrix object into DataFrame.
+
+ `toNumPy`()[](#systemml.defmatrix.matrix.toNumPy "Permalink to this definition")
+:   This is a convenience function that calls the global eval method
+    and then converts the matrix object into NumPy array.
+
+ `toPandas`()[](#systemml.defmatrix.matrix.toPandas "Permalink to this definition")
+:   This is a convenience function that calls the global eval method
+    and then converts the matrix object into Pandas DataFrame.
+
+ `trace`()[](#systemml.defmatrix.matrix.trace "Permalink to this definition")
+:   Return the sum of the cells on the main diagonal of a square matrix
+
+ `transpose`()[](#systemml.defmatrix.matrix.transpose "Permalink to this definition")
+:   Transposes the matrix.
+
+ `var`(*axis=None*)[](#systemml.defmatrix.matrix.var "Permalink to this definition")
+:   Compute the variance along the specified axis
+
+    axis : int, optional
+
+ `zeros_like`()[](#systemml.defmatrix.matrix.zeros_like "Permalink to this definition")
+:   
+
+ `systemml.defmatrix.eval`(*outputs*, *outputDF=False*, *execute=True*)[](#systemml.defmatrix.eval "Permalink to this definition")
+:   Executes the unevaluated DML script and computes the matrices
+    specified by outputs.
+
+    outputs: list of matrices or a matrix object.
+
+    outputDF: if True, back the data of the matrix as a PySpark DataFrame.
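+
+    A minimal sketch of evaluating several pending matrices in one call (the alias sml_eval below is only to avoid shadowing Python's built-in eval):
+
+        >>> import numpy as np
+        >>> import systemml as sml
+        >>> from systemml.defmatrix import eval as sml_eval
+        >>> m1 = sml.matrix(np.ones((3,3)))
+        >>> m2 = m1 * 2
+        >>> m3 = m1 + m2
+        >>> sml_eval([m2, m3])   # computes both unevaluated matrices in a single pass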
+
+ `systemml.defmatrix.solve`(*A*, *b*)[](#systemml.defmatrix.solve "Permalink to this definition")
+:   Computes the least squares solution for system of linear equations A
+    %\*% x = b
+
+        >>> import numpy as np
+        >>> from sklearn import datasets
+        >>> import systemml as sml
+        >>> from pyspark.sql import SQLContext
+        >>> diabetes = datasets.load_diabetes()
+        >>> diabetes_X = diabetes.data[:, np.newaxis, 2]
+        >>> X_train = diabetes_X[:-20]
+        >>> X_test = diabetes_X[-20:]
+        >>> y_train = diabetes.target[:-20]
+        >>> y_test = diabetes.target[-20:]
+        >>> sml.setSparkContext(sc)
+        >>> X = sml.matrix(X_train)
+        >>> y = sml.matrix(y_train)
+        >>> A = X.transpose().dot(X)
+        >>> b = X.transpose().dot(y)
+        >>> beta = sml.solve(A, b).toNumPy()
+        >>> y_predicted = X_test.dot(beta)
+        >>> print('Residual sum of squares: %.2f' % np.mean((y_predicted - y_test) ** 2))
+        Residual sum of squares: 25282.12
+
+ `systemml.defmatrix.set_lazy`(*isLazy*)[](#systemml.defmatrix.set_lazy "Permalink to this definition")
+:   This method allows users to set whether matrix operations should be executed in a lazy manner.
+
+    isLazy: True if matrix operations should be evaluated in a lazy manner.
+
+ `systemml.defmatrix.debug_array_conversion`(*throwError*)[](#systemml.defmatrix.debug_array_conversion "Permalink to this definition")
+:   
+
+ `systemml.random.sampling.normal`(*loc=0.0*, *scale=1.0*, *size=(1*, *1)*, *sparsity=1.0*)[](#systemml.random.sampling.normal "Permalink to this definition")
+:   Draw random samples from a normal (Gaussian) distribution.
+
+    loc: Mean ('centre') of the distribution. scale: Standard deviation
+    (spread or 'width') of the distribution. size: Output shape (only
+    tuple of length 2, i.e. (m, n), supported). sparsity: Sparsity
+    (between 0.0 and 1.0).
+
+        >>> import systemml as sml
+        >>> import numpy as np
+        >>> sml.setSparkContext(sc)
+        >>> from systemml import random
+        >>> m1 = sml.random.normal(loc=3, scale=2, size=(3,3))
+        >>> m1.toNumPy()
+        array([[ 3.48857226,  6.17261819,  2.51167259],
+               [ 3.60506708, -1.90266305,  3.97601633],
+               [ 3.62245706,  5.9430881 ,  2.53070413]])
+
+ `systemml.random.sampling.uniform`(*low=0.0*, *high=1.0*, *size=(1*, *1)*, *sparsity=1.0*)[](#systemml.random.sampling.uniform "Permalink to this definition")
+:   Draw samples from a uniform distribution.
+
+    low: Lower boundary of the output interval. high: Upper boundary of
+    the output interval. size: Output shape (only tuple of length 2,
+    i.e. (m, n), supported). sparsity: Sparsity (between 0.0 and 1.0).
+
+        >>> import systemml as sml
+        >>> import numpy as np
+        >>> sml.setSparkContext(sc)
+        >>> from systemml import random
+        >>> m1 = sml.random.uniform(size=(3,3))
+        >>> m1.toNumPy()
+        array([[ 0.54511396,  0.11937437,  0.72975775],
+               [ 0.14135946,  0.01944448,  0.52544478],
+               [ 0.67582422,  0.87068849,  0.02766852]])
+
+ `systemml.random.sampling.poisson`(*lam=1.0*, *size=(1*, *1)*, *sparsity=1.0*)[](#systemml.random.sampling.poisson "Permalink to this definition")
+:   Draw samples from a Poisson distribution.
+
+    lam: Expectation of interval, should be \> 0. size: Output shape
+    (only tuple of length 2, i.e. (m, n), supported). sparsity: Sparsity
+    (between 0.0 and 1.0).
+
+        >>> import systemml as sml
+        >>> import numpy as np
+        >>> sml.setSparkContext(sc)
+        >>> from systemml import random
+        >>> m1 = sml.random.poisson(lam=1, size=(3,3))
+        >>> m1.toNumPy()
+        array([[ 1.,  0.,  2.],
+               [ 1.,  0.,  0.],
+               [ 0.,  0.,  0.]])
+
+
+
+## MLContext API
+
+The Spark MLContext API offers a programmatic interface for interacting with SystemML from Spark using languages such as Scala, Java, and Python. 
+As a result, it offers a convenient way to interact with SystemML from the Spark Shell and from Notebooks such as Jupyter and Zeppelin.
+
+### Usage
+
+The example below demonstrates how to invoke the algorithm [scripts/algorithms/MultiLogReg.dml](https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/MultiLogReg.dml)
+using Python [MLContext API](https://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide).
+
+```python
+from sklearn import datasets, neighbors
+from pyspark.sql import DataFrame, SQLContext
+import systemml as sml
+import pandas as pd
+import os, imp
+sqlCtx = SQLContext(sc)
+digits = datasets.load_digits()
+X_digits = digits.data
+y_digits = digits.target + 1
+n_samples = len(X_digits)
+# Split the data into training/testing sets and convert to PySpark DataFrame
+X_df = sqlCtx.createDataFrame(pd.DataFrame(X_digits[:.9 * n_samples]))
+y_df = sqlCtx.createDataFrame(pd.DataFrame(y_digits[:.9 * n_samples]))
+ml = sml.MLContext(sc)
+# Get the path of MultiLogReg.dml
+scriptPath = os.path.join(imp.find_module("systemml")[1], 'systemml-java', 'scripts', 'algorithms', 'MultiLogReg.dml')
+script = sml.dml(scriptPath).input(X=X_df, Y_vec=y_df).output("B_out")
+beta = ml.execute(script).get('B_out').toNumPy()
+```
+
+### Reference documentation
+
+ *class*`systemml.mlcontext.MLResults`(*results*, *sc*)[](#systemml.mlcontext.MLResults "Permalink to this definition")
+:   Bases: `object`{.xref .py .py-class .docutils .literal}
+
+    Wrapper around a Java ML Results object.
+
+    results: JavaObject
+    :   A Java MLResults object as returned by calling ml.execute().
+    sc: SparkContext
+    :   SparkContext
+
+     `get`(*\*outputs*)[](#systemml.mlcontext.MLResults.get "Permalink to this definition")
+    :   outputs: string, list of strings
+        :   Output variables as defined inside the DML script.
+
+ *class*`systemml.mlcontext.MLContext`(*sc*)[](#systemml.mlcontext.MLContext "Permalink to this definition")
+:   Bases: `object`{.xref .py .py-class .docutils .literal}
+
+    Wrapper around the new SystemML MLContext.
+
+    sc: SparkContext
+    :   SparkContext
+
+ `execute`(*script*)[](#systemml.mlcontext.MLContext.execute "Permalink to this definition")
+:   Execute a DML / PyDML script.
+
+    script: Script instance
+    :   Script instance defined with the appropriate input and
+        output variables.
+
+    ml\_results: MLResults
+    :   MLResults instance.
+
+ `setExplain`(*explain*)[](#systemml.mlcontext.MLContext.setExplain "Permalink to this definition")
+:   Whether to output an explanation of the program. Mainly intended for developers.
+
+    explain: boolean
+
+ `setExplainLevel`(*explainLevel*)[](#systemml.mlcontext.MLContext.setExplainLevel "Permalink to this definition")
+:   Set explain level.
+
+    explainLevel: string
+    :   Can be one of 'hops', 'runtime', 'recompile\_hops',
+        'recompile\_runtime', or any of the above in upper case.
+
+ `setStatistics`(*statistics*)[](#systemml.mlcontext.MLContext.setStatistics "Permalink to this definition")
+:   Whether or not to output statistics (such as execution time,
+    elapsed time) about script executions.
+
+    statistics: boolean
+
+ `setStatisticsMaxHeavyHitters`(*maxHeavyHitters*)[](#systemml.mlcontext.MLContext.setStatisticsMaxHeavyHitters "Permalink to this definition")
+:   The maximum number of heavy hitters that are printed as part of
+    the statistics.
+
+    maxHeavyHitters: int
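+
+As a minimal sketch of configuring these options before running a script (reusing the `script` object from the Usage section above; the flag values are illustrative):
+
+```python
+ml = sml.MLContext(sc)
+ml.setStatistics(True)                 # print execution statistics after each execute()
+ml.setStatisticsMaxHeavyHitters(10)    # limit the number of heavy hitters printed
+ml.setExplain(True)
+ml.setExplainLevel('runtime')          # one of 'hops', 'runtime', 'recompile_hops', 'recompile_runtime'
+beta = ml.execute(script).get('B_out').toNumPy()
+```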
+
+ *class*`systemml.mlcontext.Script`(*scriptString*, *scriptType='dml'*)[](#systemml.mlcontext.Script "Permalink to this definition")
+:   Bases: `object`{.xref .py .py-class .docutils .literal}
+
+    Instance of a DML/PyDML Script.
+
+    scriptString: string
+    :   Can be either a file path to a DML script or a DML script
+        itself.
+    scriptType: string
+    :   Script language, either 'dml' for DML (R-like) or 'pydml' for
+        PyDML (Python-like).
+
+ `input`(*\*args*, *\*\*kwargs*)[](#systemml.mlcontext.Script.input "Permalink to this definition")
+:   args: name, value tuple
+    :   where name is a string, and currently supported value
+        formats are double, string, dataframe, rdd, and lists of such
+        objects.
+    kwargs: dict of name, value pairs
+    :   To know what formats are supported for name and value, look
+        above.
+
+ `output`(*\*names*)[](#systemml.mlcontext.Script.output "Permalink to this definition")
+:   names: string, list of strings
+    :   Output variables as defined inside the DML script.
+
+ `systemml.mlcontext.dml`(*scriptString*)[](#systemml.mlcontext.dml "Permalink to this definition")
+:   Create a dml script object based on a string.
+
+    scriptString: string
+    :   Can be a path to a dml script or a dml script itself.
+
+    script: Script instance
+    :   Instance of a script object.
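+
+    A minimal sketch passing an inline DML string instead of a file path (the one-line script and variable names are illustrative):
+
+        >>> import systemml as sml
+        >>> ml = sml.MLContext(sc)
+        >>> script = sml.dml('y = x * 2').input(x=5.0).output('y')
+        >>> y = ml.execute(script).get('y')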
+
+ `systemml.mlcontext.pydml`(*scriptString*)[](#systemml.mlcontext.pydml "Permalink to this definition")
+:   Create a pydml script object based on a string.
+
+    scriptString: string
+    :   Can be a path to a pydml script or a pydml script itself.
+
+    script: Script instance
+    :   Instance of a script object.
+
+ `systemml.mlcontext.getNumCols`(*numPyArr*)[](#systemml.mlcontext.getNumCols "Permalink to this definition")
+:   
+
+ `systemml.mlcontext.convertToMatrixBlock`(*sc*, *src*)[](#systemml.mlcontext.convertToMatrixBlock "Permalink to this definition")
+:   
+
+ `systemml.mlcontext.convertToNumPyArr`(*sc*, *mb*)[](#systemml.mlcontext.convertToNumPyArr "Permalink to this definition")
+:   
+
+ `systemml.mlcontext.convertToPandasDF`(*X*)[](#systemml.mlcontext.convertToPandasDF "Permalink to this definition")
+:   
+
+ `systemml.mlcontext.convertToLabeledDF`(*sqlCtx*, *X*, *y=None*)[](#systemml.mlcontext.convertToLabeledDF "Permalink to this definition")
+:   
+
+
+## mllearn API
+
+### Usage
+
+```python
+# Scikit-learn way
+from sklearn import datasets, neighbors
+from systemml.mllearn import LogisticRegression
+from pyspark.sql import SQLContext
+sqlCtx = SQLContext(sc)
+digits = datasets.load_digits()
+X_digits = digits.data
+y_digits = digits.target 
+n_samples = len(X_digits)
+X_train = X_digits[:.9 * n_samples]
+y_train = y_digits[:.9 * n_samples]
+X_test = X_digits[.9 * n_samples:]
+y_test = y_digits[.9 * n_samples:]
+logistic = LogisticRegression(sqlCtx)
+print('LogisticRegression score: %f' % logistic.fit(X_train, y_train).score(X_test, y_test))
+```
+
+Output:
+
+```bash
+LogisticRegression score: 0.922222
+```
+
+### Reference documentation
+
+ *class*`systemml.mllearn.estimators.LinearRegression`(*sqlCtx*, *fit\_intercept=True*, *max\_iter=100*, *tol=1e-06*, *C=1.0*, *solver='newton-cg'*, *transferUsingDF=False*)[](#systemml.mllearn.estimators.LinearRegression "Permalink to this definition")
+:   Bases: `systemml.mllearn.estimators.BaseSystemMLRegressor`{.xref .py
+    .py-class .docutils .literal}
+
+    Performs linear regression to model the relationship between one
+    numerical response variable and one or more explanatory (feature)
+    variables.
+
+        >>> import numpy as np
+        >>> from sklearn import datasets
+        >>> from systemml.mllearn import LinearRegression
+        >>> from pyspark.sql import SQLContext
+        >>> # Load the diabetes dataset
+        >>> diabetes = datasets.load_diabetes()
+        >>> # Use only one feature
+        >>> diabetes_X = diabetes.data[:, np.newaxis, 2]
+        >>> # Split the data into training/testing sets
+        >>> diabetes_X_train = diabetes_X[:-20]
+        >>> diabetes_X_test = diabetes_X[-20:]
+        >>> # Split the targets into training/testing sets
+        >>> diabetes_y_train = diabetes.target[:-20]
+        >>> diabetes_y_test = diabetes.target[-20:]
+        >>> # Create linear regression object
+        >>> regr = LinearRegression(sqlCtx, solver='newton-cg')
+        >>> # Train the model using the training sets
+        >>> regr.fit(diabetes_X_train, diabetes_y_train)
+        >>> # The mean square error
+        >>> print("Residual sum of squares: %.2f" % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
+
+ *class*`systemml.mllearn.estimators.LogisticRegression`(*sqlCtx*, *penalty='l2'*, *fit\_intercept=True*, *max\_iter=100*, *max\_inner\_iter=0*, *tol=1e-06*, *C=1.0*, *solver='newton-cg'*, *transferUsingDF=False*)[](#systemml.mllearn.estimators.LogisticRegression "Permalink to this definition")
+:   Bases: `systemml.mllearn.estimators.BaseSystemMLClassifier`{.xref
+    .py .py-class .docutils .literal}
+
+    Performs both binomial and multinomial logistic regression.
+
+    Scikit-learn way
+
+        >>> from sklearn import datasets, neighbors
+        >>> from systemml.mllearn import LogisticRegression
+        >>> from pyspark.sql import SQLContext
+        >>> sqlCtx = SQLContext(sc)
+        >>> digits = datasets.load_digits()
+        >>> X_digits = digits.data
+        >>> y_digits = digits.target + 1
+        >>> n_samples = len(X_digits)
+        >>> X_train = X_digits[:.9 * n_samples]
+        >>> y_train = y_digits[:.9 * n_samples]
+        >>> X_test = X_digits[.9 * n_samples:]
+        >>> y_test = y_digits[.9 * n_samples:]
+        >>> logistic = LogisticRegression(sqlCtx)
+        >>> print('LogisticRegression score: %f' % logistic.fit(X_train, y_train).score(X_test, y_test))
+
+    MLPipeline way
+
+        >>> from pyspark.ml import Pipeline
+        >>> from systemml.mllearn import LogisticRegression
+        >>> from pyspark.ml.feature import HashingTF, Tokenizer
+        >>> from pyspark.sql import SQLContext
+        >>> sqlCtx = SQLContext(sc)
+        >>> training = sqlCtx.createDataFrame([
+        >>>     (0L, "a b c d e spark", 1.0),
+        >>>     (1L, "b d", 2.0),
+        >>>     (2L, "spark f g h", 1.0),
+        >>>     (3L, "hadoop mapreduce", 2.0),
+        >>>     (4L, "b spark who", 1.0),
+        >>>     (5L, "g d a y", 2.0),
+        >>>     (6L, "spark fly", 1.0),
+        >>>     (7L, "was mapreduce", 2.0),
+        >>>     (8L, "e spark program", 1.0),
+        >>>     (9L, "a e c l", 2.0),
+        >>>     (10L, "spark compile", 1.0),
+        >>>     (11L, "hadoop software", 2.0)
+        >>> ], ["id", "text", "label"])
+        >>> tokenizer = Tokenizer(inputCol="text", outputCol="words")
+        >>> hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=20)
+        >>> lr = LogisticRegression(sqlCtx)
+        >>> pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
+        >>> model = pipeline.fit(training)
+        >>> test = sqlCtx.createDataFrame([
+        >>>     (12L, "spark i j k"),
+        >>>     (13L, "l m n"),
+        >>>     (14L, "mapreduce spark"),
+        >>>     (15L, "apache hadoop")], ["id", "text"])
+        >>> prediction = model.transform(test)
+        >>> prediction.show()
+
+ *class*`systemml.mllearn.estimators.SVM`(*sqlCtx*, *fit\_intercept=True*, *max\_iter=100*, *tol=1e-06*, *C=1.0*, *is\_multi\_class=False*, *transferUsingDF=False*)[](#systemml.mllearn.estimators.SVM "Permalink to this definition")
+:   Bases: `systemml.mllearn.estimators.BaseSystemMLClassifier`
+
+    Performs both binary-class and multiclass SVM (Support Vector
+    Machines).
+
+        >>> from sklearn import datasets, neighbors
+        >>> from systemml.mllearn import SVM
+        >>> from pyspark.sql import SQLContext
+        >>> sqlCtx = SQLContext(sc)
+        >>> digits = datasets.load_digits()
+        >>> X_digits = digits.data
+        >>> y_digits = digits.target 
+        >>> n_samples = len(X_digits)
+        >>> X_train = X_digits[:int(.9 * n_samples)]
+        >>> y_train = y_digits[:int(.9 * n_samples)]
+        >>> X_test = X_digits[int(.9 * n_samples):]
+        >>> y_test = y_digits[int(.9 * n_samples):]
+        >>> svm = SVM(sqlCtx, is_multi_class=True)
+        >>> print('SVM score: %f' % svm.fit(X_train, y_train).score(X_test, y_test))
+
+ *class*`systemml.mllearn.estimators.NaiveBayes`(*sqlCtx*, *laplace=1.0*, *transferUsingDF=False*)(#systemml.mllearn.estimators.NaiveBayes "Permalink to this definition")
+:   Bases: `systemml.mllearn.estimators.BaseSystemMLClassifier`
+
+    Performs Naive Bayes.
+
+        >>> from sklearn.datasets import fetch_20newsgroups
+        >>> from sklearn.feature_extraction.text import TfidfVectorizer
+        >>> from systemml.mllearn import NaiveBayes
+        >>> from sklearn import metrics
+        >>> from pyspark.sql import SQLContext
+        >>> sqlCtx = SQLContext(sc)
+        >>> categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
+        >>> newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
+        >>> newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
+        >>> vectorizer = TfidfVectorizer()
+        >>> # Both vectors and vectors_test are SciPy CSR matrix
+        >>> vectors = vectorizer.fit_transform(newsgroups_train.data)
+        >>> vectors_test = vectorizer.transform(newsgroups_test.data)
+        >>> nb = NaiveBayes(sqlCtx)
+        >>> nb.fit(vectors, newsgroups_train.target)
+        >>> pred = nb.predict(vectors_test)
+        >>> metrics.f1_score(newsgroups_test.target, pred, average='weighted')
+
+
+## Utility classes (used internally)
+
+### systemml.classloader 
+
+ `systemml.classloader.createJavaObject`(*sc*, *obj\_type*)[](#systemml.classloader.createJavaObject "Permalink to this definition")
+:   Performs the appropriate check for whether SystemML.jar is available and returns
+    a handle to the MLContext object on the JVM.
+
+    sc: SparkContext
+    :   SparkContext
+
+    obj\_type: Type of object to create ('mlcontext' or 'dummy')
+
+### systemml.converters
+
+ `systemml.converters.getNumCols`(*numPyArr*)[](#systemml.converters.getNumCols "Permalink to this definition")
+:   
+
+ `systemml.converters.convertToMatrixBlock`(*sc*, *src*)[](#systemml.converters.convertToMatrixBlock "Permalink to this definition")
+:   
+
+ `systemml.converters.convertToNumPyArr`(*sc*, *mb*)[](#systemml.converters.convertToNumPyArr "Permalink to this definition")
+:   
+
+ `systemml.converters.convertToPandasDF`(*X*)[](#systemml.converters.convertToPandasDF "Permalink to this definition")
+:   
+
+ `systemml.converters.convertToLabeledDF`(*sqlCtx*, *X*, *y=None*)[](#systemml.converters.convertToLabeledDF "Permalink to this definition")
+:  
+
+### Other classes from systemml.defmatrix
+
+ *class*`systemml.defmatrix.DMLOp`(*inputs*, *dml=None*)[](#systemml.defmatrix.DMLOp "Permalink to this definition")
+:   Bases: `object`
+
+    Represents an intermediate node of the abstract syntax tree created to
+    generate the PyDML script.
+
+
+## Troubleshooting Python APIs
+
+#### Unable to load SystemML.jar into current pyspark session.
+
+While using SystemML's Python package through pyspark or a notebook, where a SparkContext has already been created for the
+session, the method below is not required. However, if the user wishes to use SystemML through spark-submit and a SparkContext
+has not previously been created in the session, the following method must be invoked before using the matrix API:
+
+ `systemml.defmatrix.setSparkContext`(*sc*)
+:   Before using the matrix, the user needs to invoke this function if SparkContext is not previously created in the session.
+
+    sc: SparkContext
+    :   SparkContext
+
+Example:
+
+```python
+import systemml as sml
+import numpy as np
+sml.setSparkContext(sc)
+m1 = sml.matrix(np.ones((3,3)) + 2)
+m2 = sml.matrix(np.ones((3,3)) + 3)
+m2 = m1 * (m2 + m1)
+m4 = 1.0 - m2
+m4.sum(axis=1).toNumPy()
+```
+
+If SystemML was not installed via pip, you may have to download SystemML.jar and provide it to pyspark via `--driver-class-path` and `--jars`. 
+
+#### matrix API runs slowly when set_lazy(False) is used or when eval() is called often.
+
+This is a known issue. The matrix API is slow in this scenario due to the slow Py4J conversion from a Java MatrixObject or Java RDD to a Python NumPy array or DataFrame.
+To work around this for now, we recommend writing the matrix to the file system and reading it back with the `load` function.
+
+#### maximum recursion depth exceeded
+
+The SystemML matrix is backed by lazy evaluation, and its expression graph is traversed with a recursive depth-first search (DFS).
+Python can throw `RuntimeError: maximum recursion depth exceeded` when the DFS recursion exceeds the limit
+set by Python. There are two ways to address it:
+
+1. Increase the limit in Python:
+ 
+	```python
+	import sys
+	some_large_number = 2000
+	sys.setrecursionlimit(some_large_number)
+	```
+
+2. Evaluate the intermediate matrix to cut off the long recursion chain, as shown in the sketch below.
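+
+A minimal sketch of option 2 (illustrative only; it assumes an existing SparkContext `sc` and the `eval()` method referenced in the previous section, and uses arbitrary sizes and iteration counts):
+
+```python
+import numpy as np
+import systemml as sml
+
+sml.setSparkContext(sc)        # sc is an existing SparkContext
+
+m = sml.matrix(np.eye(5))
+for i in range(500):
+    m = m + i                  # each iteration adds a node to the lazy expression tree
+    if (i + 1) % 50 == 0:
+        m.eval()               # force evaluation periodically to keep the tree shallow
+print(m.sum().toNumPy())
+```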
\ No newline at end of file


[19/50] [abbrv] incubator-systemml git commit: [MINOR] Update doc version to match project version

Posted by de...@apache.org.
[MINOR] Update doc version to match project version


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/f80ab128
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/f80ab128
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/f80ab128

Branch: refs/heads/gh-pages
Commit: f80ab12858adc1df54d2f5ba631e37e04ac651e2
Parents: fc9914d
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Fri Feb 3 11:21:09 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Fri Feb 3 11:21:09 2017 -0800

----------------------------------------------------------------------
 _config.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/f80ab128/_config.yml
----------------------------------------------------------------------
diff --git a/_config.yml b/_config.yml
index 5f291fe..2f8c3e7 100644
--- a/_config.yml
+++ b/_config.yml
@@ -11,7 +11,7 @@ include:
   - _modules
 
 # These allow the documentation to be updated with newer releases
-SYSTEMML_VERSION: 0.11.0
+SYSTEMML_VERSION: 0.13.0
 
 # if 'analytics_on' is true, analytics section will be rendered on the HTML pages
 analytics_on: true


[40/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1393] Exclude alg ref and lang ref dirs from doc site

Posted by de...@apache.org.
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/DescriptiveBivarStats.tex
----------------------------------------------------------------------
diff --git a/alg-ref/DescriptiveBivarStats.tex b/alg-ref/DescriptiveBivarStats.tex
new file mode 100644
index 0000000..a2d3db1
--- /dev/null
+++ b/alg-ref/DescriptiveBivarStats.tex
@@ -0,0 +1,438 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Bivariate Statistics}
+
+\noindent{\bf Description}
+\smallskip
+
+Bivariate statistics are used to quantitatively describe the association between
+two features in a sample, for example to test their statistical (in-)dependence or to measure
+how accurately one data feature predicts the other.
+The \BivarScriptName{} script computes common bivariate statistics,
+such as \NameStatR{} and \NameStatChi{}, in parallel for many pairs
+of data features.  For a given dataset matrix, script \BivarScriptName{} computes
+certain bivariate statistics for the given feature (column) pairs in the
+matrix.  The feature types govern the exact set of statistics computed for that pair.
+For example, \NameStatR{} can only be computed on two quantitative (scale)
+features like `Height' and `Temperature'. 
+It does not make sense to compute the linear correlation of two categorical attributes
+like `Hair Color'. 
+
+
+\smallskip
+\noindent{\bf Usage}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%\tolerance=0
+{\tt{}-f }path/\/\BivarScriptName{}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} index1=}path/file
+{\tt{} index2=}path/file
+{\tt{} types1=}path/file
+{\tt{} types2=}path/file
+{\tt{} OUTDIR=}path
+% {\tt{} fmt=}format
+
+}
+
+
+\smallskip
+\noindent{\bf Arguments}
+\begin{Description}
+\item[{\tt X}:]
+Location (on HDFS) to read the data matrix $X$ whose columns are the features
+that we want to compare and correlate with bivariate statistics.
+\item[{\tt index1}:] % (default:\mbox{ }{\tt " "})
+Location (on HDFS) to read the single-row matrix that lists the column indices
+of the \emph{first-argument} features in pairwise statistics.
+Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the
+index $k$ of column \texttt{X[,$\,k$]} in the data matrix
+whose bivariate statistics need to be computed.
+% The default value means ``use all $X$-columns from the first to the last.''
+\item[{\tt index2}:] % (default:\mbox{ }{\tt " "})
+Location (on HDFS) to read the single-row matrix that lists the column indices
+of the \emph{second-argument} features in pairwise statistics.
+Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the
+index $l$ of column \texttt{X[,$\,l$]} in the data matrix
+whose bivariate statistics need to be computed.
+% The default value means ``use all $X$-columns from the first to the last.''
+\item[{\tt types1}:] % (default:\mbox{ }{\tt " "})
+Location (on HDFS) to read the single-row matrix that lists the \emph{types}
+of the \emph{first-argument} features in pairwise statistics.
+Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the type
+of column \texttt{X[,$\,k$]} in the data matrix, where $k$ is the $i^{\textrm{th}}$
+entry in the {\tt index1} matrix.  Feature types must be encoded by
+integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
+% The default value means ``treat all referenced $X$-columns as scale.''
+\item[{\tt types2}:] % (default:\mbox{ }{\tt " "})
+Location (on HDFS) to read the single-row matrix that lists the \emph{types}
+of the \emph{second-argument} features in pairwise statistics.
+Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the type
+of column \texttt{X[,$\,l$]} in the data matrix, where $l$ is the $j^{\textrm{th}}$
+entry in the {\tt index2} matrix.  Feature types must be encoded by
+integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
+% The default value means ``treat all referenced $X$-columns as scale.''
+\item[{\tt OUTDIR}:]
+Location path (on HDFS) where the output matrices with computed bivariate
+statistics will be stored.  The matrices' file names and format are defined
+in Table~\ref{table:bivars}.
+% \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+% Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+% see read/write functions in SystemML Language Reference for details.
+\end{Description}
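+
+For illustration (a hypothetical column layout, not part of the script itself), suppose we wish
+to pair column~1 (scale) with column~2 (nominal) and column~1 with column~4 (scale).  Then
+{\tt index1} would contain the $1\times 2$ matrix $(1,\ 1)$, {\tt index2} the matrix $(2,\ 4)$,
+{\tt types1} the matrix $(1,\ 1)$, and {\tt types2} the matrix $(2,\ 1)$, following the encoding
+$1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.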
+
+\begin{table}[t]\hfil
+\begin{tabular}{|lll|}
+\hline\rule{0pt}{12pt}%
+Output File / Matrix        & Row$\,$\# & Name of Statistic   \\[2pt]
+\hline\hline\rule{0pt}{12pt}%
+\emph{All Files}            &     1     & 1-st feature column \\
+\rule{1em}{0pt}"            &     2     & 2-nd feature column \\[2pt]
+\hline\rule{0pt}{12pt}%
+bivar.scale.scale.stats     &     3     & \NameStatR          \\[2pt]
+\hline\rule{0pt}{12pt}%
+bivar.nominal.nominal.stats &     3     & \NameStatChi        \\
+\rule{1em}{0pt}"            &     4     & Degrees of freedom  \\
+\rule{1em}{0pt}"            &     5     & \NameStatPChi       \\
+\rule{1em}{0pt}"            &     6     & \NameStatV          \\[2pt]
+\hline\rule{0pt}{12pt}%
+bivar.nominal.scale.stats   &     3     & \NameStatEta        \\
+\rule{1em}{0pt}"            &     4     & \NameStatF          \\[2pt]
+\hline\rule{0pt}{12pt}%
+bivar.ordinal.ordinal.stats &     3     & \NameStatRho        \\[2pt]
+\hline
+\end{tabular}\hfil
+\caption{%
+The output matrices of \BivarScriptName{} have one row per one bivariate
+statistic and one column per one pair of input features.  This table lists
+the meaning of each matrix and each row.%
+% Signs ``+'' show applicability to scale or/and to categorical features.
+}
+\label{table:bivars}
+\end{table}
+
+
+
+\pagebreak[2]
+
+\noindent{\bf Details}
+\smallskip
+
+Script \BivarScriptName{} takes an input matrix \texttt{X} whose columns represent
+the features and whose rows represent the records of a data sample.
+Given \texttt{X}, the script computes certain relevant bivariate statistics
+for specified pairs of feature columns \texttt{X[,$\,i$]} and \texttt{X[,$\,j$]}.
+Command-line parameters \texttt{index1} and \texttt{index2} specify the files with
+column pairs of interest to the user.  Namely, the file given by \texttt{index1}
+contains the vector of the 1st-attribute column indices and the file given
+by \texttt{index2} has the vector of the 2nd-attribute column indices, with
+``1st'' and ``2nd'' referring to their places in bivariate statistics.
+Note that both \texttt{index1} and \texttt{index2} files should contain a 1-row matrix
+of positive integers.
+
+The bivariate statistics to be computed depend on the \emph{types}, or
+\emph{measurement levels}, of the two columns.
+The types for each pair are provided in the files whose locations are specified by
+\texttt{types1} and \texttt{types2} command-line parameters.
+These files are also 1-row matrices, i.e.\ vectors, that list the 1st-attribute and
+the 2nd-attribute column types in the same order as their indices in the
+\texttt{index1} and \texttt{index2} files.  The types must be provided as per
+the following convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
+
+The script organizes its results into (potentially) four output matrices, one per
+type combination.  The types of bivariate statistics are defined using the types
+of the columns that were used for their arguments, with ``ordinal'' sometimes
+retrogressing to ``nominal.''  Table~\ref{table:bivars} describes what each column
+in each output matrix contains.  In particular, the script includes the following
+statistics:
+\begin{Itemize}
+\item For a pair of scale (quantitative) columns, \NameStatR;
+\item For a pair of nominal columns (with finite-sized, fixed, unordered domains), 
+the \NameStatChi{} and its p-value;
+\item For a pair of one scale column and one nominal column, \NameStatF{};
+\item For a pair of ordinal columns (ordered domains depicting ranks), \NameStatRho.
+\end{Itemize}
+Note that, as shown in Table~\ref{table:bivars}, the output matrices contain the
+column indices of the features involved in each statistic.
+Moreover, if the output matrix does not contain
+a value in a certain cell then it should be interpreted as a~$0$
+(sparse matrix representation).
+
+Below we list all bivariate statistics computed by script \BivarScriptName.
+The statistics are collected into several groups by the type of their input
+features.  We refer to the two input features as $v_1$ and $v_2$ unless
+specified otherwise; the value pairs are $(v_{1,i}, v_{2,i})$ for $i=1,\ldots,n$,
+where $n$ is the number of rows in \texttt{X}, i.e.\ the sample size.
+
+
+\paragraph{Scale-vs-scale statistics.}
+Sample statistics that describe association between two quantitative (scale) features.
+A scale feature has numerical values, with the natural ordering relation.
+\begin{Description}
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it\NameStatR]:
+A measure of linear dependence between two numerical features:
+\begin{equation*}
+r \,\,=\,\, \frac{\Cov(v_1, v_2)}{\sqrt{\Var v_1 \Var v_2}}
+\,\,=\,\, \frac{\sum_{i=1}^n (v_{1,i} - \bar{v}_1) (v_{2,i} - \bar{v}_2)}%
+{\sqrt{\sum_{i=1}^n (v_{1,i} - \bar{v}_1)^{2\mathstrut} \cdot \sum_{i=1}^n (v_{2,i} - \bar{v}_2)^{2\mathstrut}}}
+\end{equation*}
+Commonly denoted by~$r$, correlation ranges between $-1$ and $+1$, reaching ${\pm}1$ when all value
+pairs $(v_{1,i}, v_{2,i})$ lie on the same line.  Correlation near~0 means that a line is not a good
+way to represent the dependence between the two features; however, this does not imply independence.
+The sign indicates direction of the linear association: $r > 0$ ($r < 0$) if one feature tends to
+linearly increase (decrease) when the other feature increases.  Nonlinear association, if present,
+may disobey this sign.
+\NameStatR{} is symmetric: $r(v_1, v_2) = r(v_2, v_1)$; it does not change if we transform $v_1$ and $v_2$
+to $a + b v_1$ and $c + d v_2$ where $a, b, c, d$ are constants and $b, d > 0$.
+
+Suppose that we use simple linear regression to represent one feature given the other, say
+represent $v_{2,i} \approx \alpha + \beta v_{1,i}$ by selecting $\alpha$ and $\beta$
+to minimize the least-squares error $\sum_{i=1}^n (v_{2,i} - \alpha - \beta v_{1,i})^2$.
+Then the best error equals
+\begin{equation*}
+\min_{\alpha, \beta} \,\,\sum_{i=1}^n \big(v_{2,i} - \alpha - \beta v_{1,i}\big)^2 \,\,=\,\,
+(1 - r^2) \,\sum_{i=1}^n \big(v_{2,i} - \bar{v}_2\big)^2
+\end{equation*}
+In other words, $1\,{-}\,r^2$ is the ratio of the residual sum of squares to
+the total sum of squares.  Hence, $r^2$ is an accuracy measure of the linear regression.
+\end{Description}
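+
+As a small worked illustration (with made-up values), let $v_1 = (1, 2, 3)$ and $v_2 = (1, 3, 2)$.
+Both means equal~$2$, so the deviations are $(-1, 0, 1)$ and $(-1, 1, 0)$, and
+\begin{equation*}
+r \,\,=\,\, \frac{(-1)(-1) + 0\cdot 1 + 1\cdot 0}{\sqrt{2 \cdot 2}} \,\,=\,\, 0.5,
+\end{equation*}
+a moderate positive linear association.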
+
+
+\paragraph{Nominal-vs-nominal statistics.}
+Sample statistics that describe association between two nominal categorical features.
+Both features' value domains are encoded with positive integers in arbitrary order:
+nominal features do not order their value domains.
+\begin{Description}
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it\NameStatChi]:
+A measure of how much the frequencies of value pairs of two categorical features deviate from
+statistical independence.  Under independence, the probability of every value pair must equal
+the product of probabilities of each value in the pair:
+$\Prob[a, b] - \Prob[a]\,\Prob[b] = 0$.  But we do not know these (hypothesized) probabilities;
+we only know the sample frequency counts.  Let $n_{a,b}$ be the frequency count of pair
+$(a, b)$, let $n_a$ and $n_b$ be the frequency counts of $a$~alone and of $b$~alone.  Under
+independence, difference $n_{a,b}{/}n - (n_a{/}n)(n_b{/}n)$ is unlikely to be exactly~0 due
+to sample randomness, yet it is unlikely to be too far from~0.  For some pairs $(a,b)$ it may
+deviate from~0 farther than for other pairs.  \NameStatChi{}~is an aggregate measure that
+combines squares of these differences across all value pairs:
+\begin{equation*}
+\chi^2 \,\,=\,\, \sum_{a,\,b} \Big(\frac{n_a n_b}{n}\Big)^{-1} \Big(n_{a,b} - \frac{n_a n_b}{n}\Big)^2
+\,=\,\, \sum_{a,\,b} \frac{(O_{a,b} - E_{a,b})^2}{E_{a,b}}
+\end{equation*}
+where $O_{a,b} = n_{a,b}$ are the \emph{observed} frequencies and $E_{a,b} = (n_a n_b){/}n$ are
+the \emph{expected} frequencies for all pairs~$(a,b)$.  Under independence (plus other standard
+assumptions) the sample~$\chi^2$ closely follows a well-known distribution, making it a basis for
+statistical tests for independence, see~\emph{\NameStatPChi} for details.  Note that \NameStatChi{}
+does \emph{not} measure the strength of dependence: even very weak dependence may result in a
+significant deviation from independence if the counts are large enough.  Use~\NameStatV{} instead
+to measure the strength of dependence.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Degrees of freedom]:
+An integer parameter required for the interpretation of~\NameStatChi{} measure.  Under independence
+(plus other standard assumptions) the sample~$\chi^2$ statistic is approximately distributed as the
+sum of $d$~squares of independent normal random variables with mean~0 and variance~1, where $d$ is
+this integer parameter.  For a pair of categorical features such that the $1^{\textrm{st}}$~feature
+has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories, the number of degrees
+of freedom is $d = (k_1 - 1)(k_2 - 1)$.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it\NameStatPChi]:
+A measure of how likely we would observe the current frequencies of value pairs of two categorical
+features assuming their statistical independence.  More precisely, it computes the probability that
+the sum of $d$~squares of independent normal random variables with mean~0 and variance~1
+(called the $\chi^2$~distribution with $d$ degrees of freedom) generates a value at least as large
+as the current sample \NameStatChi.  The $d$ parameter is \emph{degrees of freedom}, see above.
+Under independence (plus other standard assumptions) the sample \NameStatChi{} closely follows the
+$\chi^2$~distribution and is unlikely to land very far into its tail.  On the other hand, if the
+two features are dependent, their sample \NameStatChi{} becomes arbitrarily large as $n\to\infty$
+and lands extremely far into the tail of the $\chi^2$~distribution given a large enough data sample.
+\NameStatPChi{} returns the tail ``weight'' on the right-hand side of \NameStatChi:
+\begin{equation*}
+P\,\,=\,\, \Prob\big[r \geq \textrm{\NameStatChi} \,\,\big|\,\, r \sim \textrm{the $\chi^2$ distribution}\big]
+\end{equation*}
+As any probability, $P$ ranges between 0 and~1.  If $P\leq 0.05$, the dependence between the two
+features may be considered statistically significant (i.e.\ their independence is considered
+statistically ruled out).  For highly dependent features, it is not unusual to have $P\leq 10^{-20}$
+or less, in which case our script will simply return $P = 0$.  Independent features should have
+their $P\geq 0.05$ in about 95\% of cases.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it\NameStatV]:
+A measure for the strength of association, i.e.\ of statistical dependence, between two categorical
+features, conceptually similar to \NameStatR.  It divides the observed~\NameStatChi{} by the maximum
+possible~$\chi^2_{\textrm{max}}$ given $n$ and the number $k_1, k_2$~of categories in each feature,
+then takes the square root.  Thus, \NameStatV{} ranges from 0 to~1,
+where 0 implies no association and 1 implies the maximum possible association (one-to-one
+correspondence) between the two features.  See \emph{\NameStatChi} for the computation of~$\chi^2$;
+its maximum${} = {}$%
+$n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}$ where the $1^{\textrm{st}}$~feature
+has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories~\cite{AcockStavig1979:CramersV},
+so
+\begin{equation*}
+\textrm{\NameStatV} \,\,=\,\, \sqrt{\frac{\textrm{\NameStatChi}}{n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}}}
+\end{equation*}
+As opposed to \NameStatPChi, which goes to~0 (rapidly) as the features' dependence increases,
+\NameStatV{} goes towards~1 (slowly) as the dependence increases.  Both \NameStatChi{} and
+\NameStatPChi{} are very sensitive to~$n$, but in \NameStatV{} this is mitigated by taking the
+ratio.
+\end{Description}
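+
+As a small worked illustration (with made-up counts), consider two binary features with
+observed pair counts $n_{1,1} = 30$, $n_{1,2} = 10$, $n_{2,1} = 20$, $n_{2,2} = 40$, so that
+$n = 100$ with marginal counts $(40, 60)$ and $(50, 50)$.  The expected counts under
+independence are $20, 20, 30, 30$, hence
+\begin{equation*}
+\chi^2 \,=\, \frac{10^2}{20} + \frac{10^2}{20} + \frac{10^2}{30} + \frac{10^2}{30} \,\approx\, 16.7,
+\quad d = (2-1)(2-1) = 1, \quad
+\textrm{\NameStatV} \,=\, \sqrt{16.7 / 100} \,\approx\, 0.41,
+\end{equation*}
+and the corresponding \NameStatPChi{} is far below $0.05$, so the independence hypothesis
+would be rejected.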
+
+
+\paragraph{Nominal-vs-scale statistics.}
+Sample statistics that describe association between a categorical feature
+(order ignored) and a quantitative (scale) feature.
+The values of the categorical feature must be coded as positive integers.
+\begin{Description}
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it\NameStatEta]:
+A measure for the strength of association (statistical dependence) between a nominal feature
+and a scale feature, conceptually similar to \NameStatR.  Ranges from 0 to~1, approaching 0
+when there is no association and approaching 1 when there is a strong association.  
+The nominal feature, treated as the independent variable, is assumed to have relatively few
+possible values, all with large frequency counts.  The scale feature is treated as the dependent
+variable.  Denoting the nominal feature by~$x$ and the scale feature by~$y$, we have:
+\begin{equation*}
+\eta^2 \,=\, 1 - \frac{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
+\,\,\,\,\textrm{where}\,\,\,\,
+\hat{y}[x] = \frac{1}{\mathop{\mathrm{freq}}(x)}\sum_{i=1}^n  
+\,\left\{\!\!\begin{array}{rl} y_i & \textrm{if $x_i = x$}\\ 0 & \textrm{otherwise}\end{array}\right.\!\!\!
+\end{equation*}
+and $\bar{y} = (1{/}n)\sum_{i=1}^n y_i$ is the mean.  Value $\hat{y}[x]$ is the average 
+of~$y_i$ among all records where $x_i = x$; it can also be viewed as the ``predictor'' 
+of $y$ given~$x$.  Then $\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2$ is the residual error
+sum-of-squares and $\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum-of-squares for~$y$. 
+Hence, $\eta^2$ measures the accuracy of predicting $y$ with~$x$, just like the
+``R-squared'' statistic measures the accuracy of linear regression.  Our output $\eta$
+is the square root of~$\eta^2$.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it\NameStatF]:
+A measure of how much the values of the scale feature, denoted here by~$y$,
+deviate from statistical independence of the nominal feature, denoted by~$x$.
+The same measure appears in the one-way analysis of vari\-ance (ANOVA).
+Like \NameStatChi, \NameStatF{} is used to test the hypothesis that
+$y$~is independent from~$x$, given the following assumptions:
+\begin{Itemize}
+\item The scale feature $y$ has approximately normal distribution whose mean
+may depend only on~$x$ and variance is the same for all~$x$;
+\item The nominal feature $x$ has relatively small value domain with large
+frequency counts, the $x_i$-values are treated as fixed (non-random);
+\item All records are sampled independently of each other.
+\end{Itemize}
+To compute \NameStatF{}, we first compute $\hat{y}[x]$ as the average of~$y_i$
+among all records where $x_i = x$.  These $\hat{y}[x]$ can be viewed as
+``predictors'' of $y$ given~$x$; if $y$ is independent of~$x$, they should
+``predict'' only the global mean~$\bar{y}$.  Then we form two sums-of-squares:
+\begin{Itemize}
+\item \emph{Residual} sum-of-squares of the ``predictor'' accuracy: $y_i - \hat{y}[x_i]$;
+\item \emph{Explained} sum-of-squares of the ``predictor'' variability: $\hat{y}[x_i] - \bar{y}$.
+\end{Itemize}
+\NameStatF{} is the ratio of the explained sum-of-squares to
+the residual sum-of-squares, each divided by their corresponding degrees
+of freedom:
+\begin{equation*}
+F \,\,=\,\, 
+\frac{\sum_{x}\, \mathop{\mathrm{freq}}(x) \, \big(\hat{y}[x] - \bar{y}\big)^2 \,\big/\,\, (k\,{-}\,1)}%
+{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2 \,\big/\,\, (n\,{-}\,k)} \,\,=\,\,
+\frac{n\,{-}\,k}{k\,{-}\,1} \cdot \frac{\eta^2}{1 - \eta^2}
+\end{equation*}
+Here $k$ is the domain size of the nominal feature~$x$.  The $k$ ``predictors''
+lose 1~freedom due to their linear dependence with~$\bar{y}$; similarly,
+the $n$~$y_i$-s lose $k$~freedoms due to the ``predictors''.
+
+The statistic can test if the independence hypothesis of $y$ from $x$ is reasonable;
+more generally (with relaxed normality assumptions) it can test the hypothesis that
+\emph{the mean} of $y$ among records with a given~$x$ is the same for all~$x$.
+Under this hypothesis \NameStatF{} has, or approximates, the $F(k\,{-}\,1, n\,{-}\,k)$-distribution.
+But if the mean of $y$ given $x$ depends on~$x$, \NameStatF{}
+becomes arbitrarily large as $n\to\infty$ (with $k$~fixed) and lands extremely far
+into the tail of the $F(k\,{-}\,1, n\,{-}\,k)$-distribution given a large enough data sample.
+\end{Description}
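+
+As a small worked illustration (with made-up values), let the nominal feature take $k = 2$ values,
+with scale values $y = (1, 3)$ for the first category and $y = (5, 7)$ for the second, so $n = 4$,
+$\hat{y}[1] = 2$, $\hat{y}[2] = 6$, and $\bar{y} = 4$.  The residual sum-of-squares is
+$1 + 1 + 1 + 1 = 4$ and the total sum-of-squares is $9 + 1 + 1 + 9 = 20$, so that
+$\eta^2 = 1 - 4/20 = 0.8$, $\eta \approx 0.89$, and
+$F = \frac{n-k}{k-1}\cdot\frac{\eta^2}{1-\eta^2} = 2\cdot 4 = 8$.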
+
+
+\paragraph{Ordinal-vs-ordinal statistics.}
+Sample statistics that describe association between two ordinal categorical features.
+Both features' value domains are encoded with positive integers, so that the natural
+order of the integers coincides with the order in each value domain.
+\begin{Description}
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it\NameStatRho]:
+A measure for the strength of association (statistical dependence) between
+two ordinal features, conceptually similar to \NameStatR.  Specifically, it is \NameStatR{}
+applied to the feature vectors in which all values are replaced by their ranks, i.e.\ 
+their positions if the vector is sorted.  The ranks of identical (duplicate) values
+are replaced with their average rank.  For example, in vector
+$(15, 11, 26, 15, 8)$ the value ``15'' occurs twice with ranks 3 and~4 per the sorted
+order $(8_1, 11_2, 15_3, 15_4, 26_5)$; so, both values are assigned their average
+rank of $3.5 = (3\,{+}\,4)\,{/}\,2$ and the vector is replaced by~$(3.5,\, 2,\, 5,\, 3.5,\, 1)$.
+
+Our implementation of \NameStatRho{} is geared towards features having small value domains
+and large counts for the values.  Given the two input vectors, we form a contingency table $T$
+of pairwise frequency counts, as well as a vector of frequency counts for each feature: $f_1$
+and~$f_2$.  Here in $T_{i,j}$, $f_{1,i}$, $f_{2,j}$ indices $i$ and~$j$ refer to the
+order-preserving integer encoding of the feature values.
+We use prefix sums over $f_1$ and~$f_2$ to compute the values' average ranks:
+$r_{1,i} = \sum_{j=1}^{i-1} f_{1,j} + (f_{1,i}\,{+}\,1){/}2$, and analogously for~$r_2$.
+Finally, we compute rank variances for $r_1, r_2$ weighted by counts $f_1, f_2$ and their
+covariance weighted by~$T$, before applying the standard formula for \NameStatR:
+\begin{equation*}
+\rho \,\,=\,\, \frac{\Cov_T(r_1, r_2)}{\sqrt{\Var_{f_1}(r_1)\Var_{f_2}(r_2)}}
+\,\,=\,\, \frac{\sum_{i,j} T_{i,j} (r_{1,i} - \bar{r}_1) (r_{2,j} - \bar{r}_2)}%
+{\sqrt{\sum_i f_{1,i} (r_{1,i} - \bar{r}_1)^{2\mathstrut} \cdot \sum_j f_{2,j} (r_{2,j} - \bar{r}_2)^{2\mathstrut}}}
+\end{equation*}
+where $\bar{r}_1 = \sum_i r_{1,i} f_{1,i}{/}n$, analogously for~$\bar{r}_2$.
+The value of $\rho$ lies between $-1$ and $+1$, with sign indicating the prevalent direction
+of the association: $\rho > 0$ ($\rho < 0$) means that one feature tends to increase (decrease)
+when the other feature increases.  The correlation becomes~1 when the two features are
+monotonically related.
+\end{Description}
+
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+A collection of (potentially) 4 matrices.  Each matrix contains bivariate statistics that
+resulted from a different combination of feature types.  There is one matrix for scale-scale
+statistics (which includes \NameStatR), one for nominal-nominal statistics (includes \NameStatChi{}),
+one for nominal-scale statistics (includes \NameStatF) and one for ordinal-ordinal statistics
+(includes \NameStatRho).  If any of these matrices is not produced, then no pair of columns required
+the corresponding type combination.  See Table~\ref{table:bivars} for the matrix naming and
+format details.
+
+
+\smallskip
+\pagebreak[2]
+
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+\hml -f \BivarScriptName{} -nvargs
+X=/user/biadmin/X.mtx 
+index1=/user/biadmin/S1.mtx 
+index2=/user/biadmin/S2.mtx 
+types1=/user/biadmin/K1.mtx 
+types2=/user/biadmin/K2.mtx 
+OUTDIR=/user/biadmin/stats.mtx
+
+}
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/DescriptiveStats.tex
----------------------------------------------------------------------
diff --git a/alg-ref/DescriptiveStats.tex b/alg-ref/DescriptiveStats.tex
new file mode 100644
index 0000000..5a59ad4
--- /dev/null
+++ b/alg-ref/DescriptiveStats.tex
@@ -0,0 +1,115 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\newcommand{\UnivarScriptName}{\texttt{\tt Univar-Stats.dml}}
+\newcommand{\BivarScriptName}{\texttt{\tt bivar-stats.dml}}
+
+\newcommand{\OutputRowIDMinimum}{1}
+\newcommand{\OutputRowIDMaximum}{2}
+\newcommand{\OutputRowIDRange}{3}
+\newcommand{\OutputRowIDMean}{4}
+\newcommand{\OutputRowIDVariance}{5}
+\newcommand{\OutputRowIDStDeviation}{6}
+\newcommand{\OutputRowIDStErrorMean}{7}
+\newcommand{\OutputRowIDCoeffVar}{8}
+\newcommand{\OutputRowIDQuartiles}{?, 13, ?}
+\newcommand{\OutputRowIDMedian}{13}
+\newcommand{\OutputRowIDIQMean}{14}
+\newcommand{\OutputRowIDSkewness}{9}
+\newcommand{\OutputRowIDKurtosis}{10}
+\newcommand{\OutputRowIDStErrorSkewness}{11}
+\newcommand{\OutputRowIDStErrorCurtosis}{12}
+\newcommand{\OutputRowIDNumCategories}{15}
+\newcommand{\OutputRowIDMode}{16}
+\newcommand{\OutputRowIDNumModes}{17}
+\newcommand{\OutputRowText}[1]{\mbox{(output row~{#1})\hspace{0.5pt}:}}
+
+\newcommand{\NameStatR}{Pearson's correlation coefficient}
+\newcommand{\NameStatChi}{Pearson's~$\chi^2$}
+\newcommand{\NameStatPChi}{$P\textrm{-}$value of Pearson's~$\chi^2$}
+\newcommand{\NameStatV}{Cram\'er's~$V$}
+\newcommand{\NameStatEta}{Eta statistic}
+\newcommand{\NameStatF}{$F$~statistic}
+\newcommand{\NameStatRho}{Spearman's rank correlation coefficient}
+
+Descriptive statistics are used to quantitatively describe the main characteristics of the data.
+They provide meaningful summaries computed over different observations or data records
+collected in a study.  These summaries typically form the basis of the initial data exploration
+as part of a more extensive statistical analysis.  Such a quantitative analysis assumes that
+every variable (also known as an attribute, feature, or column) in the data has a specific
+\emph{level of measurement}~\cite{Stevens1946:scales}.
+
+The measurement level of a variable, often called the {\bf variable type}, can be either
+\emph{scale} or \emph{categorical}.  A \emph{scale} variable represents the data measured on
+an interval scale or ratio scale.  Examples of scale variables include `Height', `Weight',
+`Salary', and `Temperature'.  Scale variables are also referred to as \emph{quantitative}
+or \emph{continuous} variables.  In contrast, a \emph{categorical} variable has a fixed
+limited number of distinct values or categories.  Examples of categorical variables
+include `Gender', `Region', `Hair color', `Zipcode', and `Level of Satisfaction'.
+Categorical variables can further be classified into two types, \emph{nominal} and
+\emph{ordinal}, depending on whether the categories in the variable can be ordered via an
+intrinsic ranking.  For example, there is no meaningful ranking among distinct values in
+`Hair color' variable, while the categories in `Level of Satisfaction' can be ranked from
+highly dissatisfied to highly satisfied.
+
+The input dataset for descriptive statistics is provided in the form of a matrix, whose
+rows are the records (data points) and whose columns are the features (i.e.~variables).
+Some scripts allow this matrix to be vertically split into two or three matrices.  Descriptive
+statistics are computed over the specified features (columns) in the matrix.  Which
+statistics are computed depends on the types of the features.  It is important to keep
+in mind the following caveats and restrictions:
+\begin{Enumerate}
+\item  Given a finite set of data records, i.e.~a \emph{sample}, we take their feature
+values and compute their \emph{sample statistics}.  These statistics
+will vary from sample to sample even if the underlying distribution of feature values
+remains the same.  Sample statistics are accurate for the given sample only.
+If the goal is to estimate the \emph{distribution statistics} that are parameters of
+the (hypothesized) underlying distribution of the features, the corresponding sample
+statistics may sometimes be used as approximations, but their accuracy will vary.
+\item  In particular, the accuracy of the estimated distribution statistics will be low
+if the number of values in the sample is small.  That is, for small samples, the computed
+statistics may depend on the randomness of the individual sample values more than on
+the underlying distribution of the features.
+\item  The accuracy will also be low if the sample records cannot be assumed mutually
+independent and identically distributed (i.i.d.), that is, sampled at random from the
+same underlying distribution.  In practice, feature values in one record often depend
+on other features and other records, including unknown ones.
+\item  Most of the computed statistics will have low estimation accuracy in the presence of
+extreme values (outliers) or if the underlying distribution has heavy tails, for example
+obeys a power law.  However, a few of the computed statistics, such as the median and
+\NameStatRho{}, are \emph{robust} to outliers.
+\item  Some sample statistics are reported with their \emph{sample standard errors}
+in an attempt to quantify their accuracy as distribution parameter estimators.  But these
+sample standard errors, in turn, only estimate the underlying distribution's standard
+errors and will have low accuracy for small or \mbox{non-i.i.d.} samples, outliers in samples,
+or heavy-tailed distributions.
+\item  We assume that the quantitative (scale) feature columns do not contain missing
+values, infinite values, \texttt{NaN}s, or coded non-numeric values, unless otherwise
+specified.  We assume that each categorical feature column contains positive integers
+from 1 to the number of categories; for ordinal features, the natural order on
+the integers should coincide with the order on the categories.
+\end{Enumerate}
+
+\input{DescriptiveUnivarStats}
+
+\input{DescriptiveBivarStats}
+
+\input{DescriptiveStratStats}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/DescriptiveStratStats.tex
----------------------------------------------------------------------
diff --git a/alg-ref/DescriptiveStratStats.tex b/alg-ref/DescriptiveStratStats.tex
new file mode 100644
index 0000000..be0cffd
--- /dev/null
+++ b/alg-ref/DescriptiveStratStats.tex
@@ -0,0 +1,306 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Stratified Bivariate Statistics}
+
+\noindent{\bf Description}
+\smallskip
+
+The {\tt stratstats.dml} script computes common bivariate statistics, such
+as correlation, slope, and their p-value, in parallel for many pairs of input
+variables in the presence of a confounding categorical variable.  The values
+of this confounding variable group the records into strata (subpopulations),
+in which all bivariate pairs are assumed free of confounding.  The script
+uses the same data model as in one-way analysis of covariance (ANCOVA), with
+strata representing population samples.  It also outputs univariate stratified
+and bivariate unstratified statistics.
+
+\begin{table}[t]\hfil
+\begin{tabular}{|l|ll|ll|ll||ll|}
+\hline
+Month of the year & \multicolumn{2}{l|}{October} & \multicolumn{2}{l|}{November} &
+    \multicolumn{2}{l||}{December} & \multicolumn{2}{c|}{Oct$\,$--$\,$Dec} \\
+Customers, millions    & 0.6 & 1.4 & 1.4 & 0.6 & 3.0 & 1.0 & 5.0 & 3.0 \\
+\hline
+Promotion (0 or 1)     & 0   & 1   & 0   & 1   & 0   & 1   & 0   & 1   \\
+Avg.\ sales per 1000   & 0.4 & 0.5 & 0.9 & 1.0 & 2.5 & 2.6 & 1.8 & 1.3 \\
+\hline
+\end{tabular}\hfil
+\caption{Stratification example: the effect of the promotion on average sales
+becomes reversed and amplified (from $+0.1$ to $-0.5$) if we ignore the months.}
+\label{table:stratexample}
+\end{table}
+
+To see how data stratification mitigates confounding, consider an (artificial)
+example in Table~\ref{table:stratexample}.  A highly seasonal retail item
+was marketed with and without a promotion over the final 3~months of the year.
+In each month the sale was more likely with the promotion than without it.
+But during the peak holiday season, when shoppers came in greater numbers and
+bought the item more often, the promotion was less frequently used.  As a result,
+if the 4-th quarter data is pooled together, the promotion's effect becomes
+reversed and magnified.  Stratifying by month restores the positive correlation.
+
+The script computes its statistics in parallel over all possible pairs from two
+specified sets of covariates.  The 1-st covariate is a column in input matrix~$X$
+and the 2-nd covariate is a column in input matrix~$Y$; matrices $X$ and~$Y$ may
+be the same or different.  The columns of interest are given by their index numbers
+in special matrices.  The stratum column, specified in its own matrix, is the same
+for all covariate pairs.
+
+Both covariates in each pair must be numerical, with the 2-nd covariate normally
+distributed given the 1-st covariate (see~Details).  Missing covariate values or
+strata are represented by~``NaN''.  Records with NaN's are selectively omitted
+wherever their NaN's are material to the output statistic.
+
+\smallskip
+\pagebreak[3]
+
+\noindent{\bf Usage}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}stratstats.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} Xcid=}path/file
+{\tt{} Y=}path/file
+{\tt{} Ycid=}path/file
+{\tt{} S=}path/file
+{\tt{} Scid=}int
+{\tt{} O=}path/file
+{\tt{} fmt=}format
+
+}
+
+
+\smallskip
+\noindent{\bf Arguments}
+\begin{Description}
+\item[{\tt X}:]
+Location (on HDFS) to read matrix $X$ whose columns we want to use as
+the 1-st covariate (i.e.~as the feature variable)
+\item[{\tt Xcid}:] (default:\mbox{ }{\tt " "})
+Location to read the single-row matrix that lists all index numbers
+of the $X$-columns used as the 1-st covariate; the default value means
+``use all $X$-columns''
+\item[{\tt Y}:] (default:\mbox{ }{\tt " "})
+Location to read matrix $Y$ whose columns we want to use as the 2-nd
+covariate (i.e.~as the response variable); the default value means
+``use $X$ in place of~$Y$''
+\item[{\tt Ycid}:] (default:\mbox{ }{\tt " "})
+Location to read the single-row matrix that lists all index numbers
+of the $Y$-columns used as the 2-nd covariate; the default value means
+``use all $Y$-columns''
+\item[{\tt S}:] (default:\mbox{ }{\tt " "})
+Location to read matrix $S$ that has the stratum column.
+Note: the stratum column must contain small positive integers; all fractional
+values are rounded; stratum IDs of value ${\leq}\,0$ or NaN are treated as
+missing.  The default value for {\tt S} means ``use $X$ in place of~$S$''
+\item[{\tt Scid}:] (default:\mbox{ }{\tt 1})
+The index number of the stratum column in~$S$
+\item[{\tt O}:]
+Location to store the output matrix defined in Table~\ref{table:stratoutput}
+\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+see read/write functions in SystemML Language Reference for details.
+\end{Description}
+
+
+\begin{table}[t]\small\hfil
+\begin{tabular}{|rcl|rcl|}
+\hline
+& Col.\# & Meaning & & Col.\# & Meaning \\
+\hline
+\multirow{9}{*}{\begin{sideways}1-st covariate\end{sideways}}\hspace{-1em}
+& 01     & $X$-column number                & 
+\multirow{9}{*}{\begin{sideways}2-nd covariate\end{sideways}}\hspace{-1em}
+& 11     & $Y$-column number                \\
+& 02     & presence count for $x$           & 
+& 12     & presence count for $y$           \\
+& 03     & global mean $(x)$                & 
+& 13     & global mean $(y)$                \\
+& 04     & global std.\ dev. $(x)$          & 
+& 14     & global std.\ dev. $(y)$          \\
+& 05     & stratified std.\ dev. $(x)$      & 
+& 15     & stratified std.\ dev. $(y)$      \\
+& 06     & $R^2$ for $x \sim {}$strata      & 
+& 16     & $R^2$ for $y \sim {}$strata      \\
+& 07     & adjusted $R^2$ for $x \sim {}$strata      & 
+& 17     & adjusted $R^2$ for $y \sim {}$strata      \\
+& 08     & p-value, $x \sim {}$strata       & 
+& 18     & p-value, $y \sim {}$strata       \\
+& 09--10 & reserved                         & 
+& 19--20 & reserved                         \\
+\hline
+\multirow{9}{*}{\begin{sideways}$y\sim x$, NO strata\end{sideways}}\hspace{-1.15em}
+& 21     & presence count $(x, y)$          &
+\multirow{10}{*}{\begin{sideways}$y\sim x$ AND strata$\!\!\!\!$\end{sideways}}\hspace{-1.15em}
+& 31     & presence count $(x, y, s)$       \\
+& 22     & regression slope                 &
+& 32     & regression slope                 \\
+& 23     & regres.\ slope std.\ dev.        &
+& 33     & regres.\ slope std.\ dev.        \\
+& 24     & correlation${} = \pm\sqrt{R^2}$  &
+& 34     & correlation${} = \pm\sqrt{R^2}$  \\
+& 25     & residual std.\ dev.              &
+& 35     & residual std.\ dev.              \\
+& 26     & $R^2$ in $y$ due to $x$          &
+& 36     & $R^2$ in $y$ due to $x$          \\
+& 27     & adjusted $R^2$ in $y$ due to $x$ &
+& 37     & adjusted $R^2$ in $y$ due to $x$ \\
+& 28     & p-value for ``slope = 0''        &
+& 38     & p-value for ``slope = 0''        \\
+& 29     & reserved                         &
+& 39     & \# strata with ${\geq}\,2$ count \\
+& 30     & reserved                         &
+& 40     & reserved                         \\
+\hline
+\end{tabular}\hfil
+\caption{The {\tt stratstats.dml} output matrix has one row per each distinct
+pair of 1-st and 2-nd covariates, and 40 columns with the statistics described
+here.}
+\label{table:stratoutput}
+\end{table}
+
+
+
+
+\noindent{\bf Details}
+\smallskip
+
+Suppose we have $n$ records of format $(i, x, y)$, where $i\in\{1,\ldots, k\}$ is
+a stratum number and $(x, y)$ are two numerical covariates.  We want to analyze
+conditional linear relationship between $y$ and $x$ conditioned by~$i$.
+Note that $x$, but not~$y$, may represent a categorical variable if we assign a
+numerical value to each category, for example 0 and 1 for two categories.
+
+We assume a linear regression model for~$y$:
+\begin{equation}
+y_{i,j} \,=\, \alpha_i + \beta x_{i,j} + \eps_{i,j}\,, \quad\textrm{where}\,\,\,\,
+\eps_{i,j} \sim \Normal(0, \sigma^2)
+\label{eqn:stratlinmodel}
+\end{equation}
+Here $i = 1\ldots k$ is a stratum number and $j = 1\ldots n_i$ is a record number
+in stratum~$i$; by $n_i$ we denote the number of records available in stratum~$i$.
+The noise term~$\eps_{i,j}$ is assumed to have the same variance in all strata.
+When $n_i\,{>}\,0$, we can estimate the means of $x_{i, j}$ and $y_{i, j}$ in
+stratum~$i$ as
+\begin{equation*}
+\bar{x}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,x_{i, j}\Big) / n_i\,;\quad
+\bar{y}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,y_{i, j}\Big) / n_i
+\end{equation*}
+If $\beta$ is known, the best estimate for $\alpha_i$ is $\bar{y}_i - \beta \bar{x}_i$,
+which gives the prediction error sum-of-squares of
+\begin{equation}
+\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \beta x_{i,j} - (\bar{y}_i - \beta \bar{x}_i)\big)^2
+\,\,=\,\, \beta^{2\,}V_x \,-\, 2\beta \,V_{x,y} \,+\, V_y
+\label{eqn:stratsumsq}
+\end{equation}
+where $V_x$, $V_y$, and $V_{x, y}$ are, correspondingly, the ``stratified'' sample
+estimates of variance $\Var(x)$ and $\Var(y)$ and covariance $\Cov(x,y)$ computed as
+\begin{align*}
+V_x     \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)^2; \quad
+V_y     \,=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \bar{y}_i\big)^2;\\
+V_{x,y} \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)\big(y_{i,j} - \bar{y}_i\big)
+\end{align*}
+They are stratified because we compute the sample (co-)variances in each stratum~$i$
+separately, then combine by summation.  The stratified estimates for $\Var(X)$ and $\Var(Y)$
+tend to be smaller than the non-stratified ones (with the global mean instead of $\bar{x}_i$
+and~$\bar{y}_i$) since $\bar{x}_i$ and $\bar{y}_i$ fit closer to $x_{i,j}$ and $y_{i,j}$
+than the global means.  The stratified variance estimates the uncertainty in $x_{i,j}$ 
+and~$y_{i,j}$ given their stratum~$i$.
+
+Minimizing over~$\beta$ the error sum-of-squares~(\ref{eqn:stratsumsq})
+gives us the regression slope estimate \mbox{$\hat{\beta} = V_{x,y} / V_x$},
+with~(\ref{eqn:stratsumsq}) becoming the residual sum-of-squares~(RSS):
+\begin{equation*}
+\mathrm{RSS} \,\,=\, \,
+\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - 
+\hat{\beta} x_{i,j} - (\bar{y}_i - \hat{\beta} \bar{x}_i)\big)^2
+\,\,=\,\,  V_y \,\big(1 \,-\, V_{x,y}^2 / (V_x V_y)\big)
+\end{equation*}
+The quantity $\hat{R}^2 = V_{x,y}^2 / (V_x V_y)$, called \emph{$R$-squared}, estimates the fraction
+of stratified variance in~$y_{i,j}$ explained by covariate $x_{i, j}$ in the linear 
+regression model~(\ref{eqn:stratlinmodel}).  We define \emph{stratified correlation} as the
+square root of~$\hat{R}^2$ taken with the sign of~$V_{x,y}$.  We also use RSS to estimate
+the residual standard deviation $\sigma$ in~(\ref{eqn:stratlinmodel}) that models the prediction error
+of $y_{i,j}$ given $x_{i,j}$ and the stratum:
+\begin{equation*}
+\hat{\beta}\, =\, \frac{V_{x,y}}{V_x}; \,\,\,\, \hat{R} \,=\, \frac{V_{x,y}}{\sqrt{V_x V_y}};
+\,\,\,\, \hat{R}^2 \,=\, \frac{V_{x,y}^2}{V_x V_y};
+\,\,\,\, \hat{\sigma} \,=\, \sqrt{\frac{\mathrm{RSS}}{n - k - 1}}\,\,\,\,
+\Big(n = \sum_{i=1}^k n_i\Big)
+\end{equation*}
+
+The $t$-test and the $F$-test for the null-hypothesis of ``$\beta = 0$'' are
+obtained by considering the effect of $\hat{\beta}$ on the residual sum-of-squares,
+measured by the decrease from $V_y$ to~RSS.
+The $F$-statistic is the ratio of the ``explained'' sum-of-squares
+to the residual sum-of-squares, divided by their corresponding degrees of freedom.
+There are $n\,{-}\,k$ degrees of freedom for~$V_y$, parameter $\beta$ reduces that
+to $n\,{-}\,k\,{-}\,1$ for~RSS, and their difference $V_y - {}$RSS has just 1 degree
+of freedom:
+\begin{equation*}
+F \,=\, \frac{(V_y - \mathrm{RSS})/1}{\mathrm{RSS}/(n\,{-}\,k\,{-}\,1)}
+\,=\, \frac{\hat{R}^2\,(n\,{-}\,k\,{-}\,1)}{1-\hat{R}^2}; \quad
+t \,=\, \hat{R}\, \sqrt{\frac{n\,{-}\,k\,{-}\,1}{1-\hat{R}^2}}.
+\end{equation*}
+The $t$-statistic is simply the square root of the $F$-statistic with the appropriate
+choice of sign.  If the null hypothesis and the linear model are both true, the $t$-statistic
+has Student $t$-distribution with $n\,{-}\,k\,{-}\,1$ degrees of freedom.  We can
+also compute it if we divide $\hat{\beta}$ by its estimated standard deviation:
+\begin{equation*}
+\stdev(\hat{\beta})_{\mathrm{est}} \,=\, \hat{\sigma}\,/\sqrt{V_x} \quad\Longrightarrow\quad
+t \,=\, \hat{R}\sqrt{V_y} \,/\, \hat{\sigma} \,=\, \hat{\beta} \,/\, \stdev(\hat{\beta})_{\mathrm{est}}
+\end{equation*}
+The standard deviation estimate for~$\beta$ is included in {\tt stratstats.dml} output.
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+The output matrix format is defined in Table~\ref{table:stratoutput}.
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+\hml -f stratstats.dml -nvargs X=/user/biadmin/X.mtx Xcid=/user/biadmin/Xcid.mtx
+  Y=/user/biadmin/Y.mtx Ycid=/user/biadmin/Ycid.mtx S=/user/biadmin/S.mtx Scid=2
+  O=/user/biadmin/Out.mtx fmt=csv
+
+}
+{\hangindent=\parindent\noindent\tt
+\hml -f stratstats.dml -nvargs X=/user/biadmin/Data.mtx Xcid=/user/biadmin/Xcid.mtx
+  Ycid=/user/biadmin/Ycid.mtx Scid=7 O=/user/biadmin/Out.mtx
+
+}
+
+%\smallskip
+%\noindent{\bf See Also}
+%\smallskip
+%
+%For non-stratified bivariate statistics with a wider variety of input data types
+%and statistical tests, see \ldots.  For general linear regression, see
+%{\tt LinearRegDS.dml} and {\tt LinearRegCG.dml}.  For logistic regression, appropriate
+%when the response variable is categorical, see {\tt MultiLogReg.dml}.
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/DescriptiveUnivarStats.tex
----------------------------------------------------------------------
diff --git a/alg-ref/DescriptiveUnivarStats.tex b/alg-ref/DescriptiveUnivarStats.tex
new file mode 100644
index 0000000..5838e3e
--- /dev/null
+++ b/alg-ref/DescriptiveUnivarStats.tex
@@ -0,0 +1,603 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Univariate Statistics}
+
+\noindent{\bf Description}
+\smallskip
+
+\emph{Univariate statistics} are the simplest form of descriptive statistics in data
+analysis.  They are used to quantitatively describe the main characteristics of each
+feature in the data.  For a given dataset matrix, script \UnivarScriptName{} computes
+certain univariate statistics for each feature column in the
+matrix.  The feature type governs the exact set of statistics computed for that feature.
+For example, the statistic \emph{mean} can only be computed on a quantitative (scale)
+feature like `Height' and `Temperature'.  It does not make sense to compute the mean
+of a categorical attribute like `Hair Color'.
+
+
+\smallskip
+\noindent{\bf Usage}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%\tolerance=0
+{\tt{}-f } \UnivarScriptName{}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} TYPES=}path/file
+{\tt{} STATS=}path/file
+% {\tt{} fmt=}format
+
+}
+
+
+\medskip
+\pagebreak[2]
+\noindent{\bf Arguments}
+\begin{Description}
+\item[{\tt X}:]
+Location (on HDFS) to read the data matrix $X$ whose columns we want to
+analyze as the features.
+\item[{\tt TYPES}:] % (default:\mbox{ }{\tt " "})
+Location (on HDFS) to read the single-row matrix whose $i^{\textrm{th}}$
+column-cell contains the type of the $i^{\textrm{th}}$ feature column
+\texttt{X[,$\,i$]} in the data matrix.  Feature types must be encoded by
+integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
+% The default value means ``treat all $X$-columns as scale.''
+\item[{\tt STATS}:]
+Location (on HDFS) where the output matrix of computed statistics
+will be stored.  The format of the output matrix is defined by
+Table~\ref{table:univars}.
+% \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+% Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+% see read/write functions in SystemML Language Reference for details.
+\end{Description}
+
+\begin{table}[t]\hfil
+\begin{tabular}{|rl|c|c|}
+\hline
+\multirow{2}{*}{Row}& \multirow{2}{*}{Name of Statistic} & \multicolumn{2}{c|}{Applies to:} \\
+                            &                            & Scale & Categ.\\
+\hline
+\OutputRowIDMinimum         & Minimum                    &   +   &       \\
+\OutputRowIDMaximum         & Maximum                    &   +   &       \\
+\OutputRowIDRange           & Range                      &   +   &       \\
+\OutputRowIDMean            & Mean                       &   +   &       \\
+\OutputRowIDVariance        & Variance                   &   +   &       \\
+\OutputRowIDStDeviation     & Standard deviation         &   +   &       \\
+\OutputRowIDStErrorMean     & Standard error of mean     &   +   &       \\
+\OutputRowIDCoeffVar        & Coefficient of variation   &   +   &       \\
+\OutputRowIDSkewness        & Skewness                   &   +   &       \\
+\OutputRowIDKurtosis        & Kurtosis                   &   +   &       \\
+\OutputRowIDStErrorSkewness & Standard error of skewness &   +   &       \\
+\OutputRowIDStErrorCurtosis & Standard error of kurtosis &   +   &       \\
+\OutputRowIDMedian          & Median                     &   +   &       \\
+\OutputRowIDIQMean          & Interquartile mean         &   +   &       \\
+\OutputRowIDNumCategories   & Number of categories       &       &   +   \\
+\OutputRowIDMode            & Mode                       &       &   +   \\
+\OutputRowIDNumModes        & Number of modes            &       &   +   \\
+\hline
+\end{tabular}\hfil
+\caption{The output matrix of \UnivarScriptName{} has one row per
+univariate statistic and one column per input feature.  This table lists
+the meaning of each row.  A ``+'' sign indicates that the statistic applies
+to scale and/or to categorical features.}
+\label{table:univars}
+\end{table}
+
+
+\pagebreak[1]
+
+\smallskip
+\noindent{\bf Details}
+\smallskip
+
+Given an input matrix \texttt{X}, this script computes the set of all
+relevant univariate statistics for each feature column \texttt{X[,$\,i$]}
+in~\texttt{X}.  The list of statistics to be computed depends on the
+\emph{type}, or \emph{measurement level}, of each column.
+The \textrm{TYPES} command-line argument points to a vector containing
+the types of all columns.  The types must be provided as per the following
+convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
+
+Below we list all univariate statistics computed by script \UnivarScriptName.
+The statistics are collected by relevance into several groups, namely: central
+tendency, dispersion, shape, and categorical measures.  The first three groups
+contain statistics computed for a quantitative (also known as: numerical, scale,
+or continuous) feature; the last group contains the statistics for a categorical
+(either nominal or ordinal) feature.  
+
+Let~$n$ be the number of data records (rows) with feature values.
+In what follows we fix a column index \texttt{idx} and consider
+sample statistics of feature column \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}.
+Let $v = (v_1, v_2, \ldots, v_n)$ be the values of \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}
+in their original unsorted order: $v_i = \texttt{X[}i\texttt{,}\,\texttt{idx]}$.
+Let $v^s = (v^s_1, v^s_2, \ldots, v^s_n)$ be the same values in the sorted order,
+preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
+
+\paragraph{Central tendency measures.}
+Sample statistics that describe the location of a quantitative (scale) feature
+distribution by representing it with a single value.
+\begin{Description}
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Mean]
+\OutputRowText{\OutputRowIDMean}
+The arithmetic average over a sample of a quantitative feature.
+Computed as the ratio between the sum of values and the number of values:
+$\left(\sum_{i=1}^n v_i\right)\!/n$.
+Example: the mean of sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+equals~5.2.
+
+Note that the mean is significantly affected by extreme values in the sample
+and may be misleading as a central tendency measure if the feature varies on
+exponential scale.  For example, the mean of $\{$0.01, 0.1, 1.0, 10.0, 100.0$\}$
+is 22.222, greater than all the sample values except the~largest.
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\begin{figure}[t]
+\setlength{\unitlength}{10pt}
+\begin{picture}(33,12)
+\put( 6.2, 0.0){\small 2.2}
+\put(10.2, 0.0){\small 3.2}
+\put(12.2, 0.0){\small 3.7}
+\put(15.0, 0.0){\small 4.4}
+\put(18.6, 0.0){\small 5.3}
+\put(20.2, 0.0){\small 5.7}
+\put(21.75,0.0){\small 6.1}
+\put(23.05,0.0){\small 6.4}
+\put(26.2, 0.0){\small 7.2}
+\put(28.6, 0.0){\small 7.8}
+\put( 0.5, 0.7){\small 0.0}
+\put( 0.1, 3.2){\small 0.25}
+\put( 0.5, 5.7){\small 0.5}
+\put( 0.1, 8.2){\small 0.75}
+\put( 0.5,10.7){\small 1.0}
+\linethickness{1.5pt}
+\put( 2.0, 1.0){\line(1,0){4.8}}
+\put( 6.8, 1.0){\line(0,1){1.0}}
+\put( 6.8, 2.0){\line(1,0){4.0}}
+\put(10.8, 2.0){\line(0,1){1.0}}
+\put(10.8, 3.0){\line(1,0){2.0}}
+\put(12.8, 3.0){\line(0,1){1.0}}
+\put(12.8, 4.0){\line(1,0){2.8}}
+\put(15.6, 4.0){\line(0,1){1.0}}
+\put(15.6, 5.0){\line(1,0){3.6}}
+\put(19.2, 5.0){\line(0,1){1.0}}
+\put(19.2, 6.0){\line(1,0){1.6}}
+\put(20.8, 6.0){\line(0,1){1.0}}
+\put(20.8, 7.0){\line(1,0){1.6}}
+\put(22.4, 7.0){\line(0,1){1.0}}
+\put(22.4, 8.0){\line(1,0){1.2}}
+\put(23.6, 8.0){\line(0,1){1.0}}
+\put(23.6, 9.0){\line(1,0){3.2}}
+\put(26.8, 9.0){\line(0,1){1.0}}
+\put(26.8,10.0){\line(1,0){2.4}}
+\put(29.2,10.0){\line(0,1){1.0}}
+\put(29.2,11.0){\line(1,0){4.8}}
+\linethickness{0.3pt}
+\put( 6.8, 1.0){\circle*{0.3}}
+\put(10.8, 1.0){\circle*{0.3}}
+\put(12.8, 1.0){\circle*{0.3}}
+\put(15.6, 1.0){\circle*{0.3}}
+\put(19.2, 1.0){\circle*{0.3}}
+\put(20.8, 1.0){\circle*{0.3}}
+\put(22.4, 1.0){\circle*{0.3}}
+\put(23.6, 1.0){\circle*{0.3}}
+\put(26.8, 1.0){\circle*{0.3}}
+\put(29.2, 1.0){\circle*{0.3}}
+\put( 6.8, 1.0){\vector(1,0){27.2}}
+\put( 2.0, 1.0){\vector(0,1){10.8}}
+\put( 2.0, 3.5){\line(1,0){10.8}}
+\put( 2.0, 6.0){\line(1,0){17.2}}
+\put( 2.0, 8.5){\line(1,0){21.6}}
+\put( 2.0,11.0){\line(1,0){27.2}}
+\put(12.8, 1.0){\line(0,1){2.0}}
+\put(19.2, 1.0){\line(0,1){5.0}}
+\put(20.0, 1.0){\line(0,1){5.0}}
+\put(23.6, 1.0){\line(0,1){7.0}}
+\put( 9.0, 4.0){\line(1,0){3.8}}
+\put( 9.2, 2.7){\vector(0,1){0.8}}
+\put( 9.2, 4.8){\vector(0,-1){0.8}}
+\put(19.4, 8.0){\line(1,0){3.0}}
+\put(19.6, 7.2){\vector(0,1){0.8}}
+\put(19.6, 9.3){\vector(0,-1){0.8}}
+\put(13.0, 2.2){\small $q_{25\%}$}
+\put(17.3, 2.2){\small $q_{50\%}$}
+\put(23.8, 2.2){\small $q_{75\%}$}
+\put(20.15,3.5){\small $\mu$}
+\put( 8.0, 3.75){\small $\phi_1$}
+\put(18.35,7.8){\small $\phi_2$}
+\end{picture}
+\caption{The computation of quartiles, median, and interquartile mean from the
+empirical distribution function of the 10-point
+sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$.  Each vertical step in
+the graph has height~$1{/}n = 0.1$.  Values $q_{25\%}$, $q_{50\%}$, and $q_{75\%}$ denote
+the $1^{\textrm{st}}$, $2^{\textrm{nd}}$, and $3^{\textrm{rd}}$ quartiles respectively;
+value~$\mu$ denotes the median.  Values $\phi_1$ and $\phi_2$ show the partial contribution
+of border points (quartiles) $v_3=3.7$ and $v_8=6.4$ to the interquartile mean.}
+\label{fig:example_quartiles}
+\end{figure}
+
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Median]
+\OutputRowText{\OutputRowIDMedian}
+The ``middle'' value that separates the higher half of the sample values
+(in a sorted order) from the lower half.
+To compute the median, we sort the sample in the increasing order, preserving
+duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
+If $n$ is odd, the median equals $v^s_i$ where $i = (n\,{+}\,1)\,{/}\,2$,
+same as the $50^{\textrm{th}}$~percentile of the sample.
+If $n$ is even, there are two ``middle'' values $v^s_{n/2}$ and $v^s_{n/2\,+\,1}$,
+so we compute the median as the mean of these two values.
+(For even~$n$ we compute the $50^{\textrm{th}}$~percentile as~$v^s_{n/2}$,
+not as the median.)  Example: the median of sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+equals $(5.3\,{+}\,5.7)\,{/}\,2$~${=}$~5.5, see Figure~\ref{fig:example_quartiles}.
+
+Unlike the mean, the median is not sensitive to extreme values in the sample,
+i.e.\ it is robust to outliers.  It works better as a measure of central tendency
+for heavy-tailed distributions and features that vary on exponential scale.
+However, the median is sensitive to small sample size.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Interquartile mean]
+\OutputRowText{\OutputRowIDIQMean}
+For a sample of a quantitative feature, this is
+the mean of the values greater than or equal to the $1^{\textrm{st}}$ quartile
+and less than or equal to the $3^{\textrm{rd}}$ quartile.
+In other words, it is a ``truncated mean'' where the lowest 25$\%$ and
+the highest 25$\%$ of the sorted values are omitted in its computation.
+The two ``border values'', i.e.\ the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$
+quartiles themselves, contribute to this mean only partially.
+This measure is occasionally used as the ``robust'' version of the mean
+that is less sensitive to the extreme values.
+
+To compute the measure, we sort the sample in the increasing order,
+preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
+We set $j = \lceil n{/}4 \rceil$ for the $1^{\textrm{st}}$ quartile index
+and $k = \lceil 3n{/}4 \rceil$ for the $3^{\textrm{rd}}$ quartile index,
+then compute the following weighted mean:
+\begin{equation*}
+\frac{1}{3{/}4 - 1{/}4} \left[
+\left(\frac{j}{n} - \frac{1}{4}\right) v^s_j \,\,+ 
+\sum_{j<i<k} \left(\frac{i}{n} - \frac{i\,{-}\,1}{n}\right) v^s_i 
+\,\,+\,\, \left(\frac{3}{4} - \frac{k\,{-}\,1}{n}\right) v^s_k\right]
+\end{equation*}
+In other words, all sample values between the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$
+quartile enter the sum with weights $2{/}n$, times their number of duplicates, while the
+two quartiles themselves enter the sum with reduced weights.  The weights are proportional
+to the vertical steps in the empirical distribution function of the sample, see
+Figure~\ref{fig:example_quartiles} for an illustration.
+Example: the interquartile mean of sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals the sum
+$0.1 (3.7\,{+}\,6.4) + 0.2 (4.4\,{+}\,5.3\,{+}\,5.7\,{+}\,6.1)$,
+which equals~5.31.
+\end{Description}
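As a quick cross-check of the median and interquartile-mean definitions above, here is a minimal NumPy sketch; it is illustrative only (the function names and the use of NumPy are assumptions, not part of the SystemML script):

    import numpy as np

    def median(v):
        # Mean of the two middle values for even n, the middle value for odd n.
        vs = np.sort(v)
        n = len(vs)
        if n % 2 == 1:
            return vs[(n + 1) // 2 - 1]
        return (vs[n // 2 - 1] + vs[n // 2]) / 2.0

    def interquartile_mean(v):
        # Weighted mean between the 1st and 3rd quartiles; the two border
        # values (the quartiles themselves) enter with reduced weights.
        vs = np.sort(v)
        n = len(vs)
        j = int(np.ceil(n / 4.0))      # 1-based index of the 1st quartile
        k = int(np.ceil(3 * n / 4.0))  # 1-based index of the 3rd quartile
        total = (j / n - 0.25) * vs[j - 1] \
              + vs[j:k - 1].sum() / n \
              + (0.75 - (k - 1) / n) * vs[k - 1]
        return total / 0.5             # 1 / (3/4 - 1/4)

    v = np.array([2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8])
    print(round(median(v), 2))              # 5.5
    print(round(interquartile_mean(v), 2))  # 5.31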
+
+
+\paragraph{Dispersion measures.}
+Statistics that describe the amount of variation or spread in a quantitative
+(scale) data feature.
+\begin{Description}
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Variance]
+\OutputRowText{\OutputRowIDVariance}
+A measure of dispersion, or spread, of sample values around their mean,
+expressed in units that are the square of those of the feature itself.
+Computed as the sum of squared differences between the values
+in the sample and their mean, divided by one less than the number of
+values: $\sum_{i=1}^n (v_i - \bar{v})^2\,/\,(n\,{-}\,1)$ where 
+$\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
+Example: the variance of sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~3.24.
+Note that at least two values ($n\geq 2$) are required to avoid division
+by zero.  Sample variance is sensitive to outliers, even more than the mean.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Standard deviation]
+\OutputRowText{\OutputRowIDStDeviation}
+A measure of dispersion around the mean, the square root of variance.
+Computed by taking the square root of the sample variance;
+see \emph{Variance} above on computing the variance.
+Example: the standard deviation of sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~1.8.
+At least two values are required to avoid division by zero.
+Note that standard deviation is sensitive to outliers.  
+
+Standard deviation is used in conjunction with the mean to determine
+an interval containing a given percentage of the feature values,
+assuming the normal distribution.  In a large sample from a normal
+distribution, around 68\% of the cases fall within one standard
+deviation and around 95\% of cases fall within two standard deviations
+of the mean.  For example, if the mean age is 45 with a standard deviation
+of 10, around 95\% of the cases would be between 25 and 65 in a normal
+distribution.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Coefficient of variation]
+\OutputRowText{\OutputRowIDCoeffVar}
+The ratio of the standard deviation to the mean, i.e.\ the
+\emph{relative} standard deviation, of a quantitative feature sample.
+Computed by dividing the sample \emph{standard deviation} by the
+sample \emph{mean}, see above for their computation details.
+Example: the coefficient of variation for sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+equals 1.8$\,{/}\,$5.2~${\approx}$~0.346.
+
+This metric is used primarily with non-negative features such as
+financial or population data.  It is sensitive to outliers.
+Note: zero mean causes division by zero, returning infinity or \texttt{NaN}.
+At least two values (records) are required to compute the standard deviation.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Minimum]
+\OutputRowText{\OutputRowIDMinimum}
+The smallest value of a quantitative sample, computed as $\min v = v^s_1$.
+Example: the minimum of sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+equals~2.2.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Maximum]
+\OutputRowText{\OutputRowIDMaximum}
+The largest value of a quantitative sample, computed as $\max v = v^s_n$.
+Example: the maximum of sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+equals~7.8.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Range]
+\OutputRowText{\OutputRowIDRange}
+The difference between the largest and the smallest value of a quantitative
+sample, computed as $\max v - \min v = v^s_n - v^s_1$.
+It provides information about the overall spread of the sample values.
+Example: the range of sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+equals 7.8$\,{-}\,$2.2~${=}$~5.6.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Standard error of the mean]
+\OutputRowText{\OutputRowIDStErrorMean}
+A measure of how much the value of the sample mean may vary from sample
+to sample taken from the same (hypothesized) distribution of the feature.
+It helps to roughly bound the distribution mean, i.e.\
+the limit of the sample mean as the sample size tends to infinity.
+Under certain assumptions (e.g.\ normality and large sample), the difference
+between the distribution mean and the sample mean is unlikely to exceed
+2~standard errors.
+
+The measure is computed by dividing the sample standard deviation
+by the square root of the number of values~$n$; see \emph{standard deviation}
+for its computation details.  Ensure $n\,{\geq}\,2$ to avoid division by~0.
+Example: for sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+with the mean of~5.2 the standard error of the mean
+equals 1.8$\,{/}\sqrt{10}$~${\approx}$~0.569.
+
+Note that the standard error itself is subject to sample randomness.
+Its accuracy as an error estimator may be low if the sample is small
+or \mbox{non-i.i.d.}, if there are outliers, or if the distribution has
+heavy tails.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+% \item[\it Quartiles]
+% \OutputRowText{\OutputRowIDQuartiles}
+% %%% dsDefn %%%%
+% The values of a quantitative feature
+% that divide an ordered/sorted set of data records into four equal-size groups.
+% The $1^{\textrm{st}}$ quartile, or the $25^{\textrm{th}}$ percentile, splits
+% the sorted data into the lowest $25\%$ and the highest~$75\%$.  In other words,
+% it is the middle value between the minimum and the median.  The $2^{\textrm{nd}}$
+% quartile is the median itself, the value that separates the higher half of
+% the data (in the sorted order) from the lower half.  Finally, the $3^{\textrm{rd}}$
+% quartile, or the $75^{\textrm{th}}$ percentile, divides the sorted data into
+% lowest $75\%$ and highest~$25\%$.\par
+% %%% dsComp %%%%
+% To compute the quartiles for a data column \texttt{X[,i]} with $n$ numerical values
+% we sort it in the increasing order, preserving duplicates, then return 
+% \texttt{X}${}^{\textrm{sort}}$\texttt{[}$k$\texttt{,i]}
+% where $k = \lceil pn \rceil$ for $p = 0.25$, $0.5$, and~$0.75$.
+% When $n$ is even, the $2^{\textrm{nd}}$ quartile (the median) is further adjusted
+% to equal the mean of two middle values
+% $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\texttt{,i]}$ and
+% $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\,{+}\,1\texttt{,i]}$.
+% %%% dsWarn %%%%
+% We assume that the feature column does not contain \texttt{NaN}s or coded non-numeric values.
+% %%% dsExmpl %%%
+% \textbf{Example(s).}
+\end{Description}
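To make the dispersion formulas above concrete, a minimal NumPy sketch (illustrative only, not SystemML code) that reproduces the example values quoted in this section:

    import numpy as np

    v = np.array([2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8])
    n = len(v)

    var = ((v - v.mean()) ** 2).sum() / (n - 1)  # sample variance (needs n >= 2)
    std = np.sqrt(var)                           # standard deviation
    cv = std / v.mean()                          # coefficient of variation
    sem = std / np.sqrt(n)                       # standard error of the mean

    print(round(var, 2), round(std, 1), round(cv, 3), round(sem, 3))
    # 3.24 1.8 0.346 0.569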
+
+
+\paragraph{Shape measures.}
+Statistics that describe the shape and symmetry of the quantitative (scale)
+feature distribution estimated from a sample of its values.
+\begin{Description}
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Skewness]
+\OutputRowText{\OutputRowIDSkewness}
+It measures how symmetrically the values of a feature are spread out
+around the mean.  A significant positive skewness implies a longer (or fatter)
+right tail, i.e. feature values tend to lie farther away from the mean on the
+right side.  A significant negative skewness implies a longer (or fatter) left
+tail.  The normal distribution is symmetric and has a skewness value of~0;
+however, its sample skewness is likely to be nonzero, just close to zero.
+As a guideline, a skewness value more than twice its standard error is taken
+to indicate a departure from symmetry.
+
+Skewness is computed as the $3^{\textrm{rd}}$~central moment divided by the cube
+of the standard deviation.  We estimate the $3^{\textrm{rd}}$~central moment as
+the sum of cubed differences between the values in the feature column and their
+sample mean, divided by the number of values:  
+$\sum_{i=1}^n (v_i - \bar{v})^3 / n$
+where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
+The standard deviation is computed
+as described above in \emph{standard deviation}.  To avoid division by~0,
+at least two different sample values are required.  Example: for sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+with the mean of~5.2 and the standard deviation of~1.8
+skewness is estimated as $-1.0728\,{/}\,1.8^3 \approx -0.184$.
+Note: skewness is sensitive to outliers.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Standard error in skewness]
+\OutputRowText{\OutputRowIDStErrorSkewness}
+A measure of how much the sample skewness may vary from sample to sample,
+assuming that the feature is normally distributed, which makes its
+distribution skewness equal~0.  
+Given the number~$n$ of sample values, the standard error is computed as
+\begin{equation*}
+\sqrt{\frac{6n\,(n-1)}{(n-2)(n+1)(n+3)}}
+\end{equation*}
+This measure can tell us, for example:
+\begin{Itemize}
+\item If the sample skewness lands within two standard errors from~0, its
+positive or negative sign is not significant and may just be accidental.
+\item If the sample skewness lands outside this interval, the feature
+is unlikely to be normally distributed.
+\end{Itemize}
+At least 3~values ($n\geq 3$) are required to avoid arithmetic failure.
+Note that the standard error is inaccurate if the feature distribution is
+far from normal or if the number of samples is small.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Kurtosis]
+\OutputRowText{\OutputRowIDKurtosis}
+As a distribution parameter, kurtosis is a measure of the extent to which
+feature values cluster around a central point.  In other words, it quantifies
+``peakedness'' of the distribution: how tall and sharp the central peak is
+relative to a standard bell curve.
+
+Positive kurtosis (\emph{leptokurtic} distribution) indicates that, relative
+to a normal distribution:
+\begin{Itemize}
+\item observations cluster more about the center (peak-shaped),
+\item the tails are thinner at non-extreme values, 
+\item the tails are thicker at extreme values.
+\end{Itemize}
+Negative kurtosis (\emph{platykurtic} distribution) indicates that, relative
+to a normal distribution:
+\begin{Itemize}
+\item observations cluster less about the center (box-shaped),
+\item the tails are thicker at non-extreme values, 
+\item the tails are thinner at extreme values.
+\end{Itemize}
+Kurtosis of a normal distribution is zero; however, the sample kurtosis
+(computed here) is likely to deviate from zero.
+
+Sample kurtosis is computed as the $4^{\textrm{th}}$~central moment divided
+by the $4^{\textrm{th}}$~power of the standard deviation, minus~3.
+We estimate the $4^{\textrm{th}}$~central moment as the sum of the
+$4^{\textrm{th}}$~powers of differences between the values in the feature column
+and their sample mean, divided by the number of values:
+$\sum_{i=1}^n (v_i - \bar{v})^4 / n$
+where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
+The standard deviation is computed as described above, see \emph{standard deviation}.
+
+Note that kurtosis is sensitive to outliers, and requires at least two different
+sample values.  Example: for sample
+$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
+with the mean of~5.2 and the standard deviation of~1.8,
+sample kurtosis equals $16.6962\,{/}\,1.8^4 - 3 \approx -1.41$.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Standard error in kurtosis]
+\OutputRowText{\OutputRowIDStErrorCurtosis}
+A measure of how much the sample kurtosis may vary from sample to sample,
+assuming that the feature is normally distributed, which makes its
+distribution kurtosis equal~0.
+Given the number~$n$ of sample values, the standard error is computed as
+\begin{equation*}
+\sqrt{\frac{24n\,(n-1)^2}{(n-3)(n-2)(n+3)(n+5)}}
+\end{equation*}
+This measure can tell us, for example:
+\begin{Itemize}
+\item If the sample kurtosis lands within two standard errors from~0, its
+positive or negative sign is not significant and may just be accidental.
+\item If the sample kurtosis lands outside this interval, the feature
+is unlikely to be normally distributed.
+\end{Itemize}
+At least 4~values ($n\geq 4$) are required to avoid arithmetic failure.
+Note that the standard error is inaccurate if the feature distribution is
+far from normal or if the number of samples is small.
+\end{Description}
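The moment-based skewness and kurtosis estimates above, together with their standard errors, can likewise be sketched in NumPy (illustrative only, not SystemML code):

    import numpy as np

    def skewness(v):
        m3 = ((v - v.mean()) ** 3).mean()   # 3rd central moment, divided by n
        s = np.sqrt(((v - v.mean()) ** 2).sum() / (len(v) - 1))
        return m3 / s ** 3

    def kurtosis(v):
        m4 = ((v - v.mean()) ** 4).mean()   # 4th central moment, divided by n
        s = np.sqrt(((v - v.mean()) ** 2).sum() / (len(v) - 1))
        return m4 / s ** 4 - 3

    def se_skewness(n):
        return np.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

    def se_kurtosis(n):
        return np.sqrt(24.0 * n * (n - 1) ** 2 / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))

    v = np.array([2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8])
    print(round(skewness(v), 3), round(kurtosis(v), 2))   # -0.184 -1.41
    print(round(se_skewness(len(v)), 3), round(se_kurtosis(len(v)), 3))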
+
+
+\paragraph{Categorical measures.}  Statistics that describe the sample of
+a categorical feature, either nominal or ordinal.  We represent all
+categories by integers from~1 to the number of categories; we call
+these integers \emph{category~IDs}.
+\begin{Description}
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Number of categories]
+\OutputRowText{\OutputRowIDNumCategories}
+The maximum category~ID that occurs in the sample.  Note that some
+categories with~IDs \emph{smaller} than this maximum~ID may have
+no~occurrences in the sample, without reducing the number of categories.
+However, any categories with~IDs \emph{larger} than the maximum~ID with
+no occurrences in the sample will not be counted.
+Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
+the number of categories is reported as~8.  Category~IDs 2 and~6, which have
+zero occurrences, are still counted; but if there is a category with
+ID${}=9$ and zero occurrences, it is not counted.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Mode]
+\OutputRowText{\OutputRowIDMode}
+The most frequently occurring category value.
+If several values share the greatest frequency of occurrence, then each
+of them is a mode; but here we report only the smallest of these modes.
+Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
+the modes are 3 and~7, with 3 reported.
+
+Computed by counting the number of occurrences for each category,
+then taking the smallest category~ID that has the maximum count.
+Note that the sample modes may be different from the distribution modes,
+i.e.\ the categories whose (hypothesized) underlying probability is the
+maximum over all categories.
+%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
+\item[\it Number of modes]
+\OutputRowText{\OutputRowIDNumModes}
+The number of category values that each have the largest frequency
+count in the sample.  
+Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
+there are two category IDs (3 and~7) that occur the maximum count of 4~times;
+hence, we return~2.
+
+Computed by counting the number of occurrences for each category,
+then counting how many categories have the maximum count.
+Note that the sample modes may be different from the distribution modes,
+i.e.\ the categories whose (hypothesized) underlying probability is the
+maximum over all categories.
+\end{Description}
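A small NumPy sketch (illustrative only, not SystemML code) of the categorical measures above, using the sample from the examples:

    import numpy as np

    v = np.array([1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8])

    num_categories = int(v.max())        # largest category ID in the sample
    counts = np.bincount(v)[1:]          # occurrence counts for IDs 1..max
    mode = int(np.argmax(counts)) + 1    # smallest ID with the maximum count
    num_modes = int((counts == counts.max()).sum())

    print(num_categories, mode, num_modes)  # 8 3 2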
+
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+The output matrix containing all computed statistics has $17$~rows and
+as many columns as in the input matrix~\texttt{X}.  Each row corresponds to
+a particular statistic, according to the convention specified in
+Table~\ref{table:univars}.  The first $14$~statistics are applicable for
+\emph{scale} columns, and the last $3$~statistics are applicable for categorical,
+i.e.\ nominal and ordinal, columns.
+
+
+\pagebreak[2]
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+\hml -f \UnivarScriptName{} -nvargs X=/user/biadmin/X.mtx
+  TYPES=/user/biadmin/types.mtx
+  STATS=/user/biadmin/stats.mtx
+
+}


[48/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1431] Throw controlled error when one-dimensional numpy array is passed to SystemML

Posted by de...@apache.org.
[SYSTEMML-1431] Throw controlled error when one-dimensional numpy array is passed to SystemML

Here is an example pyspark session demonstrating this PR:
>>> from mlxtend.data import mnist_data
>>> import numpy as np
>>> from sklearn.utils import shuffle
>>> X, y = mnist_data()
>>> from systemml import MLContext, dml
>>> ml = MLContext(sc)

Welcome to Apache SystemML!

>>> script = dml('print(sum(X))').input(X=X)
>>> ml.execute(script)
1.31267102E8
MLResults
>>> script = dml('print(sum(X))').input(X=y)
>>> ml.execute(script)
...
TypeError: Expected 2-dimensional ndarray, instead passed 1-dimensional
ndarray. Hint: If you intend to pass the 1-dimensional ndarray as a
column-vector, please reshape it: input_ndarray.reshape(-1, 1)
>>> script = dml('print(sum(X))').input(X=y.reshape(-1, 1))
>>> ml.execute(script)
22500.0

Closes #438.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/a1d73f80
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/a1d73f80
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/a1d73f80

Branch: refs/heads/gh-pages
Commit: a1d73f805bc6a94e953c0b999269b79fcbb07a16
Parents: 7407b70
Author: Niketan Pansare <np...@us.ibm.com>
Authored: Thu Mar 23 11:41:16 2017 -0700
Committer: Niketan Pansare <np...@us.ibm.com>
Committed: Thu Mar 23 11:44:33 2017 -0700

----------------------------------------------------------------------
 beginners-guide-python.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/a1d73f80/beginners-guide-python.md
----------------------------------------------------------------------
diff --git a/beginners-guide-python.md b/beginners-guide-python.md
index ffab09e..24f7151 100644
--- a/beginners-guide-python.md
+++ b/beginners-guide-python.md
@@ -183,7 +183,7 @@ y_train = diabetes.target[:-20]
 y_test = diabetes.target[-20:]
 # Train Linear Regression model
 X = sml.matrix(X_train)
-y = sml.matrix(y_train)
+y = sml.matrix(np.matrix(y_train).T)
 A = X.transpose().dot(X)
 b = X.transpose().dot(y)
 beta = sml.solve(A, b).toNumPy()
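For comparison (illustrative only, not part of the commit): the same normal-equation solve in plain NumPy, assuming X_train and y_train as defined earlier in the guide, with the 1-D target explicitly reshaped to a column vector as the error message in this commit suggests:

    import numpy as np

    X_np = np.asarray(X_train)
    y_np = np.asarray(y_train).reshape(-1, 1)   # 1-D target -> column vector
    beta = np.linalg.solve(X_np.T @ X_np, X_np.T @ y_np)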


[23/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1209] Add configurable API Docs menu to docs header

Posted by de...@apache.org.
[SYSTEMML-1209] Add configurable API Docs menu to docs header

Add configurable API Docs menu to docs header to link to generated
API docs from versioned documentation.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/7c17feb5
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/7c17feb5
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/7c17feb5

Branch: refs/heads/gh-pages
Commit: 7c17feb53adc2f96397c09b3f4185c94a38d5ed8
Parents: bfb93b0
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Tue Feb 7 12:19:45 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Tue Feb 7 12:19:45 2017 -0800

----------------------------------------------------------------------
 _config.yml          |  3 +++
 _layouts/global.html | 10 +++++++++-
 2 files changed, 12 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/7c17feb5/_config.yml
----------------------------------------------------------------------
diff --git a/_config.yml b/_config.yml
index 2f8c3e7..1d213d7 100644
--- a/_config.yml
+++ b/_config.yml
@@ -20,3 +20,6 @@ analytics_google_universal_tracking_id : UA-71553733-1
 
 # if FEEDBACK_LINKS is true, render feedback links
 FEEDBACK_LINKS: true
+
+# if API_DOCS_MENU is true, render API docs menu
+API_DOCS_MENU: false

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/7c17feb5/_layouts/global.html
----------------------------------------------------------------------
diff --git a/_layouts/global.html b/_layouts/global.html
index 6c87e0c..8e03017 100644
--- a/_layouts/global.html
+++ b/_layouts/global.html
@@ -73,8 +73,16 @@
                                 <li><a href="release-process.html">Release Process</a></li>
                             </ul>
                         </li>
+                        {% if site.API_DOCS_MENU == true %}
                         <li class="dropdown">
-                            <a href="#" class="dropdown-toggle" data-toggle="dropdown">Issue Tracking<b class="caret"></b></a>
+                            <a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
+                            <ul class="dropdown-menu" role="menu">
+                                <li><a href="./api/java/index.html">Javadoc</a></li>
+                            </ul>
+                        </li>
+                        {% endif %}
+                        <li class="dropdown">
+                            <a href="#" class="dropdown-toggle" data-toggle="dropdown">Issues<b class="caret"></b></a>
                             <ul class="dropdown-menu" role="menu">
                                 <li><b>JIRA:</b></li>
                                 <li><a href="https://issues.apache.org/jira/browse/SYSTEMML">SystemML JIRA</a></li>


[03/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1144] Fix PCA documentation for principal

Posted by de...@apache.org.
[SYSTEMML-1144] Fix PCA documentation for principal

Update 'principle' to 'principal'.

Closes #311.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/8b917582
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/8b917582
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/8b917582

Branch: refs/heads/gh-pages
Commit: 8b917582dfdae9dc001115ea3376e94d7f49e2d2
Parents: fa88464
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Thu Dec 8 13:24:29 2016 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Thu Dec 8 13:24:29 2016 -0800

----------------------------------------------------------------------
 Algorithms Reference/PCA.tex       | 16 ++++++++--------
 algorithms-matrix-factorization.md | 28 ++++++++++++++--------------
 algorithms-reference.md            |  2 +-
 3 files changed, 23 insertions(+), 23 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/8b917582/Algorithms Reference/PCA.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/PCA.tex b/Algorithms Reference/PCA.tex
index 5895502..cef750e 100644
--- a/Algorithms Reference/PCA.tex	
+++ b/Algorithms Reference/PCA.tex	
@@ -19,12 +19,12 @@
 
 \end{comment}
 
-\subsection{Principle Component Analysis}
+\subsection{Principal Component Analysis}
 \label{pca}
 
 \noindent{\bf Description}
 
-Principle Component Analysis (PCA) is a simple, non-parametric method to transform the given data set with possibly correlated columns into a set of linearly uncorrelated or orthogonal columns, called {\em principle components}. The principle components are ordered in such a way that the first component accounts for the largest possible variance, followed by remaining principle components in the decreasing order of the amount of variance captured from the data. PCA is often used as a dimensionality reduction technique, where the original data is projected or rotated onto a low-dimensional space with basis vectors defined by top-$K$ (for a given value of $K$) principle components.
+Principal Component Analysis (PCA) is a simple, non-parametric method to transform the given data set with possibly correlated columns into a set of linearly uncorrelated or orthogonal columns, called {\em principal components}. The principal components are ordered in such a way that the first component accounts for the largest possible variance, followed by remaining principal components in the decreasing order of the amount of variance captured from the data. PCA is often used as a dimensionality reduction technique, where the original data is projected or rotated onto a low-dimensional space with basis vectors defined by top-$K$ (for a given value of $K$) principal components.
 \\
 
 \noindent{\bf Usage}
@@ -45,10 +45,10 @@ Principle Component Analysis (PCA) is a simple, non-parametric method to transfo
 
 \begin{itemize}
 \item INPUT: Location (on HDFS) to read the input matrix.
-\item K: Indicates dimension of the new vector space constructed from $K$ principle components. It must be a value between $1$ and the number of columns in the input data.
-\item CENTER (default: {\tt 0}): Indicates whether or not to {\em center} input data prior to the computation of principle components.
-\item SCALE (default: {\tt 0}): Indicates whether or not to {\em scale} input data prior to the computation of principle components.
-\item PROJDATA: Indicates whether or not the input data must be projected on to new vector space defined over principle components.
+\item K: Indicates dimension of the new vector space constructed from $K$ principal components. It must be a value between $1$ and the number of columns in the input data.
+\item CENTER (default: {\tt 0}): Indicates whether or not to {\em center} input data prior to the computation of principal components.
+\item SCALE (default: {\tt 0}): Indicates whether or not to {\em scale} input data prior to the computation of principal components.
+\item PROJDATA: Indicates whether or not the input data must be projected on to new vector space defined over principal components.
 \item OFMT (default: {\tt csv}): Specifies the output format. Choice of comma-separated values (csv) or as a sparse-matrix (text).
 \item MODEL: Either the location (on HDFS) where the computed model is stored; or the location of an existing model.
 \item OUTPUT: Location (on HDFS) to store the data rotated on to the new vector space.
@@ -56,7 +56,7 @@ Principle Component Analysis (PCA) is a simple, non-parametric method to transfo
 
 \noindent{\bf Details}
 
-Principle Component Analysis (PCA) is a non-parametric procedure for orthogonal linear transformation of the input data to a new coordinate system, such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. In other words, PCA first selects a normalized direction in $m$-dimensional space ($m$ is the number of columns in the input data) along which the variance in input data is maximized -- this is referred to as the first principle component. It then repeatedly finds other directions (principle components) in which the variance is maximized. At every step, PCA restricts the search for only those directions that are perpendicular to all previously selected directions. By doing so, PCA aims to reduce the redundancy among input variables. To understand the notion of redundancy, consider an extreme scenario with a data set comprising of two variables, where the first one denotes some quantity expressed in meters, and the other variable represents the same quantity but in inches. Both these variables evidently capture redundant information, and hence one of them can be removed. In a general scenario, keeping solely the linear combination of input variables would both express the data more concisely and reduce the number of variables. This is why PCA is often used as a dimensionality reduction technique.
+Principal Component Analysis (PCA) is a non-parametric procedure for orthogonal linear transformation of the input data to a new coordinate system, such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. In other words, PCA first selects a normalized direction in $m$-dimensional space ($m$ is the number of columns in the input data) along which the variance in input data is maximized -- this is referred to as the first principal component. It then repeatedly finds other directions (principal components) in which the variance is maximized. At every step, PCA restricts the search for only those directions that are perpendicular to all previously selected directions. By doing so, PCA aims to reduce the redundancy among input variables. To understand the notion of redundancy, consider an extreme scenario with a data set comprising of two variables, where the first one denotes some quantity expressed in meters, and the other variable represents the same quantity but in inches. Both these variables evidently capture redundant information, and hence one of them can be removed. In a general scenario, keeping solely the linear combination of input variables would both express the data more concisely and reduce the number of variables. This is why PCA is often used as a dimensionality reduction technique.
 
 The specific method to compute such a new coordinate system is as follows -- compute a covariance matrix $C$ that measures the strength of correlation among all pairs of variables in the input data; factorize $C$ according to eigen decomposition to calculate its eigenvalues and eigenvectors; and finally, order eigenvectors in the decreasing order of their corresponding eigenvalue. The computed eigenvectors (also known as {\em loadings}) define the new coordinate system and the square root of eigen values provide the amount of variance in the input data explained by each coordinate or eigenvector. 
 \\
@@ -112,7 +112,7 @@ The specific method to compute such a new coordinate system is as follows -- com
 
 \noindent{\bf Returns}
 When MODEL is not provided, PCA procedure is applied on INPUT data to generate MODEL as well as the rotated data OUTPUT (if PROJDATA is set to $1$) in the new coordinate system. 
-The produced model consists of basis vectors MODEL$/dominant.eigen.vectors$ for the new coordinate system; eigen values MODEL$/dominant.eigen.values$; and the standard deviation MODEL$/dominant.eigen.standard.deviations$ of principle components.
+The produced model consists of basis vectors MODEL$/dominant.eigen.vectors$ for the new coordinate system; eigen values MODEL$/dominant.eigen.values$; and the standard deviation MODEL$/dominant.eigen.standard.deviations$ of principal components.
 When MODEL is provided, INPUT data is rotated according to the coordinate system defined by MODEL$/dominant.eigen.vectors$. The resulting data is stored at location OUTPUT.
 \\
 

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/8b917582/algorithms-matrix-factorization.md
----------------------------------------------------------------------
diff --git a/algorithms-matrix-factorization.md b/algorithms-matrix-factorization.md
index 2ed8a49..51eb614 100644
--- a/algorithms-matrix-factorization.md
+++ b/algorithms-matrix-factorization.md
@@ -25,20 +25,20 @@ limitations under the License.
 # 5 Matrix Factorization
 
 
-## 5.1 Principle Component Analysis
+## 5.1 Principal Component Analysis
 
 ### Description
 
-Principle Component Analysis (PCA) is a simple, non-parametric method to
+Principal Component Analysis (PCA) is a simple, non-parametric method to
 transform the given data set with possibly correlated columns into a set
-of linearly uncorrelated or orthogonal columns, called *principle
-components*. The principle components are ordered in such a way
+of linearly uncorrelated or orthogonal columns, called *principal
+components*. The principal components are ordered in such a way
 that the first component accounts for the largest possible variance,
-followed by remaining principle components in the decreasing order of
+followed by remaining principal components in the decreasing order of
 the amount of variance captured from the data. PCA is often used as a
 dimensionality reduction technique, where the original data is projected
 or rotated onto a low-dimensional space with basis vectors defined by
-top-$K$ (for a given value of $K$) principle components.
+top-$K$ (for a given value of $K$) principal components.
 
 
 ### Usage
@@ -80,19 +80,19 @@ top-$K$ (for a given value of $K$) principle components.
 **INPUT**: Location (on HDFS) to read the input matrix.
 
 **K**: Indicates dimension of the new vector space constructed from $K$
-    principle components. It must be a value between `1` and the number
+    principal components. It must be a value between `1` and the number
     of columns in the input data.
 
 **CENTER**: (default: `0`) `0` or `1`. Indicates whether or not to
     *center* input data prior to the computation of
-    principle components.
+    principal components.
 
 **SCALE**: (default: `0`) `0` or `1`. Indicates whether or not to
     *scale* input data prior to the computation of
-    principle components.
+    principal components.
 
 **PROJDATA**: `0` or `1`. Indicates whether or not the input data must be projected
-    on to new vector space defined over principle components.
+    on to new vector space defined over principal components.
 
 **OFMT**: (default: `"csv"`) Matrix file output format, such as `text`,
 `mm`, or `csv`; see read/write functions in
@@ -170,7 +170,7 @@ SystemML Language Reference for details.
 
 #### Details
 
-Principle Component Analysis (PCA) is a non-parametric procedure for
+Principal Component Analysis (PCA) is a non-parametric procedure for
 orthogonal linear transformation of the input data to a new coordinate
 system, such that the greatest variance by some projection of the data
 comes to lie on the first coordinate (called the first principal
@@ -178,8 +178,8 @@ component), the second greatest variance on the second coordinate, and
 so on. In other words, PCA first selects a normalized direction in
 $m$-dimensional space ($m$ is the number of columns in the input data)
 along which the variance in input data is maximized – this is referred
-to as the first principle component. It then repeatedly finds other
-directions (principle components) in which the variance is maximized. At
+to as the first principal component. It then repeatedly finds other
+directions (principal components) in which the variance is maximized. At
 every step, PCA restricts the search for only those directions that are
 perpendicular to all previously selected directions. By doing so, PCA
 aims to reduce the redundancy among input variables. To understand the
@@ -211,7 +211,7 @@ OUTPUT (if PROJDATA is set to $1$) in the new coordinate system. The
 produced model consists of basis vectors MODEL$/dominant.eigen.vectors$
 for the new coordinate system; eigen values
 MODEL$/dominant.eigen.values$; and the standard deviation
-MODEL$/dominant.eigen.standard.deviations$ of principle components. When
+MODEL$/dominant.eigen.standard.deviations$ of principal components. When
 MODEL is provided, INPUT data is rotated according to the coordinate
 system defined by MODEL$/dominant.eigen.vectors$. The resulting data is
 stored at location OUTPUT.

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/8b917582/algorithms-reference.md
----------------------------------------------------------------------
diff --git a/algorithms-reference.md b/algorithms-reference.md
index 244b882..26c2141 100644
--- a/algorithms-reference.md
+++ b/algorithms-reference.md
@@ -48,7 +48,7 @@ limitations under the License.
   * [Regression Scoring and Prediction](algorithms-regression.html#regression-scoring-and-prediction)
   
 * [Matrix Factorization](algorithms-matrix-factorization.html)
-  * [Principle Component Analysis](algorithms-matrix-factorization.html#principle-component-analysis)
+  * [Principal Component Analysis](algorithms-matrix-factorization.html#principal-component-analysis)
   * [Matrix Completion via Alternating Minimizations](algorithms-matrix-factorization.html#matrix-completion-via-alternating-minimizations)
 
 * [Survival Analysis](algorithms-survival-analysis.html)


[41/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1393] Exclude alg ref and lang ref dirs from doc site

Posted by de...@apache.org.
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/SystemML_Algorithms_Reference.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/SystemML_Algorithms_Reference.tex b/Algorithms Reference/SystemML_Algorithms_Reference.tex
deleted file mode 100644
index 75308c9..0000000
--- a/Algorithms Reference/SystemML_Algorithms_Reference.tex	
+++ /dev/null
@@ -1,174 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\documentclass[letter]{article}
-\usepackage{graphicx,amsmath,amssymb,amsthm,subfigure,color,url,multirow,rotating,comment}
-\usepackage{tikz}
-\usepackage[normalem]{ulem}
-\usepackage[np,autolanguage]{numprint}
-\usepackage{tabularx}
-
-\usepackage[pdftex]{hyperref}
-\hypersetup{
-    unicode=false,          % non-Latin characters in Acrobat's bookmarks
-    pdftoolbar=true,        % show Acrobat's toolbar?
-    pdfmenubar=true,        % show Acrobat's menu?
-    pdffitwindow=true,      % window fit to page when opened
-    pdfstartview={FitV},    % fits the width of the page to the window
-    pdftitle={SystemML Algorithms Reference},    % title
-    pdfauthor={SystemML Team}, % author
-    pdfsubject={Documentation},   % subject of the document
-    pdfkeywords={},         % list of keywords
-    pdfnewwindow=true,      % links in new window
-    bookmarksnumbered=true, % put section numbers in bookmarks
-    bookmarksopen=true,     % open up bookmark tree
-    bookmarksopenlevel=1,   % \maxdimen level to which bookmarks are open
-    colorlinks=true,        % false: boxed links; true: colored links
-    linkcolor=black,        % color of internal links  
-    citecolor=blue,         % color of links to bibliography
-    filecolor=black,        % color of file links
-    urlcolor=black          % color of external links
-}
-
-
-\newtheorem{definition}{Definition}
-\newtheorem{example}{Example}
-
-\newcommand{\Paragraph}[1]{\vspace*{1ex} \noindent {\bf #1} \hspace*{1ex}}
-\newenvironment{Itemize}{\vspace{-0.5ex}\begin{itemize}\setlength{\itemsep}{-0.2ex}
-}{\end{itemize}\vspace{-0.5ex}}
-\newenvironment{Enumerate}{\vspace{-0.5ex}\begin{enumerate}\setlength{\itemsep}{-0.2ex}
-}{\end{enumerate}\vspace{-0.5ex}}
-\newenvironment{Description}{\vspace{-0.5ex}\begin{description}\setlength{\itemsep}{-0.2ex}
-}{\end{description}\vspace{-0.5ex}}
-
-
-\newcommand{\SystemML}{\texttt{SystemML} }
-\newcommand{\hml}{\texttt{hadoop jar SystemML.jar} }
-\newcommand{\pxp}{\mathbin{\texttt{\%\textasteriskcentered\%}}}
-\newcommand{\todo}[1]{{{\color{red}TODO: #1}}}
-\newcommand{\Normal}{\ensuremath{\mathop{\mathrm{Normal}}\nolimits}}
-\newcommand{\Prob}{\ensuremath{\mathop{\mathrm{Prob}\hspace{0.5pt}}\nolimits}}
-\newcommand{\E}{\ensuremath{\mathop{\mathrm{E}}\nolimits}}
-\newcommand{\mean}{\ensuremath{\mathop{\mathrm{mean}}\nolimits}}
-\newcommand{\Var}{\ensuremath{\mathop{\mathrm{Var}}\nolimits}}
-\newcommand{\Cov}{\ensuremath{\mathop{\mathrm{Cov}}\nolimits}}
-\newcommand{\stdev}{\ensuremath{\mathop{\mathrm{st.dev}}\nolimits}}
-\newcommand{\atan}{\ensuremath{\mathop{\mathrm{arctan}}\nolimits}}
-\newcommand{\diag}{\ensuremath{\mathop{\mathrm{diag}}\nolimits}}
-\newcommand{\const}{\ensuremath{\mathop{\mathrm{const}}\nolimits}}
-\newcommand{\eps}{\varepsilon}
-
-\sloppy
-
-%%%%%%%%%%%%%%%%%%%%% 
-% header
-%%%%%%%%%%%%%%%%%%%%%
-
-\title{\LARGE{{\SystemML Algorithms Reference}}} 
-\date{\today}
-
-%%%%%%%%%%%%%%%%%%%%%
-% document start
-%%%%%%%%%%%%%%%%%%%%%
-\begin{document}	
-
-%\pagenumbering{roman}
-\maketitle
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{Descriptive Statistics}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\input{DescriptiveStats}
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{Classification}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\input{LogReg}
-
-\subsection{Support Vector Machines}
-
-\input{BinarySVM}
-
-\input{MultiSVM}
-
-\input{NaiveBayes}
-
-\input{DecisionTrees}
-
-\input{RandomForest}
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{Clustering}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\input{Kmeans}
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{Regression}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\input{LinReg}
-
-\input{StepLinRegDS}
-
-\input{GLM}
-
-\input{StepGLM}
-
-\input{GLMpredict.tex}
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{Matrix Factorization}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\input{pca}
-
-\input{ALS.tex}
-
-%%{\color{red}\subsection{GNMF}}
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%{\color{red}\section{Sequence Mining}}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{Survival Analysis}
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\input{KaplanMeier}
-
-\input{Cox}
-
-\bibliographystyle{abbrv}
-
-\bibliography{SystemML_ALgorithms_Reference}
-
-	
-%%%%%%%%%%%%%%%%%%%%%
-% document end
-%%%%%%%%%%%%%%%%%%%%%
-\end{document}
-
-

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Language Reference/PyDML Language Reference.doc
----------------------------------------------------------------------
diff --git a/Language Reference/PyDML Language Reference.doc b/Language Reference/PyDML Language Reference.doc
deleted file mode 100644
index b43b6db..0000000
Binary files a/Language Reference/PyDML Language Reference.doc and /dev/null differ

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Language Reference/Python syntax for DML.doc
----------------------------------------------------------------------
diff --git a/Language Reference/Python syntax for DML.doc b/Language Reference/Python syntax for DML.doc
deleted file mode 100644
index ee43a6b..0000000
Binary files a/Language Reference/Python syntax for DML.doc and /dev/null differ

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Language Reference/README_HADOOP_CONFIG.txt
----------------------------------------------------------------------
diff --git a/Language Reference/README_HADOOP_CONFIG.txt b/Language Reference/README_HADOOP_CONFIG.txt
deleted file mode 100644
index e34d4f3..0000000
--- a/Language Reference/README_HADOOP_CONFIG.txt	
+++ /dev/null
@@ -1,83 +0,0 @@
-Usage
------
-The machine learning algorithms described in SystemML_Algorithms_Reference.pdf can be invoked
-from the hadoop command line using the described, algorithm-specific parameters. 
-
-Generic command line arguments are provided by the help command below.
-
-   hadoop jar SystemML.jar -? or -help 
-
-
-Recommended configurations
---------------------------
-1) JVM Heap Sizes: 
-We recommend an equal-sized JVM configuration for clients, mappers, and reducers. For the client
-process this can be done via
-
-   export HADOOP_CLIENT_OPTS="-Xmx2048m -Xms2048m -Xmn256m" 
-   
-where Xmx specifies the maximum heap size, Xms the initial heap size, and Xmn is size of the young 
-generation. For Xmn values of equal or less than 15% of the max heap size, we guarantee the memory budget.
-
-For mapper or reducer JVM configurations, the following properties can be specified in mapred-site.xml,
-where 'child' refers to both mapper and reducer. If map and reduce are specified individually, they take 
-precedence over the generic property.
-
-  <property>
-    <name>mapreduce.child.java.opts</name> <!-- synonym: mapred.child.java.opts -->
-    <value>-Xmx2048m -Xms2048m -Xmn256m</value>
-  </property>
-  <property>
-    <name>mapreduce.map.java.opts</name> <!-- synonym: mapred.map.java.opts -->
-    <value>-Xmx2048m -Xms2048m -Xmn256m</value>
-  </property>
-  <property>
-    <name>mapreduce.reduce.java.opts</name> <!-- synonym: mapred.reduce.java.opts -->
-    <value>-Xmx2048m -Xms2048m -Xmn256m</value>
-  </property>
- 
-
-2) CP Memory Limitation:
-There exist size limitations for in-memory matrices. Dense in-memory matrices are limited to 16GB 
-independent of their dimension. Sparse in-memory matrices are limited to 2G rows and 2G columns 
-but the overall matrix can be larger. These limitations do only apply to in-memory matrices but 
-NOT in HDFS or involved in MR computations. Setting HADOOP_CLIENT_OPTS below those limitations 
-prevents runtime errors.
-
-3) Transparent Huge Pages (on Red Hat Enterprise Linux 6):
-Hadoop workloads might show very high System CPU utilization if THP is enabled. In case of such 
-behavior, we recommend to disable THP with
-   
-   echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
-   
-4) JVM Reuse:
-Performance benefits from JVM reuse because data sets that fit into the mapper memory budget are 
-reused across tasks per slot. However, Hadoop 1.0.3 JVM Reuse is incompatible with security (when 
-using the LinuxTaskController). The workaround is to use the DefaultTaskController. SystemML provides 
-a configuration property in SystemML-config.xml to enable JVM reuse on a per job level without
-changing the global cluster configuration.
-   
-   <jvmreuse>false</jvmreuse> 
-   
-5) Number of Reducers:
-The number of reducers can have significant impact on performance. SystemML provides a configuration
-property to set the default number of reducers per job without changing the global cluster configuration.
-In general, we recommend a setting of twice the number of nodes. Smaller numbers create less intermediate
-files, larger numbers increase the degree of parallelism for compute and parallel write. In
-SystemML-config.xml, set:
-   
-   <!-- default number of reduce tasks per MR job, default: 2 x number of nodes -->
-   <numreducers>12</numreducers> 
-
-6) SystemML temporary directories:
-SystemML uses temporary directories in two different locations: (1) on local file system for temping from 
-the client process, and (2) on HDFS for intermediate results between different MR jobs and between MR jobs 
-and in-memory operations. Locations of these directories can be configured in SystemML-config.xml with the
-following properties:
-
-   <!-- local fs tmp working directory-->
-   <localtmpdir>/tmp/systemml</localtmpdir>
-
-   <!-- hdfs tmp working directory--> 
-   <scratch>scratch_space</scratch> 
- 
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/_config.yml
----------------------------------------------------------------------
diff --git a/_config.yml b/_config.yml
index ba1a808..15e0852 100644
--- a/_config.yml
+++ b/_config.yml
@@ -10,6 +10,10 @@ include:
   - _static
   - _modules
 
+exclude:
+  - alg-ref
+  - lang-ref
+
 # These allow the documentation to be updated with newer releases
 SYSTEMML_VERSION: Latest
 

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/ALS.tex
----------------------------------------------------------------------
diff --git a/alg-ref/ALS.tex b/alg-ref/ALS.tex
new file mode 100644
index 0000000..c2a5e3a
--- /dev/null
+++ b/alg-ref/ALS.tex
@@ -0,0 +1,298 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Matrix Completion via Alternating Minimizations}
+\label{matrix_completion}
+
+\noindent{\bf Description}
+\smallskip
+
+Low-rank matrix completion is an effective technique for statistical data analysis that is widely used in data mining and machine learning applications.
+Matrix completion is a variant of low-rank matrix factorization with the goal of recovering a partially observed and potentially noisy matrix from a subset of its revealed entries.
+Perhaps the most popular application in which matrix completion has been successfully applied is collaborative filtering in recommender systems. 
+In this setting, the rows in the data matrix correspond to users, 
+the columns to items such as movies, and entries to feedback provided by users for items. 
+The goal is to predict missing entries of the rating matrix. 
+This implementation uses the alternating least-squares (ALS) technique for solving large-scale matrix completion problems.\\ 
+
+
+\smallskip
+\noindent{\bf Usage}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+	{\tt{}-f }path/\/{\tt{}ALS.dml}
+	{\tt{} -nvargs}
+	{\tt{} V=}path/file
+	{\tt{} L=}path/file
+	{\tt{} R=}path/file
+%	{\tt{} VO=}path/file
+	{\tt{} rank=}int
+	{\tt{} reg=}L2$\mid$wL2%regularization
+	{\tt{} lambda=}double
+	{\tt{} fmt=}format
+	
+}
+
+
+\smallskip
+\noindent{\bf Arguments}
+\begin{Description}
+	\item[{\tt V}:]
+	Location (on HDFS) to read the input (user-item) matrix $V$ to be factorized
+	\item[{\tt L}:]
+	Location (on HDFS) to write the left (user) factor matrix $L$
+	\item[{\tt R}:]
+	Location (on HDFS) to write the right (item) factor matrix $R$
+%	\item[{\tt VO}:]
+%	Location (on HDFS) to write the input matrix $VO$ with empty rows and columns removed (if there are any)
+	\item[{\tt rank}:] (default:\mbox{ }{\tt 10})
+	Rank of the factorization
+	\item[{\tt reg}] (default:\mbox{ }{\tt L2})
+	Regularization:\\
+	{\tt L2} = L2 regularization;\\
+ 	{\tt wL2} = weighted L2 regularization;\\
+ 	if {\tt reg} is not provided no regularization will be performed. 
+ 	\item[{\tt lambda}:] (default:\mbox{ }{\tt 0.000001})
+ 	Regularization parameter
+ 	\item[{\tt maxi}:] (default:\mbox{ }{\tt 50})
+	 Maximum number of iterations
+	\item[{\tt check}:] (default:\mbox{ }{\tt FALSE})
+	Check for convergence after every iteration, i.e., updating $L$ and $R$ once
+	\item[{\tt thr}:] (default:\mbox{ }{\tt 0.0001})
+	Assuming {\tt check=TRUE}, the algorithm stops and convergence is declared 
+	if the decrease in loss in any two consecutive iterations falls below threshold {\tt thr}; 
+	if {\tt check=FALSE} parameter {\tt thr} is ignored.
+	\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+	Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}
+\end{Description}
+ 
+ \smallskip
+ \noindent{\bf Usage: ALS Prediction/Top-K Prediction}
+ \smallskip
+ 
+ {\hangindent=\parindent\noindent\it%
+ 	{\tt{}-f }path/\/{\tt{}ALS\_predict.dml}
+ 	{\tt{} -nvargs}
+ 	{\tt{} X=}path/file
+ 	{\tt{} Y=}path/file
+ 	{\tt{} L=}path/file
+ 	{\tt{} R=}path/file
+ 	{\tt{} Vrows=}int
+ 	{\tt{} Vcols=}int
+ 	{\tt{} fmt=}format
+ 	
+ }\smallskip
+ 
+ 
+  \smallskip  
+  {\hangindent=\parindent\noindent\it%
+  	{\tt{}-f }path/\/{\tt{}ALS\_topk\_predict.dml}
+  	{\tt{} -nvargs}
+  	{\tt{} X=}path/file
+  	{\tt{} Y=}path/file
+  	{\tt{} L=}path/file
+  	{\tt{} R=}path/file
+  	{\tt{} V=}path/file
+  	{\tt{} K=}int
+  	{\tt{} fmt=}format
+  	
+  }\smallskip
+ 
+%   \noindent{\bf Arguments --- Prediction}
+%   \begin{Description}
+%   	\item[{\tt X}:]
+%   	Location (on HDFS) to read the input matrix $X$ containing user-ids (first column) and item-ids (second column) 
+%   	\item[{\tt L}:]
+%   	Location (on HDFS) to read the left (user) factor matrix $L$
+%   	\item[{\tt R}:]
+%   	Location (on HDFS) to read the right (item) factor matrix $R$
+%   	\item[{\tt Y}:]
+%   	Location (on HDFS) to write the output matrix $Y$ containing user-ids (first column), item-ids (second column) and predicted ratings (third column)
+%   	\item[{\tt Vrows}:] 
+%   	Number of rows of the user-item matrix $V$
+%   	\item[{\tt Vcols}] 
+%   	Number of columns of the user-item matrix $V$ 
+%   	\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+%   	Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}
+%   \end{Description}
+   
+
+  \noindent{\bf Arguments --- Prediction/Top-K Prediction}
+  \begin{Description}
+  	\item[{\tt V}:]
+  	Location (on HDFS) to read the user-item matrix $V$ 
+  	\item[{\tt X}:]
+  	Location (on HDFS) to read the input matrix $X$ with following format:
+  	\begin{itemize}
+  		\item for {ALS\_predict.dml}: a 2-column matrix that contains the user-ids (first column) and the item-ids (second column),
+  		\item for {ALS\_topk\_predict.dml}: a 1-column matrix that contains the user-ids.
+  	\end{itemize} 
+  	\item[{\tt Y}:]
+  	Location (on HDFS) to write the output of prediction with the following format:
+  	\begin{itemize}
+  		\item for {ALS\_predict.dml}: a 3-column matrix that contains the user-ids (first column), the item-ids (second column) and the predicted ratings (third column),
+  		\item for {ALS\_topk\_predict.dml}: a ($K+1$)-column matrix that contains the user-ids in the first column and the top-K item-ids in the remaining $K$ columns will be stored at {\tt Y}.
+  		Additionally, a matrix with the same dimensions that contains the corresponding actual top-K ratings will be stored at {\tt Y.ratings}; see below for details. 
+  	\end{itemize}
+%  	Note the following output format in predicting top-K items. 
+%  	For a user with no available ratings in $V$ no 
+%  	top-K items will be provided, i.e., the corresponding row in $Y$ will contains 0s.   
+%  	Moreover, $K'<K$ items with highest predicted ratings will be provided for a user $i$ 
+%  	if the number of missing ratings $K'$ (i.e., those with 0 value in $V$) for $i$ is less than $K$.
+  	\item[{\tt L}:]
+  	Location (on HDFS) to read the left (user) factor matrix $L$
+  	\item[{\tt R}:]
+  	Location (on HDFS) to write the right (item) factor matrix $R$
+   	\item[{\tt Vrows}:] 
+   	Number of rows of $V$ (i.e., number of users)
+   	\item[{\tt Vcols}] 
+   	Number of columns of $V$ (i.e., number of items) 
+  	\item[{\tt K}:] (default:\mbox{ }{\tt 5})
+  	Number of top-K items for top-K prediction
+  	\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+  	Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}
+  \end{Description}
+  
+ \noindent{\bf Details}
+ \smallskip
+ 
+ Given an $m \times n$ input matrix $V$ and a rank parameter $r \ll \min{(m,n)}$, low-rank matrix factorization seeks to find an $m \times r$ matrix $L$ and an $r \times n$ matrix $R$ such that $V \approx LR$, i.e., we aim to approximate $V$ by the low-rank matrix $LR$.
+ The quality of the approximation is determined by an application-dependent loss function $\mathcal{L}$. We aim at finding the loss-minimizing factor matrices, i.e., 
+ \begin{equation}\label{eq:problem}
+ (L^*, R^*) = \textrm{argmin}_{L,R}{\mathcal{L}(V,L,R)}.
+ \end{equation} 
+ In the context of collaborative filtering in recommender systems it is often the case that the input matrix $V$ contains several missing entries. Such entries are coded with the value 0, and the loss function is computed based only on the nonzero entries in $V$, i.e.,
+ \begin{equation*} %\label{eq:loss}
+ \mathcal{L}=\sum_{(i,j)\in\Omega}l(V_{ij},L_{i*},R_{*j}),
+ \end{equation*} 
+ where $L_{i*}$ and $R_{*j}$, respectively, denote the $i$th row of $L$ and the $j$th column of $R$, $\Omega=\{\omega_1,\dots,\omega_N\}$ denotes the training set containing the observed (nonzero) entries in $V$, and $l$ is some local loss function.  
+ %for some training set $\Omega$ that contains the observed (nonzero) entries in $V$ and some local loss function $l$. In the above formula, 
+ 
+ ALS is an optimization technique that can be used to solve quadratic problems. 
+ For matrix completion, the algorithm repeatedly keeps one of the unknown matrices ($L$ or $R$) fixed and optimizes the other one. In particular, ALS alternates between recomputing the rows of $L$ in one step and the columns of $R$ in the subsequent step.  
+ Our implementation of the ALS algorithm supports the loss functions summarized in Table~\ref{tab:loss_functions} commonly used for matrix completion~\cite{ZhouWSP08:als}. 
+ %
+ \begin{table}[t]
+ 	\centering
+ 	\begin{tabular}{|ll|} \hline
+ 		Loss & Definition \\ \hline
+% 		$\mathcal{L}_\text{Sl}$ & $\sum_{i,j} (V_{ij} - [LR]_{ij})^2$ \\
+% 		$\mathcal{L}_\text{Sl+L2}$ & $\mathcal{L}_\text{Sl} + \lambda \Bigl( \sum_{ik} L_{ik}^2 + \sum_{kj} R_{kj}^2 \Bigr)$ \\
+ 		$\mathcal{L}_\text{Nzsl}$ & $\sum_{i,j:V_{ij}\neq 0} (V_{ij} - [LR]_{ij})^2$ \\
+ 		$\mathcal{L}_\text{Nzsl+L2}$ & $\mathcal{L}_\text{Nzsl} + \lambda \Bigl( \sum_{ik} L_{ik}^2 + \sum_{kj} R_{kj}^2 \Bigr)$ \\
+ 		$\mathcal{L}_\text{Nzsl+wL2}$ & $\mathcal{L}_\text{Nzsl} + \lambda \Bigl(\sum_{ik}N_{i*} L_{ik}^2 + \sum_{kj}N_{*j} R_{kj}^2 \Bigr)$ \\ \hline 
+ 	\end{tabular}
+ 	\caption{Popular loss functions supported by our ALS implementation; $N_{i*}$ and $N_{*j}$, respectively, denote the number of nonzero entries in row $i$ and column $j$ of $V$.}\label{tab:loss_functions}
+ \end{table}
+ 
+ Note that the matrix completion problem as defined in (\ref{eq:problem}) is a non-convex problem for all loss functions from Table~\ref{tab:loss_functions}. 
+ However, when fixing one of the matrices $L$ or $R$, we get a least-squares problem with a globally optimal solution.  
+ For example, for the case of $\mathcal{L}_\text{Nzsl+wL2}$ we have the following closed form solutions
+  \begin{align*}
+  L^\top_{n+1,i*} &\leftarrow (R^{(i)}_n {[R^{(i)}_n]}^\top + \lambda N_2 I)^{-1} R_n V^\top_{i*}, \\
+  R_{n+1,*j} &\leftarrow ({[L^{(j)}_{n+1}]}^\top L^{(j)}_{n+1} + \lambda N_1 I)^{-1} L^\top_{n+1} V_{*j}, 
+  \end{align*}
+ where $L_{n+1,i*}$ (resp. $R_{n+1,*j}$) denotes the $i$th row of $L_{n+1}$ (resp. $j$th column of $R_{n+1}$), $\lambda$ denotes 
+ the regularization parameter, $I$ is the identity matrix of appropriate dimensionality, 
+ $V_{i*}$ (resp. $V_{*j}$) denotes the revealed entries in row $i$ (column $j$), 
+ $R^{(i)}_n$ (resp. $L^{(j)}_{n+1}$) refers to the corresponding columns of $R_n$ (rows of $L_{n+1}$), 
+ and $N_1$ (resp. $N_2$) denotes a diagonal matrix that contains the number of nonzero entries in row $i$ (column $j$) of $V$.   
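+ 
+ As a purely illustrative sketch (and not the code in {\tt ALS.dml} itself), one
+ alternating sweep for a plain L2-regularized squared loss over a small, fully
+ observed $V$ could be written in DML along the following lines, assuming the
+ matrices {\tt L} and {\tt R}, the scalar {\tt lambda}, and the rank {\tt r} have
+ already been initialized:
+ \begin{verbatim}
+ # hypothetical sketch of one ALS sweep (plain L2 regularization);
+ # the actual scripts optimize the nonzero-entries-only losses above
+ I = diag(matrix(1, rows=r, cols=1))   # r x r identity matrix
+ for (i in 1:nrow(V)) {                # recompute the rows of L
+   L[i,] = t(solve(R %*% t(R) + lambda * I, R %*% t(V[i,])))
+ }
+ for (j in 1:ncol(V)) {                # recompute the columns of R
+   R[,j] = solve(t(L) %*% L + lambda * I, t(L) %*% V[,j])
+ }
+ \end{verbatim}
+ Each least-squares subproblem above is a simplified (fully observed, unweighted)
+ analogue of the closed-form updates shown earlier, with the regularization term
+ added to the normal equations.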
+ 
+% For example, for the case of $\mathcal{L}_\text{Sl-L2}$ we have the following closed form solutions
+% \begin{align*}
+% L^\top_{n+1,i*} &\leftarrow (R_n {[R_n]}^\top + \lambda I)^{-1} R_n V^\top_{i*}, \\
+% R_{n+1,*j} &\leftarrow ({[L_{n+1}]}^\top L_{n+1} + \lambda I)^{-1} L^\top_{n+1} V_{*j}, 
+% \end{align*}
+% where $L_{n+1,i*}$ (resp. $R_{n+1,*j}$) denotes the $i$th row of $L_{n+1}$ (resp. $j$th column of $R_{n+1}$), $\lambda$ denotes 
+% the regularization parameter and $I$ is the identity matrix of appropriate dimensionality. 
+% For the case of $\mathcal{L}_\text{Nzsl}$ we need to remove the equation that correspond to zero entries of $V$ from the least-squares problems. 
+% With wL2 we get the following equations
+% \begin{align*}
+% L^\top_{n+1,i*} &\leftarrow (R^{(i)}_n {[R^{(i)}_n]}^\top + \lambda N_2 I)^{-1} R_n V^\top_{i*}, \\
+% R_{n+1,*j} &\leftarrow ({[L^{(j)}_{n+1}]}^\top L^{(j)}_{n+1} + \lambda N_1 I)^{-1} L^\top_{n+1} V_{*j}, 
+% \end{align*}
+% where $V_{i*}$ (resp. $V_{*j}$) denotes the revealed entries in row $i$ (column $j$), 
+% $R^{(i)}_n$ (resp. $L^{(j)}_{n+1}$) refers to the corresponding columns of $R_n$ (rows of $L_{n+1}$), 
+% and $N_1$ (resp. $N_2$) denotes a diagonal matrix that contains the number of nonzero entries in row $i$ (column $j$) of $V$.
+ 
+ \textbf{Prediction.} 
+ Based on the factor matrices computed by ALS we provide two prediction scripts:   
+ \begin{Enumerate}
+ 	\item {\tt ALS\_predict.dml} computes the predicted ratings for a given list of users and items;
+ 	\item {\tt ALS\_topk\_predict.dml} computes top-K item (where $K$ is given as input) with highest predicted ratings together with their corresponding ratings for a given list of users.
+ \end{Enumerate} 
+  
+ \smallskip
+ \noindent{\bf Returns}
+ \smallskip
+ 
+ We output the factor matrices $L$ and $R$ after the algorithm has converged. The algorithm is declared converged if one of the following two criteria is met: 
+ (1) the decrease in the value of the loss function falls below the threshold {\tt thr}
+ given as an input parameter (if parameter {\tt check=TRUE}), or (2) the maximum number of iterations (defined by parameter {\tt maxi}) is reached. 
+ Note that for a given user $i$ prediction is possible only if user $i$ has rated at least one item, i.e., row $i$ in matrix $V$ has at least one nonzero entry. 
+ If some users have not rated any items, the corresponding factors in $L$ will contain only 0s.
+ Similarly, if some items have not been rated at all, the corresponding factors in $R$ will contain only 0s. 
+ Our prediction scripts output the predicted ratings for a given list of users and items, as well as the top-K items with the highest predicted ratings (together with these ratings) for a given list of users. Note that predictions are only provided for users who have rated at least one item, i.e., for users whose corresponding rows in $V$ contain at least one nonzero entry. 
+% Moreover in the case of top-K prediction, if the number of predicted ratings---i.e., missing entries--- for some user $i$ is less than the input parameter $K$, all the predicted ratings for user $i$ will be provided.
+
+ 
+
+ 
+ 
+  
+ \smallskip
+ \noindent{\bf Examples}
+ \smallskip
+  
+% {\hangindent=\parindent\noindent\tt
+% 	\hml -f ALS.dml -nvargs V=/user/biadmin/V L=/user/biadmin/L R=/user/biadmin/R rank=10 reg="L2" lambda=0.0001 fmt=csv 
+% 		
+% }
+  
+ {\hangindent=\parindent\noindent\tt
+ 	\hml -f ALS.dml -nvargs V=/user/biadmin/V L=/user/biadmin/L R=/user/biadmin/R rank=10 reg="wL2" lambda=0.0001 maxi=50 check=TRUE thr=0.001 fmt=csv	
+ 	
+ }
+ 
+ \noindent To compute predicted ratings for a given list of users and items:
+ 
+ {\hangindent=\parindent\noindent\tt
+  	\hml -f ALS\_predict.dml -nvargs X=/user/biadmin/X Y=/user/biadmin/Y L=/user/biadmin/L R=/user/biadmin/R  Vrows=100000 Vcols=10000 fmt=csv	
+  	
+ }
+  
+ \noindent To compute top-K items with highest predicted ratings together with the predicted ratings for a given list of users:
+ 
+ {\hangindent=\parindent\noindent\tt
+   	\hml -f ALS\_topk\_predict.dml -nvargs X=/user/biadmin/X Y=/user/biadmin/Y L=/user/biadmin/L R=/user/biadmin/R V=/user/biadmin/V K=10 fmt=csv	
+   	
+ }
+
+
+%
+%\begin{itemize}
+%	\item Y. Zhou, D. K. Wilkinson, R. Schreiber, and R. Pan. \newblock{Large-scale parallel collaborative flitering for the Netflix prize}. In Proceedings of the International
+%	Conference on Algorithmic Aspects in Information and Management (AAIM), 2008, 337-348.
+%\end{itemize}
+ 
+ 
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/BinarySVM.tex
----------------------------------------------------------------------
diff --git a/alg-ref/BinarySVM.tex b/alg-ref/BinarySVM.tex
new file mode 100644
index 0000000..7ff5b06
--- /dev/null
+++ b/alg-ref/BinarySVM.tex
@@ -0,0 +1,175 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsubsection{Binary-class Support Vector Machines}
+\label{l2svm}
+
+\noindent{\bf Description}
+
+Support Vector Machines are used to model the relationship between a categorical 
+dependent variable y and one or more explanatory variables denoted X. This 
+implementation learns (and predicts with) a binary class support vector machine 
+(y with domain size 2).
+\\
+
+\noindent{\bf Usage}
+
+\begin{tabbing}
+\texttt{-f} \textit{path}/\texttt{l2-svm.dml -nvargs} 
+\=\texttt{X=}\textit{path}/\textit{file} 
+  \texttt{Y=}\textit{path}/\textit{file} 
+  \texttt{icpt=}\textit{int} 
+  \texttt{tol=}\textit{double}\\
+\>\texttt{reg=}\textit{double} 
+  \texttt{maxiter=}\textit{int} 
+  \texttt{model=}\textit{path}/\textit{file}\\
+\>\texttt{Log=}\textit{path}/\textit{file}
+  \texttt{fmt=}\textit{csv}$\vert$\textit{text}
+\end{tabbing}
+
+\begin{tabbing}
+\texttt{-f} \textit{path}/\texttt{l2-svm-predict.dml -nvargs} 
+\=\texttt{X=}\textit{path}/\textit{file} 
+  \texttt{Y=}\textit{path}/\textit{file} 
+  \texttt{icpt=}\textit{int} 
+  \texttt{model=}\textit{path}/\textit{file}\\
+\>\texttt{scores=}\textit{path}/\textit{file}
+  \texttt{accuracy=}\textit{path}/\textit{file}\\
+\>\texttt{confusion=}\textit{path}/\textit{file}
+  \texttt{fmt=}\textit{csv}$\vert$\textit{text}
+\end{tabbing}
+
+%%\begin{verbatim}
+%%-f path/l2-svm.dml -nvargs X=path/file Y=path/file icpt=int tol=double
+%%                      reg=double maxiter=int model=path/file
+%%\end{verbatim}
+
+\noindent{\bf Arguments}
+
+\begin{itemize}
+\item X: Location (on HDFS) to read the matrix of feature vectors; 
+each row constitutes one feature vector.
+\item Y: Location to read the one-column matrix of (categorical) 
+labels that correspond to feature vectors in X. Binary class labels 
+can be expressed in one of two choices: $\pm 1$ or $1/2$ (i.e., labels 1 and 2). Note that
+this argument is optional for prediction.
+\item icpt (default: {\tt 0}): If set to 1 then a constant bias column is 
+added to X. 
+\item tol (default: {\tt 0.001}): Procedure terminates early if the reduction
+in objective function value is less than tolerance times the initial objective
+function value.
+\item reg (default: {\tt 1}): Regularization constant. See details to find 
+out where lambda appears in the objective function. If one were interested 
+in drawing an analogy with the C parameter in C-SVM, then C = 2/lambda. 
+Usually, cross validation is employed to determine the optimum value of 
+lambda.
+\item maxiter (default: {\tt 100}): The maximum number of iterations.
+\item model: Location (on HDFS) that contains the learnt weights.
+\item Log: Location (on HDFS) to collect various metrics (e.g., objective 
+function value etc.) that depict progress across iterations while training.
+\item fmt (default: {\tt text}): Specifies the output format. Choice of 
+comma-separated values (csv) or as a sparse-matrix (text).
+\item scores: Location (on HDFS) to store scores for a held-out test set.
+Note that this is an optional argument.
+\item accuracy: Location (on HDFS) to store the accuracy computed on a
+held-out test set. Note that this is an optional argument.
+\item confusion: Location (on HDFS) to store the confusion matrix
+computed using a held-out test set. Note that this is an optional 
+argument.
+\end{itemize}
+
+\noindent{\bf Details}
+
+Support vector machines learn a classification function by solving the
+following optimization problem ($L_2$-SVM):
+\begin{eqnarray*}
+&\textrm{argmin}_w& \frac{\lambda}{2} ||w||_2^2 + \sum_i \xi_i^2\\
+&\textrm{subject to:}& y_i w^{\top} x_i \geq 1 - \xi_i ~ \forall i
+\end{eqnarray*}
+where $x_i$ is an example from the training set with its label given by $y_i$, 
+$w$ is the vector of parameters and $\lambda$ is the regularization constant 
+specified by the user.
+
+To account for the missing bias term, one may augment the data with a column
+of constants, which is achieved by setting the intercept argument {\tt icpt} to 1 (C-J Hsieh 
+et al, 2008).
+
+This implementation optimizes the primal directly (Chapelle, 2007). It uses 
+nonlinear conjugate gradient descent to minimize the objective function 
+coupled with choosing step-sizes by performing one-dimensional Newton 
+minimization in the direction of the gradient.
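+
+For intuition (a standard identity implied by the squared-hinge objective above, not
+text taken from {\tt l2-svm.dml}), writing $\xi_i = \max(0,\, 1 - y_i w^{\top} x_i)$
+the gradient of the objective with respect to $w$ is
+\begin{equation*}
+\lambda w - 2 \sum_{i:\; y_i w^{\top} x_i < 1} \bigl(1 - y_i w^{\top} x_i\bigr)\, y_i x_i,
+\end{equation*}
+so only the examples that currently violate the margin contribute to each search direction.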
+\\
+
+\noindent{\bf Returns}
+
+The learnt weights produced by l2-svm.dml are populated into a single column matrix 
+and written to file on HDFS (see model in section Arguments). The number of rows in 
+this matrix is ncol(X) if intercept was set to 0 during invocation and ncol(X) + 1 
+otherwise. The bias term, if used, is placed in the last row. Depending on what arguments
+are provided during invocation, l2-svm-predict.dml may compute one or more of scores, 
+accuracy and confusion matrix in the output format specified. 
+\\
+
+%%\noindent{\bf See Also}
+%%
+%%In case of multi-class classification problems (y with domain size greater than 2), 
+%%please consider using a multi-class classifier learning algorithm, e.g., multi-class
+%%support vector machines (see Section \ref{msvm}). To model the relationship between 
+%%a scalar dependent variable y and one or more explanatory variables X, consider 
+%%Linear Regression instead (see Section \ref{linreg-solver} or Section 
+%%\ref{linreg-iterative}).
+%%\\
+%%
+\noindent{\bf Examples}
+
+\begin{verbatim}
+hadoop jar SystemML.jar -f l2-svm.dml -nvargs X=/user/biadmin/X.mtx 
+                                              Y=/user/biadmin/y.mtx 
+                                              icpt=0 tol=0.001 fmt=csv
+                                              reg=1.0 maxiter=100 
+                                              model=/user/biadmin/weights.csv
+                                              Log=/user/biadmin/Log.csv
+\end{verbatim}
+
+\begin{verbatim}
+hadoop jar SystemML.jar -f l2-svm-predict.dml -nvargs X=/user/biadmin/X.mtx 
+                                                      Y=/user/biadmin/y.mtx 
+                                                      icpt=0 fmt=csv
+                                                      model=/user/biadmin/weights.csv
+                                                      scores=/user/biadmin/scores.csv
+                                                      accuracy=/user/biadmin/accuracy.csv
+                                                      confusion=/user/biadmin/confusion.csv
+\end{verbatim}
+
+\noindent{\bf References}
+
+\begin{itemize}
+\item W. T. Vetterling and B. P. Flannery. \newblock{\em Conjugate Gradient Methods in Multidimensions in 
+Numerical Recipes in C - The Art in Scientific Computing}. \newblock W. H. Press and S. A. Teukolsky
+(eds.), Cambridge University Press, 1992.
+\item J. Nocedal and  S. J. Wright. Numerical Optimization, Springer-Verlag, 1999.
+\item C-J Hsieh, K-W Chang, C-J Lin, S. S. Keerthi and S. Sundararajan. \newblock{\em A Dual Coordinate 
+Descent Method for Large-scale Linear SVM.} \newblock International Conference of Machine Learning
+(ICML), 2008.
+\item Olivier Chapelle. \newblock{\em Training a Support Vector Machine in the Primal}. \newblock Neural 
+Computation, 2007.
+\end{itemize}
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/Cox.tex
----------------------------------------------------------------------
diff --git a/alg-ref/Cox.tex b/alg-ref/Cox.tex
new file mode 100644
index 0000000..a355df7
--- /dev/null
+++ b/alg-ref/Cox.tex
@@ -0,0 +1,340 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Cox Proportional Hazard Regression Model}
+
+\noindent{\bf Description}
+\smallskip
+
+
+The Cox proportional hazard (PH) regression model is a semi-parametric statistical approach commonly used for analyzing survival data.
+Unlike non-parametric approaches, e.g., the Kaplan-Meier estimates (Section \ref{sec:kaplan-meier}), which can be used to analyze a single sample of survival data or to compare between groups of survival times, the Cox PH model describes the dependency of the survival times on the values of {\it explanatory variables} (i.e., covariates) recorded for each individual at the time origin. Our focus is on covariates that do not change value over time, i.e., time-independent covariates, and that may be categorical (ordinal or nominal) as well as continuous-valued. \\  
+
+
+\smallskip
+\noindent{\bf Usage}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}Cox.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} TE=}path/file
+{\tt{} F=}path/file
+{\tt{} R=}path/file
+{\tt{} M=}path/file
+{\tt{} S=}path/file
+{\tt{} T=}path/file
+{\tt{} COV=}path/file
+{\tt{} RT=}path/file
+{\tt{} XO=}path/file
+{\tt{} MF=}path/file
+{\tt{} alpha=}double
+{\tt{} fmt=}format
+
+}
+
+\smallskip
+\noindent{\bf Arguments --- Model Fitting/Prediction}
+\begin{Description}
+\item[{\tt X}:]
+Location (on HDFS) to read the input matrix of the survival data containing: 
+\begin{Itemize}
+	\item timestamps,
+	\item whether event occurred (1) or data is censored (0),
+	\item feature vectors
+\end{Itemize}
+\item[{\tt Y}:]
+Location (on HDFS) to read the matrix used for prediction 
+\item[{\tt TE}:]
+Location (on HDFS) to read the 1-column matrix $TE$ that contains the column indices of the input matrix $X$ corresponding to timestamps (first entry) and event information (second entry)
+\item[{\tt F}:]
+Location (on HDFS) to read the 1-column matrix $F$ that contains the column indices of the input matrix $X$ corresponding to the features to be used for fitting the Cox model
+\item[{\tt R}:] (default:\mbox{ }{\tt " "})
+If factors (i.e., categorical features) are available in the input matrix $X$, location (on HDFS) to read matrix $R$ containing the start (first column) and end (second column) indices of each factor in $X$;
+alternatively, user can specify the indices of the baseline level of each factor which needs to be removed from $X$. If $R$ is not provided by default all variables are considered to be continuous-valued.
+\item[{\tt M}:]							
+Location (on HDFS) to store the results of Cox regression analysis including regression coefficients $\beta_j$s, their standard errors, confidence intervals, and $P$-values  
+\item[{\tt S}:] (default:\mbox{ }{\tt " "})
+Location (on HDFS) to store a summary of some statistics of the fitted model including number of records, number of events, log-likelihood, AIC, Rsquare (Cox \& Snell), and maximum possible Rsquare 
+\item[{\tt T}:] (default:\mbox{ }{\tt " "})
+Location (on HDFS) to store the results of Likelihood ratio test, Wald test, and Score (log-rank) test of the fitted model
+\item[{\tt COV}:]
+Location (on HDFS) to store the variance-covariance matrix of $\beta_j$s; note that parameter {\tt COV} needs to be provided as input for prediction.
+\item[{\tt RT}:]
+Location (on HDFS) to store matrix $RT$ containing the order-preserving recoded timestamps from $X$; note that parameter {\tt RT} needs to be provided as input for prediction.
+\item[{\tt XO}:]
+Location (on HDFS) to store the input matrix $X$ ordered by the timestamps; note that parameter {\tt XO} needs to be provided as input for prediction.
+\item[{\tt MF}:]
+Location (on HDFS) to store column indices of $X$ excluding the baseline factors if available; note that parameter {\tt MF} needs to be provided as input for prediction.
+\item[{\tt P}] 
+Location (on HDFS) to store matrix $P$ containing the results of prediction
+\item[{\tt alpha}](default:\mbox{ }{\tt 0.05})
+Parameter to compute a $100(1-\alpha)\%$ confidence interval for $\beta_j$s 
+\item[{\tt tol}](default:\mbox{ }{\tt 0.000001})
+Tolerance (epsilon) used in the convergence criterion
+\item[{\tt moi}:] (default:\mbox{ }{\tt 100})
+Maximum number of outer (Fisher scoring) iterations
+\item[{\tt mii}:] (default:\mbox{ }{\tt 0})
+Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum
+limit provided
+\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+see read/write functions in SystemML Language Reference for details.
+\end{Description}
+
+
+ \smallskip
+ \noindent{\bf Usage: Cox Prediction}
+ \smallskip
+ 
+ {\hangindent=\parindent\noindent\it%
+ 	{\tt{}-f }path/\/{\tt{}Cox-predict.dml}
+ 	{\tt{} -nvargs}
+ 	{\tt{} X=}path/file
+ 	{\tt{} RT=}path/file
+ 	{\tt{} M=}path/file
+ 	{\tt{} Y=}path/file
+ 	{\tt{} COV=}path/file
+ 	{\tt{} MF=}path/file
+ 	{\tt{} P=}path/file
+ 	{\tt{} fmt=}format
+ 	
+ }\smallskip
+ 
+% \noindent{\bf Arguments --- Prediction}
+% \begin{Description}
+% 	\item[{\tt X}:]
+%	Location (on HDFS) to read the input matrix of the survival data sorted by the timestamps including: 
+%	\begin{Itemize}
+%		\item timestamps,
+%		\item whether event occurred (1) or data is censored (0),
+%		\item feature vectors
+%	\end{Itemize}
+% 	\item[{\tt RT}:]
+% 	Location to read column matrix $RT$ containing the (order preserving) recoded timestamps from X (output by {\tt Cox.dml})
+% 	\item[{\tt M}:]
+% 	Location to read matrix $M$ containing the fitted Cox model (see below for the schema) 
+% 	\item[{\tt Y}:]
+%	Location to the read matrix used for prediction    
+% 	\item[{\tt COV}:] 
+% 	Location to read the variance-covariance matrix of the regression coefficients (output by {\tt Cox.dml})
+% 	\item[{\tt MF}] 
+% 	Location to store column indices of $X$ excluding the baseline factors if available (output by {\tt Cox.dml})
+% 	\item[{\tt P}] 
+% 	Location to store matrix $P$ containing the results of prediction
+% 	\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+% 	Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}
+% \end{Description}
+ 
+
+
+\noindent{\bf Details}
+\smallskip
+
+ 
+In the Cox PH regression model the relationship between the hazard function---i.e., the probability of event occurrence at a given time---and the covariates is described as
+\begin{equation}
+h_i(t)=h_0(t)\exp\Bigl\{ \sum_{j=1}^{p} \beta_jx_{ij} \Bigr\}, \label{eq:coxph}
+\end{equation} 
+where the hazard function for the $i$th individual ($i\in\{1,2,\ldots,n\}$) depends on a set of $p$ covariates $x_i=(x_{i1},x_{i2},\ldots,x_{ip})$, whose importance is measured by the magnitude of the corresponding coefficients 
+$\beta=(\beta_1,\beta_2,\ldots,\beta_p)$. The term $h_0(t)$ is the baseline hazard and is related to a hazard value if all covariates equal 0. 
+In the Cox PH model the hazard function for the individuals may vary over time, however the baseline hazard is estimated non-parametrically and can take any form.
+Note that re-writing~(\ref{eq:coxph}) we have 
+\begin{equation*}
+\log\biggl\{ \frac{h_i(t)}{h_0(t)} \biggr\} = \sum_{j=1}^{p} \beta_jx_{ij}.
+\end{equation*}
+Thus, the Cox PH model is essentially a linear model for the logarithm of the hazard ratio and the hazard of event for any individual is a constant multiple of the hazard of any other. 
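+In particular, $\exp(\beta_j)$ can be read as the multiplicative change in the hazard associated with a one-unit increase in covariate $x_{ij}$, all other covariates held fixed, which is why $\exp(\hat{\beta})$ is reported in the output matrix {\tt M} described below.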
+%Consequently, the Cox model is a proportional hazard model.
+We follow similar notation and methodology as in~\cite[Sec.~3]{collett2003:kaplanmeier}.
+For completeness we briefly discuss the equations used in our implementation.
+
+
+\textbf{Factors in the model.} 
+Note that if some of the feature variables are factors they need to be {\it dummy coded} as follows. 
+Let $\alpha$ be such a variable (i.e., a factor) with $a$ levels. 
+We introduce $a-1$ indicator (or dummy coded) variables $X_2,X_3\ldots,X_a$ with $X_j=1$ if $\alpha=j$ and 0 otherwise, for $j\in\{ 2,3,\ldots,a\}$.
+In particular, one of $a$ levels of $\alpha$ will be considered as the baseline and is not included in the model.
+In our implementation, the user can specify a baseline level for each factor (as the choice of the baseline level for each factor is arbitrary). 
+On the other hand, if for a given factor $\alpha$ no baseline is specified by the user, the most frequent level of $\alpha$ will be considered as the baseline.   
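+For example, a factor with $a=3$ levels and baseline level 1 is replaced by two indicator variables $X_2$ and $X_3$, where $X_2=1$ if and only if $\alpha=2$ and $X_3=1$ if and only if $\alpha=3$.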
+
+
+\textbf{Fitting the model.}
+We estimate the coefficients of the Cox model via the negative log-likelihood method.
+In particular, the Cox PH model is fitted using a trust-region Newton method with conjugate gradient~\cite{Nocedal2006:Optimization}.
+%The likelihood for the PH hazard model is given by
+%\begin{equation*}
+%\prod_{i=1}^{n} {\Bigg\{ \frac{\exp(\vec{\beta}^\top\vec{x_i})}{\sum_{l\in %R(t_i)\exp(\vec{\beta}\vec{x}_l)}} \Biggr\}}^\delta_i,
+%\end{equation*}
+%where $\delta_i$ is an event indicator, which is 0 if the $i$th survival time is censored or 1 otherwise, and $R(t_i)$ is the risk set defined as the set of individuals who die at time $t_i$ or later.
+Define the risk set $R(t_j)$ at time $t_j$ to be the set of individuals who die at time $t_j$ or later. 
+The PH model assumes that survival times are distinct. In order to handle tied observations
+we use the \emph{Breslow} approximation of the likelihood function
+\begin{equation*}
+\mathcal{L}=\prod_{j=1}^{r} \frac{\exp(\beta^\top s_j)}{{\bigg\{ \sum_{l\in R(t_j)} \exp(\beta^\top x_l) \biggr\}}^{d_j}},
+\end{equation*}
+where $d_j$ is the number of individuals who die at time $t_j$ and $s_j$ denotes the element-wise sum of the covariates for those individuals who die at time $t_j$, $j=1,2,\ldots,r$, i.e.,
+the $h$th element of $s_j$ is given by $s_{hj}=\sum_{k=1}^{d_j}x_{hjk}$, where $x_{hjk}$ is the value of $h$th variable ($h\in \{1,2,\ldots,p\}$) for the $k$th of the $d_j$ individuals ($k\in\{ 1,2,\ldots,d_j \}$) who die at the $j$th death time ($j\in\{ 1,2,\ldots,r \}$).  
+
+\textbf{Standard error and confidence interval for coefficients.}
+Note that the variance-covariance matrix of the estimated coefficients $\hat{\beta}$ can be approximated by the inverse of the Hessian evaluated at $\hat{\beta}$. The square roots of the diagonal elements of this matrix are the standard errors of the estimated coefficients.  
+Once the standard errors of the coefficients $se(\hat{\beta})$ are obtained, we can compute a $100(1-\alpha)\%$ confidence interval using $\hat{\beta}\pm z_{\alpha/2}se(\hat{\beta})$, where $z_{\alpha/2}$ is the upper $\alpha/2$-point of the standard normal distribution.
+In {\tt Cox.dml}, we utilize the built-in function {\tt inv()} to compute the inverse of the Hessian. Note that this built-in function can be used only if the Hessian fits in the main memory of a single machine.   
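+For example, with the default {\tt alpha=0.05} we have $z_{\alpha/2}\approx 1.96$, so the reported interval for each coefficient is simply $\hat{\beta}\pm 1.96\,se(\hat{\beta})$.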
+
+
+\textbf{Wald test, likelihood ratio test, and log-rank test.}
+In order to test the {\it null hypothesis} that all of the coefficients $\beta_j$s are 0, our implementation provides three statistical tests: the {\it Wald test}, the {\it likelihood ratio test}, and the {\it log-rank test} (also known as the {\it score test}). 
+Let $p$ be the number of coefficients.
+The Wald test is based on the test statistic ${\hat{\beta}}^2/{se(\hat{\beta})}^2$, which is compared to percentage points of the Chi-squared distribution to obtain the $P$-value.
+The likelihood ratio test relies on the test statistic $-2\log\{ {L}(\textbf{0})/{L}(\hat{\beta}) \}$ ($\textbf{0}$ denotes a zero vector of size $p$ ) which has an approximate Chi-squared distribution with $p$ degrees of freedom under the null hypothesis that all $\beta_j$s are 0.
+The Log-rank test is based on the test statistic 
+$l=\nabla^\top L(\textbf{0}) {\mathcal{H}}^{-1}(\textbf{0}) \nabla L(\textbf{0})$, 
+where $\nabla L(\textbf{0})$ is the gradient of $L$ and $\mathcal{H}(\textbf{0})$ is the Hessian of $L$ evaluated at \textbf{0}. Under the null hypothesis that $\beta=\textbf{0}$, $l$ has a Chi-squared distribution on $p$ degrees of freedom.
+
+
+% Scoring
+\textbf{Prediction.}
+Once the parameters of the model are fitted, we compute the following predictions together with their standard errors
+\begin{itemize}
+	\item linear predictors,
+	\item risk, and
+	\item estimated cumulative hazard. 
+\end{itemize}
+Given feature vector $X_i$ for individual $i$, we obtain the above predictions at time $t$ as follows.
+The linear predictors (denoted as $\mathcal{LP}$) as well as the risk (denoted as $\mathcal{R}$) are computed relative to a baseline whose feature values are the mean of the values in the corresponding features.
+Let $X_i^\text{rel} = X_i - \mu$, where $\mu$ is a row vector that contains the mean values for each feature.  
+We have  $\mathcal{LP}=X_i^\text{rel} \hat{\beta}$ and $\mathcal{R}=\exp\{ X_i^\text{rel}\hat{\beta} \}$.
+The standard errors of the linear predictors $se\{\mathcal{LP} \}$ are computed as the square root of ${(X_i^\text{rel})}^\top V(\hat{\beta}) X_i^\text{rel}$ and the standard error of the risk $se\{ \mathcal{R} \}$ are given by the square root of 
+${(X_i^\text{rel} \odot \mathcal{R})}^\top V(\hat{\beta}) (X_i^\text{rel} \odot \mathcal{R})$, where $V(\hat{\beta})$ is the variance-covariance matrix of the coefficients and $\odot$ is the element-wise multiplication.     
+
+We estimate the cumulative hazard function for individual $i$ by
+\begin{equation*}
+\hat{H}_i(t) = \exp(\hat{\beta}^\top X_i) \hat{H}_0(t), 
+\end{equation*}
+where $\hat{H}_0(t)$ is the \emph{Breslow estimate} of the cumulative baseline hazard given by
+\begin{equation*}
+\hat{H}_0(t) = \sum_{j=1}^{k} \frac{d_j}{\sum_{l\in R(t_{(j)})} \exp(\hat{\beta}^\top X_l)}.
+\end{equation*}
+In the equation above, as before, $d_j$ is the number of deaths, and $R(t_{(j)})$ is the risk set at time $t_{(j)}$, for $t_{(k)} \leq t \leq t_{(k+1)}$, $k=1,2,\ldots,r-1$.
+The standard error of $\hat{H}_i(t)$ is obtained using the estimation
+\begin{equation*}
+se\{ \hat{H}_i(t) \} = \sum_{j=1}^{k} \frac{d_j}{ {\left[ \sum_{l\in R(t_{(j)})} \exp(X_l\hat{\beta}) \right]}^2 } + J_i^\top(t) V(\hat{\beta}) J_i(t),
+\end{equation*}
+where 
+\begin{equation*}
+J_i(t) = \sum_{j-1}^{k} d_j \frac{\sum_{l\in R(t_{(j)})} (X_l-X_i)\exp \{ (X_l-X_i)\hat{\beta} \}}{ {\left[ \sum_{l\in R(t_{(j)})} \exp\{(X_l-X_i)\hat{\beta}\} \right]}^2  },
+\end{equation*}
+for $t_{(k)} \leq t \leq t_{(k+1)}$, $k=1,2,\ldots,r-1$. 
+
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+  
+Below we list the results of fitting a Cox regression model, stored in matrix {\tt M} with the following schema:
+\begin{itemize}
+	\item Column 1: estimated regression coefficients $\hat{\beta}$
+	\item Column 2: $\exp(\hat{\beta})$
+	\item Column 3: standard error of the estimated coefficients $se\{\hat{\beta}\}$
+	\item Column 4: ratio of $\hat{\beta}$ to $se\{\hat{\beta}\}$ denoted by $Z$  
+	\item Column 5: $P$-value of $Z$ 
+	\item Column 6: lower bound of $100(1-\alpha)\%$ confidence interval for $\hat{\beta}$
+	\item Column 7: upper bound of $100(1-\alpha)\%$ confidence interval for $\hat{\beta}$.
+\end{itemize}
+Note that above $Z$ is the Wald test statistic which is asymptotically standard normal under the hypothesis that $\beta=\textbf{0}$.
+
+Moreover, {\tt Cox.dml} outputs two log files {\tt S} and {\tt T} containing summary statistics of the fitted model, as follows.
+File {\tt S} stores the following information 
+\begin{itemize}
+	\item Line 1: total number of observations
+	\item Line 2: total number of events
+	\item Line 3: log-likelihood (of the fitted model)
+	\item Line 4: AIC
+	\item Line 5: Cox \& Snell Rsquare
+	\item Line 6: maximum possible Rsquare. 
+\end{itemize}
+Above, the AIC is computed as in (\ref{eq:AIC}),
+the Cox \& Snell Rsquare is equal to $1-\exp\{ -l/n \}$, where $l$ is the log-rank test statistic as discussed above and $n$ is the total number of observations,
+and the maximum possible Rsquare is computed as $1-\exp\{ -2 L(\textbf{0})/n \}$, where $L(\textbf{0})$ denotes the initial likelihood. 
+
+
+File {\tt T} contains the following information
+\begin{itemize}
+	\item Line 1: Likelihood ratio test statistic, degree of freedom of the corresponding Chi-squared distribution, $P$-value
+	\item Line 2: Wald test statistic, degree of freedom of the corresponding Chi-squared distribution, $P$-value
+	\item Line 3: Score (log-rank) test statistic, degree of freedom of the corresponding Chi-squared distribution, $P$-value.
+\end{itemize}
+
+Additionally, the following matrices will be stored. Note that these matrices are required for prediction.
+\begin{itemize}
+	 \item Order-preserving recoded timestamps $RT$, i.e., contiguously numbered from 1 $\ldots$ \#timestamps
+	 \item Feature matrix ordered by the timestamps $XO$
+	 \item Variance-covariance matrix of the coefficients $COV$
+	 \item Column indices of the feature matrix with baseline factors removed (if available) $MF$.  
+\end{itemize}
+
+
+\textbf{Prediction.}
+Finally, the results of prediction are stored in matrix $P$ with the following schema
+\begin{itemize}
+	\item Column 1: linear predictors
+	\item Column 2: standard error of the linear predictors
+	\item Column 3: risk
+	\item Column 4: standard error of the risk
+	\item Column 5: estimated cumulative hazard
+	\item Column 6: standard error of the estimated cumulative hazard.
+\end{itemize}
+
+
+
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+	\hml -f Cox.dml -nvargs X=/user/biadmin/X.mtx TE=/user/biadmin/TE
+	F=/user/biadmin/F R=/user/biadmin/R M=/user/biadmin/model.csv
+	T=/user/biadmin/test.csv COV=/user/biadmin/var-covar.csv XO=/user/biadmin/X-sorted.mtx fmt=csv
+	
+}\smallskip
+
+{\hangindent=\parindent\noindent\tt
+	\hml -f Cox.dml -nvargs X=/user/biadmin/X.mtx TE=/user/biadmin/TE
+	F=/user/biadmin/F R=/user/biadmin/R M=/user/biadmin/model.csv
+	T=/user/biadmin/test.csv COV=/user/biadmin/var-covar.csv 
+	RT=/user/biadmin/recoded-timestamps.csv XO=/user/biadmin/X-sorted.csv 
+	MF=/user/biadmin/baseline.csv alpha=0.01 tol=0.000001 moi=100 mii=20 fmt=csv
+	
+}\smallskip
+
+\noindent To compute predictions:
+
+{\hangindent=\parindent\noindent\tt
+	\hml -f Cox-predict.dml -nvargs X=/user/biadmin/X-sorted.mtx 
+	RT=/user/biadmin/recoded-timestamps.csv
+	M=/user/biadmin/model.csv Y=/user/biadmin/Y.mtx COV=/user/biadmin/var-covar.csv 
+	MF=/user/biadmin/baseline.csv P=/user/biadmin/predictions.csv fmt=csv
+	
+}
+
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/DecisionTrees.tex
----------------------------------------------------------------------
diff --git a/alg-ref/DecisionTrees.tex b/alg-ref/DecisionTrees.tex
new file mode 100644
index 0000000..cea26a4
--- /dev/null
+++ b/alg-ref/DecisionTrees.tex
@@ -0,0 +1,312 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Decision Trees}
+\label{sec:decision_trees}
+
+\noindent{\bf Description}
+\smallskip
+
+
+A decision tree (for classification) is a classifier that is generally considered
+more interpretable than other statistical classifiers. This implementation
+is well suited to large-scale data and builds a (binary) decision 
+tree in parallel.\\
+
+\smallskip
+\noindent{\bf Usage}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+	{\tt{}-f }path/\/{\tt{}decision-tree.dml}
+	{\tt{} -nvargs}
+	{\tt{} X=}path/file
+	{\tt{} Y=}path/file
+	{\tt{} R=}path/file
+	{\tt{} bins=}integer
+	{\tt{} depth=}integer
+	{\tt{} num\_leaf=}integer
+	{\tt{} num\_samples=}integer
+	{\tt{} impurity=}Gini$\mid$entropy
+	{\tt{} M=}path/file
+	{\tt{} O=}path/file
+	{\tt{} S\_map=}path/file
+	{\tt{} C\_map=}path/file
+	{\tt{} fmt=}format
+	
+}
+
+ \smallskip
+ \noindent{\bf Usage: Prediction}
+ \smallskip
+ 
+ {\hangindent=\parindent\noindent\it%
+ 	{\tt{}-f }path/\/{\tt{}decision-tree-predict.dml}
+ 	{\tt{} -nvargs}
+ 	{\tt{} X=}path/file
+ 	{\tt{} Y=}path/file
+ 	{\tt{} R=}path/file
+ 	{\tt{} M=}path/file
+ 	{\tt{} P=}path/file
+ 	{\tt{} A=}path/file
+ 	{\tt{} CM=}path/file
+ 	{\tt{} fmt=}format
+ 	
+ }\smallskip
+ 
+ 
+\noindent{\bf Arguments}
+\begin{Description}
+	\item[{\tt X}:]
+	Location (on HDFS) to read the matrix of feature vectors; 
+	each row constitutes one feature vector. Note that categorical features in $X$ need to be both recoded and dummy coded.
+	\item[{\tt Y}:]
+	Location (on HDFS) to read the matrix of (categorical) 
+	labels that correspond to feature vectors in $X$. Note that class labels are assumed to be both recoded and dummy coded. 
+	This argument is optional for prediction. 
+	\item[{\tt R}:] (default:\mbox{ }{\tt " "})
+	Location (on HDFS) to read matrix $R$ which for each feature in $X$ contains column-ids (first column), start indices (second column), and end indices (third column).
+	If $R$ is not provided by default all features are assumed to be continuous-valued.   
+	\item[{\tt bins}:] (default:\mbox{ }{\tt 20})
+	Number of thresholds to choose for each continuous-valued feature (determined by equi-height binning). 
+	\item[{\tt depth}:] (default:\mbox{ }{\tt 25})
+	Maximum depth of the learned tree
+	\item[{\tt num\_leaf}:] (default:\mbox{ }{\tt 10})
+	Parameter that controls pruning. The tree
+	is not expanded if a node receives less than {\tt num\_leaf} training examples.
+	\item[{\tt num\_samples}:] (default:\mbox{ }{\tt 3000})
+	Parameter that decides when to switch to in-memory building of subtrees. If a node $v$ receives less than {\tt num\_samples}
+	training examples then this implementation switches to an in-memory subtree
+	building procedure to build the subtree under $v$ in its entirety.
+	\item[{\tt impurity}:] (default:\mbox{ }{\tt "Gini"})
+	Impurity measure used at internal nodes of the tree for selecting which features to split on. Possible values are {\tt entropy} or {\tt Gini}.
+	\item[{\tt M}:] 
+	Location (on HDFS) to write matrix $M$ containing the learned decision tree (see below for the schema) 
+	\item[{\tt O}:] (default:\mbox{ }{\tt " "})
+	Location (on HDFS) to store the training accuracy (\%). Note that this argument is optional.
+	\item[{\tt A}:] (default:\mbox{ }{\tt " "})
+	Location (on HDFS) to store the testing accuracy (\%) from a 
+	held-out test set during prediction. Note that this argument is optional.
+	\item[{\tt P}:] 
+	Location (on HDFS) to store predictions for a held-out test set
+	\item[{\tt CM}:] (default:\mbox{ }{\tt " "})
+	Location (on HDFS) to store the confusion matrix computed using a held-out test set. Note that this argument is optional.
+	\item[{\tt S\_map}:] (default:\mbox{ }{\tt " "})
+	Location (on HDFS) to write the mappings from the continuous-valued feature-ids to the global feature-ids in $X$ (see below for details). Note that this argument is optional.
+	\item[{\tt C\_map}:] (default:\mbox{ }{\tt " "})
+	Location (on HDFS) to write the mappings from the categorical feature-ids to the global feature-ids in $X$ (see below for details). Note that this argument is optional.
+	\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+	Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+	see read/write functions in SystemML Language Reference for details.
+\end{Description}
+
+
+ \noindent{\bf Details}
+ \smallskip
+
+ 
+Decision trees~\cite{BreimanFOS84:dtree} are simple models of
+classification that,  due to their structure,  are easy to
+interpret. Given an example feature vector, each node in the learned
+tree runs a simple test on it. Based on the result of the test, the
+example is either diverted to the left subtree or to the right
+subtree. Once the example reaches a leaf, the label stored at the
+leaf is returned as the prediction for the example.
+
+
+Building a decision tree from a fully labeled training set entails
+choosing appropriate splitting tests for each internal node in the tree and this is usually performed in a top-down manner. 
+The splitting test (denoted by $s$) requires
+first choosing a feature $j$ and depending on the type of $j$, either
+a threshold $\sigma$, in case $j$ is continuous-valued, or a subset of
+values $S \subseteq \text{Dom}(j)$ where $\text{Dom}(j)$ denotes
+domain of $j$, in case it is categorical. For continuous-valued
+features the test is thus of form $x_j < \sigma$ and for categorical
+features it is of form $x_j \in S$, where $x_j$ denotes the $j$th
+feature value of feature vector $x$. One way to determine which test
+to include is to compare the impurities of the tree nodes induced by the test.
+The {\it node impurity} measures the homogeneity of the labels at the node. This implementation supports two commonly used impurity measures (denoted by $\mathcal{I}$): {\it Entropy} $\mathcal{E}=\sum_{i=1}^{C}-f_i \log f_i$, as well as {\it Gini impurity} $\mathcal{G}=\sum_{i=1}^{C}f_i (1-f_i)$, where $C$ denotes the number of unique labels and $f_i$ is the frequency of label $i$.
+Once the impurity at the tree nodes has been obtained, the {\it best split} is chosen from a set of possible splits that maximizes the {\it information gain} at the node, i.e., $\arg\max_{s}\mathcal{IG}(X,s)$, where $\mathcal{IG}(X,s)$ denotes the information gain when the splitting test $s$ partitions the feature matrix $X$. 
+Assuming that $s$ partitions $X$ that contains $N$ feature vectors into $X_\text{left}$ and $X_\text{right}$ each including $N_\text{left}$ and $N_\text{right}$ feature vectors, respectively, $\mathcal{IG}(X,s)$ is given by 
+\begin{equation*}
+\mathcal{IG}(X,s)=\mathcal{I}(X)-\frac{N_\text{left}}{N}\mathcal{I}(X_\text{left})-\frac{N_\text{right}}{N}\mathcal{I}(X_\text{right}),
+\end{equation*}
+where $\mathcal{I}\in\{\mathcal{E},\mathcal{G}\}$.
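+For example, a node holding $8$ examples with label frequencies $f=(0.5,0.5)$ has $\mathcal{G}=0.5$; if a candidate split sends $4$ examples with frequencies $(0.75,0.25)$ to each side, both children have $\mathcal{G}=0.375$ and the information gain is $0.5-\frac{4}{8}\cdot 0.375-\frac{4}{8}\cdot 0.375=0.125$.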
+In the following we discuss the implementation details specific to {\tt decision-tree.dml}. 
+
+
+\textbf{Input format.} 
+In general, implementations of the decision tree algorithm do not require categorical features to be dummy coded. For improved efficiency and reduced training time, however, our implementation assumes dummy-coded categorical features and dummy-coded class labels.  
+
+
+\textbf{Tree construction.}
+Learning a decision tree on large-scale data has received some
+attention in the literature. The current implementation includes logic
+for choosing tests for multiple nodes that belong to the same level in
+the decision tree in parallel (breadth-first expansion) and for
+building entire subtrees under multiple nodes in parallel (depth-first
+subtree building). Empirically it has been demonstrated that it is
+advantageous to perform breadth-first expansion for the nodes
+belonging to the top levels of the tree and to perform depth-first
+subtree building for nodes belonging to the lower levels of the tree~\cite{PandaHBB09:dtree}. The parameter {\tt num\_samples} controls when we
+switch to depth-first subtree building. For any node in the decision tree
+that receives $\leq$ {\tt num\_samples} training examples, the subtree
+under it is built in its entirety in one shot.
+
+
+\textbf{Stopping rule and pruning.} 
+The splitting of data at the internal nodes stops when at least one of the following criteria is satisfied:
+\begin{itemize}
+	\item the depth of the internal node reaches the input parameter {\tt depth} controlling the maximum depth of the learned tree, or
+	\item no candidate split achieves information gain.
+\end{itemize}
+This implementation also allows for some automated pruning via the argument {\tt num\_leaf}. If
+a node receives $\leq$ {\tt num\_leaf} training examples, then a leaf
+is built in its place.
+
+
+\textbf{Continuous-valued features.}
+For a continuous-valued feature
+$j$ the number of candidate thresholds $\sigma$ to choose from is of
+the order of the number of examples present in the training set. Since
+for large-scale data this can result in a large number of candidate
+thresholds, the user can limit this number via the argument {\tt bins}, which controls the number of candidate thresholds considered
+for each continuous-valued feature. For each continuous-valued
+feature, the implementation computes an equi-height histogram to
+generate one candidate threshold per equi-height bin.
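+With the default {\tt bins=20}, for example, at most $20$ candidate thresholds are evaluated per continuous-valued feature regardless of how many distinct values that feature takes in the training set.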
+
+
+\textbf{Categorical features.}
+In order to determine the best value subset to split on in the case of categorical features, this implementation greedily includes values from the feature's domain until the information gain stops improving.
+In particular, for a categorical feature $j$ the $|Dom(j)|$ feature values are sorted by impurity and the resulting $|Dom(j)|-1$ split candidates are examined; the sequence of feature values which results in the maximum information gain is then selected.
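+For example, if the four values of a categorical feature are sorted by impurity as $(c,a,d,b)$, the three candidate subsets $\{c\}$, $\{c,a\}$, and $\{c,a,d\}$ are examined and the one with the highest information gain is selected.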
+
+
+\textbf{Description of the model.} 
+The learned decision tree is represented in a matrix $M$ that
+contains at least 6 rows. Each column in the matrix contains the parameters relevant to a single node in the tree. 
+Note that for building the tree model, our implementation splits the feature matrix $X$ into $X_\text{cont}$ containing continuous-valued features and $X_\text{cat}$ containing categorical features. In the following, the continuous-valued (resp. categorical) feature-ids correspond to the indices of the features in $X_\text{cont}$ (resp. $X_\text{cat}$). 
+Moreover, we refer to an internal node as a continuous-valued (categorical) node if the feature that this node looks at is continuous-valued (categorical).
+Below is a description of what each row in the matrix contains.
+\begin{itemize}
+\item Row 1: stores the node-ids. These ids correspond to the node-ids in a complete binary tree.
+\item Row 2: for internal nodes, stores the offset (the number of columns) in $M$ to the left child; otherwise 0.
+\item Row 3: stores the feature index of the feature (id of a continuous-valued feature in $X_\text{cont}$ if the feature is continuous-valued or id of a categorical feature in $X_\text{cat}$ if the feature is categorical) that this node looks at if the node is an internal node, otherwise 0. 
+\item Row 4: stores the type of the feature that this node looks at if the node is an internal node: 1 for continuous-valued and 2 for categorical features, 
+otherwise the label this leaf node is supposed to predict.
+\item Row 5: for the internal nodes, contains 1 if the feature chosen for the node is continuous-valued, or the size of the subset of values used for splitting at the node (stored in rows 6,7,$\ldots$) if the feature chosen for the node is categorical. For the leaf nodes, Row 5 contains the number of misclassified training examples reaching this node. 
+\item Row 6,7,$\ldots$: for the internal nodes, row 6 stores the threshold to which the example's feature value is compared if the feature chosen for this node is continuous-valued; otherwise, if the feature chosen for this node is categorical, rows 6,7,$\ldots$ store the value subset chosen for the node.
+For the leaf nodes, row 6 contains 1 if the node is impure and the number of training examples at the node is greater than {\tt num\_leaf}, otherwise 0. 	
+\end{itemize}
+As an example, Figure~\ref{dtree} shows a decision tree with $5$ nodes and its matrix
+representation; a short illustrative sketch of how a feature vector is routed through such a matrix is given after the figure.
+
+\begin{figure}
+\begin{minipage}{0.3\linewidth}
+\begin{center}
+\begin{tikzpicture}
+\node (labelleft) [draw,shape=circle,minimum size=16pt] at (2,0) {$2$};
+\node (labelright) [draw,shape=circle,minimum size=16pt] at (3.25,0) {$1$};
+
+\node (rootleft) [draw,shape=rectangle,minimum size=16pt] at (2.5,1) {$x_5 \in \{2,3\}$};
+\node (rootlabel) [draw,shape=circle,minimum size=16pt] at (0.9,1) {$1$};
+\node (root) [draw,shape=rectangle,minimum size=16pt] at (1.75,2) {$x_3 < 0.45$};
+
+\draw[-latex] (root) -- (rootleft);
+\draw[-latex] (root) -- (rootlabel);
+\draw[-latex] (rootleft) -- (labelleft);
+\draw[-latex] (rootleft) -- (labelright);
+
+\end{tikzpicture}
+\end{center}
+\begin{center}
+(a)
+\end{center}
+\end{minipage}
+\hfill
+\begin{minipage}{0.65\linewidth}
+\begin{center}
+\begin{tabular}{c|c|c|c|c|c|}
+& Col 1 & Col 2 & Col 3 & Col 4 & Col 5\\
+\hline
+Row 1 & 1 & 2 & 3 & 6 & 7 \\
+\hline
+Row 2 & 1 & 0 & 1 & 0 & 0 \\
+\hline
+Row 3 & 3 & 0 & 5 & 0 & 0 \\
+\hline
+Row 4 & 1 & 1 & 2 & 2 & 1 \\
+\hline
+Row 5 & 1 & 0 & 2 & 0 & 0 \\
+\hline
+Row 6 & 0.45 & 0 & 2 & 0 & 0 \\
+\hline
+Row 7 &  &  & 3 &  & \\
+\hline
+\end{tabular}
+\end{center}
+\begin{center}
+(b)
+\end{center}
+\end{minipage}
+\caption{(a) An example tree and its (b) matrix representation. $x$ denotes an example and $x_j$ is the value of the $j$th continuous-valued (resp. categorical) feature in $X_\text{cont}$ (resp. $X_\text{cat}$). In this example all leaf nodes are pure and no training example is misclassified.}
+\label{dtree}
+\end{figure}
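To make the encoding concrete, the following small Python sketch routes a feature vector through the matrix of Figure~\ref{dtree}(b), following the row semantics listed above. It is illustrative only (it is not the {\tt decision-tree-predict.dml} implementation) and assumes, as in this example, that the right child is stored in the column immediately after the left child and that an example satisfying the node's test (value $<$ threshold, or value in the subset) is routed to the left child:

    import numpy as np

    # The model matrix of Figure (b), one column per node.
    M = np.array([
        [1,    2, 3, 6, 7],   # Row 1: node ids of a complete binary tree
        [1,    0, 1, 0, 0],   # Row 2: column offset to the left child (0 = leaf)
        [3,    0, 5, 0, 0],   # Row 3: feature index in X_cont / X_cat (0 = leaf)
        [1,    1, 2, 2, 1],   # Row 4: 1=continuous, 2=categorical; for leaves: label
        [1,    0, 2, 0, 0],   # Row 5: 1 or size of value subset; leaves: #misclassified
        [0.45, 0, 2, 0, 0],   # Row 6: threshold, or first value of the subset
        [0,    0, 3, 0, 0],   # Row 7: second value of the subset (0 = unused)
    ])

    def predict(M, x_cont, x_cat):
        col = 0                                    # start at the root column
        while M[1, col] != 0:                      # Row 2 == 0 marks a leaf
            feat = int(M[2, col]) - 1              # 1-based feature index
            if M[3, col] == 1:                     # continuous-valued test
                go_left = x_cont[feat] < M[5, col]
            else:                                  # categorical test on a value subset
                size = int(M[4, col])
                go_left = x_cat[feat] in M[5:5 + size, col]
            col += int(M[1, col]) + (0 if go_left else 1)
        return M[3, col]                           # Row 4 holds the leaf's label

    print(predict(M, x_cont=[0, 0, 0.2], x_cat=[0, 0, 0, 0, 1]))  # -> 1.0
    print(predict(M, x_cont=[0, 0, 0.7], x_cat=[0, 0, 0, 0, 3]))  # -> 2.0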
+
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+
+The matrix corresponding to the learned model, as well as the training accuracy (if requested), is written to a file in the specified format. See
+the Details section above, where the structure of the model matrix is described.
+Recall that in our implementation $X$ is split into $X_\text{cont}$ and $X_\text{cat}$. If requested, the mappings of the continuous-valued feature-ids in $X_\text{cont}$ (stored at {\tt S\_map}) and the categorical feature-ids in $X_\text{cat}$ (stored at {\tt C\_map}) to the global feature-ids in $X$ will be provided. 
+Depending on what arguments are provided during
+invocation, the {\tt decision-tree-predict.dml} script may compute one or more of predictions, accuracy and confusion matrix in the requested output format. 
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+	\hml -f decision-tree.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
+	R=/user/biadmin/R.csv M=/user/biadmin/model.csv
+	bins=20 depth=25 num\_leaf=10 num\_samples=3000 impurity=Gini fmt=csv
+	
+}\smallskip
+
+
+\noindent To compute predictions:
+
+{\hangindent=\parindent\noindent\tt
+	\hml -f decision-tree-predict.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx R=/user/biadmin/R.csv
+	M=/user/biadmin/model.csv  P=/user/biadmin/predictions.csv
+	A=/user/biadmin/accuracy.csv CM=/user/biadmin/confusion.csv fmt=csv
+	
+}\smallskip
+
+
+%\noindent{\bf References}
+%
+%\begin{itemize}
+%\item B. Panda, J. Herbach, S. Basu, and R. Bayardo. \newblock{PLANET: massively parallel learning of tree ensembles with MapReduce}. In Proceedings of the VLDB Endowment, 2009.
+%\item L. Breiman, J. Friedman, R. Olshen, and C. Stone. \newblock{Classification and Regression Trees}. Wadsworth and Brooks, 1984.
+%\end{itemize}


[17/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1181] Change sqlContext to spark in MLContext docs

Posted by de...@apache.org.
[SYSTEMML-1181] Change sqlContext to spark in MLContext docs

The variable sqlContext is not available by default in the spark shell anymore;
instead, spark should be used to create DataFrames. Where methods expect an
instance of SqlContext, arguments are replaced with spark.sqlContext.

Closes #371.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/61f25f2b
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/61f25f2b
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/61f25f2b

Branch: refs/heads/gh-pages
Commit: 61f25f2b682446249ceb94c94da4e5b546cb3eec
Parents: b9d878c
Author: Felix Schueler <fe...@ibm.com>
Authored: Thu Feb 2 16:08:08 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Thu Feb 2 16:08:08 2017 -0800

----------------------------------------------------------------------
 spark-mlcontext-programming-guide.md | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/61f25f2b/spark-mlcontext-programming-guide.md
----------------------------------------------------------------------
diff --git a/spark-mlcontext-programming-guide.md b/spark-mlcontext-programming-guide.md
index dcaa125..8c0a79f 100644
--- a/spark-mlcontext-programming-guide.md
+++ b/spark-mlcontext-programming-guide.md
@@ -141,7 +141,7 @@ val numRows = 10000
 val numCols = 1000
 val data = sc.parallelize(0 to numRows-1).map { _ => Row.fromSeq(Seq.fill(numCols)(Random.nextDouble)) }
 val schema = StructType((0 to numCols-1).map { i => StructField("C" + i, DoubleType, true) } )
-val df = sqlContext.createDataFrame(data, schema)
+val df = spark.createDataFrame(data, schema)
 {% endhighlight %}
 </div>
 
@@ -167,7 +167,7 @@ data: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[1] a
 
 scala> val schema = StructType((0 to numCols-1).map { i => StructField("C" + i, DoubleType, true) } )
 schema: org.apache.spark.sql.types.StructType = StructType(StructField(C0,DoubleType,true), StructField(C1,DoubleType,true), StructField(C2,DoubleType,true), StructField(C3,DoubleType,true), StructField(C4,DoubleType,true), StructField(C5,DoubleType,true), StructField(C6,DoubleType,true), StructField(C7,DoubleType,true), StructField(C8,DoubleType,true), StructField(C9,DoubleType,true), StructField(C10,DoubleType,true), StructField(C11,DoubleType,true), StructField(C12,DoubleType,true), StructField(C13,DoubleType,true), StructField(C14,DoubleType,true), StructField(C15,DoubleType,true), StructField(C16,DoubleType,true), StructField(C17,DoubleType,true), StructField(C18,DoubleType,true), StructField(C19,DoubleType,true), StructField(C20,DoubleType,true), StructField(C21,DoubleType,true), ...
-scala> val df = sqlContext.createDataFrame(data, schema)
+scala> val df = spark.createDataFrame(data, schema)
 df: org.apache.spark.sql.DataFrame = [C0: double, C1: double, C2: double, C3: double, C4: double, C5: double, C6: double, C7: double, C8: double, C9: double, C10: double, C11: double, C12: double, C13: double, C14: double, C15: double, C16: double, C17: double, C18: double, C19: double, C20: double, C21: double, C22: double, C23: double, C24: double, C25: double, C26: double, C27: double, C28: double, C29: double, C30: double, C31: double, C32: double, C33: double, C34: double, C35: double, C36: double, C37: double, C38: double, C39: double, C40: double, C41: double, C42: double, C43: double, C44: double, C45: double, C46: double, C47: double, C48: double, C49: double, C50: double, C51: double, C52: double, C53: double, C54: double, C55: double, C56: double, C57: double, C58: double, C5...
 
 {% endhighlight %}
@@ -1540,7 +1540,7 @@ val numRows = 10000
 val numCols = 1000
 val data = sc.parallelize(0 to numRows-1).map { _ => Row.fromSeq(Seq.fill(numCols)(Random.nextDouble)) }
 val schema = StructType((0 to numCols-1).map { i => StructField("C" + i, DoubleType, true) } )
-val df = sqlContext.createDataFrame(data, schema)
+val df = spark.createDataFrame(data, schema)
 val mm = new MatrixMetadata(numRows, numCols)
 val minMaxMeanScript = dml(minMaxMean).in("Xin", df, mm).out("minOut", "maxOut", "meanOut")
 val minMaxMeanScript = dml(minMaxMean).in("Xin", df, mm).out("minOut", "maxOut", "meanOut")
@@ -1561,7 +1561,7 @@ val numRows = 10000
 val numCols = 1000
 val data = sc.parallelize(0 to numRows-1).map { _ => Row.fromSeq(Seq.fill(numCols)(Random.nextDouble)) }
 val schema = StructType((0 to numCols-1).map { i => StructField("C" + i, DoubleType, true) } )
-val df = sqlContext.createDataFrame(data, schema)
+val df = spark.createDataFrame(data, schema)
 val mm = new MatrixMetadata(numRows, numCols)
 val bbm = new BinaryBlockMatrix(df, mm)
 val minMaxMeanScript = dml(minMaxMean).in("Xin", bbm).out("minOut", "maxOut", "meanOut")
@@ -1852,7 +1852,7 @@ data: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[1] a
 scala> val schema = StructType((0 to numCols-1).map { i => StructField("C" + i, DoubleType, true) } )
 schema: org.apache.spark.sql.types.StructType = StructType(StructField(C0,DoubleType,true), StructField(C1,DoubleType,true), StructField(C2,DoubleType,true), StructField(C3,DoubleType,true), StructField(C4,DoubleType,true), StructField(C5,DoubleType,true), StructField(C6,DoubleType,true), StructField(C7,DoubleType,true), StructField(C8,DoubleType,true), StructField(C9,DoubleType,true), StructField(C10,DoubleType,true), StructField(C11,DoubleType,true), StructField(C12,DoubleType,true), StructField(C13,DoubleType,true), StructField(C14,DoubleType,true), StructField(C15,DoubleType,true), StructField(C16,DoubleType,true), StructField(C17,DoubleType,true), StructField(C18,DoubleType,true), StructField(C19,DoubleType,true), StructField(C20,DoubleType,true), StructField(C21,DoubleType,true), ...
 
-scala> val df = sqlContext.createDataFrame(data, schema)
+scala> val df = spark.createDataFrame(data, schema)
 df: org.apache.spark.sql.DataFrame = [C0: double, C1: double, C2: double, C3: double, C4: double, C5: double, C6: double, C7: double, C8: double, C9: double, C10: double, C11: double, C12: double, C13: double, C14: double, C15: double, C16: double, C17: double, C18: double, C19: double, C20: double, C21: double, C22: double, C23: double, C24: double, C25: double, C26: double, C27: double, C28: double, C29: double, C30: double, C31: double, C32: double, C33: double, C34: double, C35: double, C36: double, C37: double, C38: double, C39: double, C40: double, C41: double, C42: double, C43: double, C44: double, C45: double, C46: double, C47: double, C48: double, C49: double, C50: double, C51: double, C52: double, C53: double, C54: double, C55: double, C56: double, C57: double, C58: double, C5...
 
 {% endhighlight %}
@@ -1867,7 +1867,7 @@ val numRows = 100000
 val numCols = 1000
 val data = sc.parallelize(0 to numRows-1).map { _ => Row.fromSeq(Seq.fill(numCols)(Random.nextDouble)) }
 val schema = StructType((0 to numCols-1).map { i => StructField("C" + i, DoubleType, true) } )
-val df = sqlContext.createDataFrame(data, schema)
+val df = spark.createDataFrame(data, schema)
 {% endhighlight %}
 </div>
 
@@ -1889,7 +1889,7 @@ scala> import org.apache.sysml.api.MLOutput
 import org.apache.sysml.api.MLOutput
 
 scala> def getScalar(outputs: MLOutput, symbol: String): Any =
-     | outputs.getDF(sqlContext, symbol).first()(1)
+     | outputs.getDF(spark.sqlContext, symbol).first()(1)
 getScalar: (outputs: org.apache.sysml.api.MLOutput, symbol: String)Any
 
 scala> def getScalarDouble(outputs: MLOutput, symbol: String): Double =
@@ -1907,7 +1907,7 @@ getScalarInt: (outputs: org.apache.sysml.api.MLOutput, symbol: String)Int
 {% highlight scala %}
 import org.apache.sysml.api.MLOutput
 def getScalar(outputs: MLOutput, symbol: String): Any =
-outputs.getDF(sqlContext, symbol).first()(1)
+outputs.getDF(spark.sqlContext, symbol).first()(1)
 def getScalarDouble(outputs: MLOutput, symbol: String): Double =
 getScalar(outputs, symbol).asInstanceOf[Double]
 def getScalarInt(outputs: MLOutput, symbol: String): Int =
@@ -2264,7 +2264,7 @@ The Spark `LinearDataGenerator` is used to generate test data for the Spark ML a
 {% highlight scala %}
 // Generate data
 import org.apache.spark.mllib.util.LinearDataGenerator
-import sqlContext.implicits._
+import spark.implicits._
 
 val numRows = 10000
 val numCols = 1000
@@ -2549,7 +2549,7 @@ This cell contains helper methods to return `Double` and `Int` values from outpu
 import org.apache.sysml.api.MLOutput
 
 def getScalar(outputs: MLOutput, symbol: String): Any =
-    outputs.getDF(sqlContext, symbol).first()(1)
+    outputs.getDF(spark.sqlContext, symbol).first()(1)
 
 def getScalarDouble(outputs: MLOutput, symbol: String): Double =
     getScalar(outputs, symbol).asInstanceOf[Double]
@@ -2638,7 +2638,7 @@ val outputs = ml.executeScript(linearReg)
 val trainingTime = (System.currentTimeMillis() - start).toDouble / 1000.0
 
 // Get outputs
-val B = outputs.getDF(sqlContext, "beta_out").sort("ID").drop("ID")
+val B = outputs.getDF(spark.sqlContext, "beta_out").sort("ID").drop("ID")
 val r2 = getScalarDouble(outputs, "R2")
 val iters = getScalarInt(outputs, "totalIters")
 val trainingTimePerIter = trainingTime / iters
@@ -2815,7 +2815,7 @@ outputs = ml.executeScript(pnmf, {"X": X_train, "maxiter": 100, "rank": 10}, ["W
 
 {% highlight python %}
 # Plot training loss over time
-losses = outputs.getDF(sqlContext, "losses")
+losses = outputs.getDF(spark.sqlContext, "losses")
 xy = losses.sort(losses.ID).map(lambda r: (r[0], r[1])).collect()
 x, y = zip(*xy)
 plt.plot(x, y)


[30/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1238] Updated the default parameters of mllearn to match that of scikit learn.

Posted by de...@apache.org.
[SYSTEMML-1238] Updated the default parameters of mllearn to match that of
scikit learn.

- Also updated the test to compare our algorithm to scikit-learn.

Closes #398.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/0fb74b94
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/0fb74b94
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/0fb74b94

Branch: refs/heads/gh-pages
Commit: 0fb74b94af9e244b5695745ac7b3651b485b812f
Parents: bb97a4b
Author: Niketan Pansare <np...@us.ibm.com>
Authored: Fri Feb 17 14:54:23 2017 -0800
Committer: Niketan Pansare <np...@us.ibm.com>
Committed: Fri Feb 17 14:59:49 2017 -0800

----------------------------------------------------------------------
 algorithms-regression.md  | 8 ++++----
 beginners-guide-python.md | 2 +-
 python-reference.md       | 6 +++---
 3 files changed, 8 insertions(+), 8 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/0fb74b94/algorithms-regression.md
----------------------------------------------------------------------
diff --git a/algorithms-regression.md b/algorithms-regression.md
index 992862e..80b38a3 100644
--- a/algorithms-regression.md
+++ b/algorithms-regression.md
@@ -83,8 +83,8 @@ efficient when the number of features $m$ is relatively small
 <div data-lang="Python" markdown="1">
 {% highlight python %}
 from systemml.mllearn import LinearRegression
-# C = 1/reg
-lr = LinearRegression(sqlCtx, fit_intercept=True, C=1.0, solver='direct-solve')
+# C = 1/reg (to disable regularization, use float("inf"))
+lr = LinearRegression(sqlCtx, fit_intercept=True, normalize=False, C=float("inf"), solver='direct-solve')
 # X_train, y_train and X_test can be NumPy matrices or Pandas DataFrame or SciPy Sparse Matrix
 y_test = lr.fit(X_train, y_train)
 # df_train is DataFrame that contains two columns: "features" (of type Vector) and "label". df_test is a DataFrame that contains the column "features"
@@ -125,8 +125,8 @@ y_test = lr.fit(df_train)
 <div data-lang="Python" markdown="1">
 {% highlight python %}
 from systemml.mllearn import LinearRegression
-# C = 1/reg
-lr = LinearRegression(sqlCtx, fit_intercept=True, max_iter=100, tol=0.000001, C=1.0, solver='newton-cg')
+# C = 1/reg (to disable regularization, use float("inf"))
+lr = LinearRegression(sqlCtx, fit_intercept=True, normalize=False, max_iter=100, tol=0.000001, C=float("inf"), solver='newton-cg')
 # X_train, y_train and X_test can be NumPy matrices or Pandas DataFrames or SciPy Sparse matrices
 y_test = lr.fit(X_train, y_train)
 # df_train is DataFrame that contains two columns: "features" (of type Vector) and "label". df_test is a DataFrame that contains the column "features"

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/0fb74b94/beginners-guide-python.md
----------------------------------------------------------------------
diff --git a/beginners-guide-python.md b/beginners-guide-python.md
index 4d1b098..ffab09e 100644
--- a/beginners-guide-python.md
+++ b/beginners-guide-python.md
@@ -228,7 +228,7 @@ X_test = diabetes_X[-20:]
 y_train = diabetes.target[:-20]
 y_test = diabetes.target[-20:]
 # Create linear regression object
-regr = LinearRegression(sqlCtx, fit_intercept=True, C=1, solver='direct-solve')
+regr = LinearRegression(sqlCtx, fit_intercept=True, C=float("inf"), solver='direct-solve')
 # Train the model using the training sets
 regr.fit(X_train, y_train)
 y_predicted = regr.predict(X_test)

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/0fb74b94/python-reference.md
----------------------------------------------------------------------
diff --git a/python-reference.md b/python-reference.md
index 65dcb5c..8d38598 100644
--- a/python-reference.md
+++ b/python-reference.md
@@ -731,7 +731,7 @@ LogisticRegression score: 0.922222
 
 ### Reference documentation
 
- *class*`systemml.mllearn.estimators.LinearRegression`(*sqlCtx*, *fit\_intercept=True*, *max\_iter=100*, *tol=1e-06*, *C=1.0*, *solver='newton-cg'*, *transferUsingDF=False*)(#systemml.mllearn.estimators.LinearRegression "Permalink to this definition")
+ *class*`systemml.mllearn.estimators.LinearRegression`(*sqlCtx*, *fit\_intercept=True*, *normalize=False*, *max\_iter=100*, *tol=1e-06*, *C=float("inf")*, *solver='newton-cg'*, *transferUsingDF=False*)(#systemml.mllearn.estimators.LinearRegression "Permalink to this definition")
 :   Bases: `systemml.mllearn.estimators.BaseSystemMLRegressor`{.xref .py
     .py-class .docutils .literal}
 
@@ -760,7 +760,7 @@ LogisticRegression score: 0.922222
         >>> # The mean square error
         >>> print("Residual sum of squares: %.2f" % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
 
- *class*`systemml.mllearn.estimators.LogisticRegression`(*sqlCtx*, *penalty='l2'*, *fit\_intercept=True*, *max\_iter=100*, *max\_inner\_iter=0*, *tol=1e-06*, *C=1.0*, *solver='newton-cg'*, *transferUsingDF=False*)(#systemml.mllearn.estimators.LogisticRegression "Permalink to this definition")
+ *class*`systemml.mllearn.estimators.LogisticRegression`(*sqlCtx*, *penalty='l2'*, *fit\_intercept=True*, *normalize=False*,  *max\_iter=100*, *max\_inner\_iter=0*, *tol=1e-06*, *C=1.0*, *solver='newton-cg'*, *transferUsingDF=False*)(#systemml.mllearn.estimators.LogisticRegression "Permalink to this definition")
 :   Bases: `systemml.mllearn.estimators.BaseSystemMLClassifier`{.xref
     .py .py-class .docutils .literal}
 
@@ -817,7 +817,7 @@ LogisticRegression score: 0.922222
         >>> prediction = model.transform(test)
         >>> prediction.show()
 
- *class*`systemml.mllearn.estimators.SVM`(*sqlCtx*, *fit\_intercept=True*, *max\_iter=100*, *tol=1e-06*, *C=1.0*, *is\_multi\_class=False*, *transferUsingDF=False*)(#systemml.mllearn.estimators.SVM "Permalink to this definition")
+ *class*`systemml.mllearn.estimators.SVM`(*sqlCtx*, *fit\_intercept=True*, *normalize=False*, *max\_iter=100*, *tol=1e-06*, *C=1.0*, *is\_multi\_class=False*, *transferUsingDF=False*)(#systemml.mllearn.estimators.SVM "Permalink to this definition")
 :   Bases: `systemml.mllearn.estimators.BaseSystemMLClassifier`{.xref
     .py .py-class .docutils .literal}
 


[24/50] [abbrv] incubator-systemml git commit: [SYSTEMML-855] Update the documentation for Python users

Posted by de...@apache.org.
[SYSTEMML-855] Update the documentation for Python users


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/51da13ee
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/51da13ee
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/51da13ee

Branch: refs/heads/gh-pages
Commit: 51da13ee3d85e79b37c941a1020db91cca213de1
Parents: 7c17feb
Author: Niketan Pansare <np...@us.ibm.com>
Authored: Thu Feb 9 15:33:01 2017 -0800
Committer: Niketan Pansare <np...@us.ibm.com>
Committed: Thu Feb 9 15:33:00 2017 -0800

----------------------------------------------------------------------
 beginners-guide-python.md | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/51da13ee/beginners-guide-python.md
----------------------------------------------------------------------
diff --git a/beginners-guide-python.md b/beginners-guide-python.md
index 8a05ca6..4d1b098 100644
--- a/beginners-guide-python.md
+++ b/beginners-guide-python.md
@@ -71,8 +71,23 @@ brew install apache-spark16
 
 ### Install SystemML
 
-We are working towards uploading the python package on PyPi. Until then, please use following
-commands: 
+To install released SystemML, please use the following commands:
+
+<div class="codetabs">
+<div data-lang="Python 2" markdown="1">
+```bash
+pip install systemml
+```
+</div>
+<div data-lang="Python 3" markdown="1">
+```bash
+pip3 install systemml
+```
+</div>
+</div>
+
+
+If you want to try out the bleeding edge version, please use the following commands: 
 
 <div class="codetabs">
 <div data-lang="Python 2" markdown="1">


[46/50] [abbrv] incubator-systemml git commit: [MINOR] Added common errors and troubleshooting tricks

Posted by de...@apache.org.
[MINOR] Added common errors and troubleshooting tricks

Closes #428.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/bd232241
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/bd232241
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/bd232241

Branch: refs/heads/gh-pages
Commit: bd232241b432dbe28e952ae36f1dce03f5658e23
Parents: 358cfc9
Author: Niketan Pansare <np...@us.ibm.com>
Authored: Mon Mar 13 13:53:45 2017 -0800
Committer: Niketan Pansare <np...@us.ibm.com>
Committed: Mon Mar 13 14:53:45 2017 -0700

----------------------------------------------------------------------
 troubleshooting-guide.md | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/bd232241/troubleshooting-guide.md
----------------------------------------------------------------------
diff --git a/troubleshooting-guide.md b/troubleshooting-guide.md
index db8f060..629bcf5 100644
--- a/troubleshooting-guide.md
+++ b/troubleshooting-guide.md
@@ -94,3 +94,45 @@ Note: The default `SystemML-config.xml` is located in `<path to SystemML root>/c
     hadoop jar SystemML.jar [-? | -help | -f <filename>] (-config=<config_filename>) ([-args | -nvargs] <args-list>)
     
 See [Invoking SystemML in Hadoop Batch Mode](hadoop-batch-mode.html) for details of the syntax. 
+
+## Total size of serialized results is bigger than spark.driver.maxResultSize
+
+Spark aborts a job if the estimated result size of a collect is greater than `spark.driver.maxResultSize`, in order to avoid out-of-memory errors in the driver.
+However, SystemML's optimizer estimates the memory required for each operator and guards against these out-of-memory errors in the driver.
+So, we recommend setting the configuration `--conf spark.driver.maxResultSize=0`.
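For instance, the same setting can be applied programmatically when the SparkContext is created by hand (a minimal PySpark sketch; the configuration key is Spark's, the surrounding code is only illustrative):

	from pyspark import SparkConf, SparkContext

	# Remove the limit on the serialized result size (0 means unlimited),
	# instead of passing --conf spark.driver.maxResultSize=0 to spark-submit.
	conf = SparkConf().set("spark.driver.maxResultSize", "0")
	sc = SparkContext(conf=conf)

The same pattern applies to the other Spark settings discussed below, such as `spark.network.timeout`.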
+
+## File does not exist on HDFS/LFS error from remote parfor
+
+This error usually comes from incorrect HDFS configuration on the worker nodes. To investigate this, we recommend
+
+- Testing whether HDFS is accessible from the worker node: `hadoop fs -ls <file path>`
+- Synchronizing the hadoop configuration across the worker nodes.
+- Setting the environment variable `HADOOP_CONF_DIR`. You may have to restart the cluster manager for the hadoop configuration to take effect. 
+
+## JVM Garbage Collection related flags
+
+We recommend allocating 10% of the maximum memory to the young generation and using the `-server` flag for a robust garbage collection policy. 
+For example, if you intend to use a 20G driver and 60G executors, then add the following to your configuration:
+
+	 spark-submit --driver-memory 20G --executor-memory 60G --conf "spark.executor.extraJavaOptions=-Xmn6G -server" --conf  "spark.driver.extraJavaOptions=-Xmn2G -server" ... 
+
+## Memory overhead
+
+Spark sets `spark.yarn.executor.memoryOverhead`, `spark.yarn.driver.memoryOverhead` and `spark.yarn.am.memoryOverhead` to be 10% of memory provided
+to the executor, driver, and YARN Application Master, respectively (with a minimum of 384 MB). For certain workloads, the user may have to increase this
+overhead to 12-15% of the memory budget.
+
+## Network timeout
+
+To avoid false-positive errors due to network failures for compute-bound scripts, the user may have to increase the timeout `spark.network.timeout` (default: 120s).
+
+## Advanced developer statistics
+
+A few of our operators (for example, the convolution-related operators) and the GPU backend allow an expert user to get advanced statistics
+by setting the configurations `systemml.stats.extraGPU` and `systemml.stats.extraDNN` in the file SystemML-config.xml. 
+
+## Out-Of-Memory on executors
+
+Out-of-memory errors on executors are often caused by side-effects of lazy evaluation and Spark's in-memory caching of input data for large-scale problems. 
+Though we are constantly improving our optimizer to address this scenario, a quick workaround is to reduce the number of cores allocated to each executor.
+We would highly appreciate it if you filed a bug report on our [issue tracker](https://issues.apache.org/jira/browse/SYSTEMML) if and when you encounter an OOM error.
\ No newline at end of file


[44/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1393] Exclude alg ref and lang ref dirs from doc site

Posted by de...@apache.org.
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/DescriptiveBivarStats.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/DescriptiveBivarStats.tex b/Algorithms Reference/DescriptiveBivarStats.tex
deleted file mode 100644
index a2d3db1..0000000
--- a/Algorithms Reference/DescriptiveBivarStats.tex	
+++ /dev/null
@@ -1,438 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Bivariate Statistics}
-
-\noindent{\bf Description}
-\smallskip
-
-Bivariate statistics are used to quantitatively describe the association between
-two features, such as test their statistical (in-)dependence or measure
-the accuracy of one data feature predicting the other feature, in a sample.
-The \BivarScriptName{} script computes common bivariate statistics,
-such as \NameStatR{} and \NameStatChi{}, in parallel for many pairs
-of data features.  For a given dataset matrix, script \BivarScriptName{} computes
-certain bivariate statistics for the given feature (column) pairs in the
-matrix.  The feature types govern the exact set of statistics computed for that pair.
-For example, \NameStatR{} can only be computed on two quantitative (scale)
-features like `Height' and `Temperature'. 
-It does not make sense to compute the linear correlation of two categorical attributes
-like `Hair Color'. 
-
-
-\smallskip
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%\tolerance=0
-{\tt{}-f }path/\/\BivarScriptName{}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} index1=}path/file
-{\tt{} index2=}path/file
-{\tt{} types1=}path/file
-{\tt{} types2=}path/file
-{\tt{} OUTDIR=}path
-% {\tt{} fmt=}format
-
-}
-
-
-\smallskip
-\noindent{\bf Arguments}
-\begin{Description}
-\item[{\tt X}:]
-Location (on HDFS) to read the data matrix $X$ whose columns are the features
-that we want to compare and correlate with bivariate statistics.
-\item[{\tt index1}:] % (default:\mbox{ }{\tt " "})
-Location (on HDFS) to read the single-row matrix that lists the column indices
-of the \emph{first-argument} features in pairwise statistics.
-Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the
-index $k$ of column \texttt{X[,$\,k$]} in the data matrix
-whose bivariate statistics need to be computed.
-% The default value means ``use all $X$-columns from the first to the last.''
-\item[{\tt index2}:] % (default:\mbox{ }{\tt " "})
-Location (on HDFS) to read the single-row matrix that lists the column indices
-of the \emph{second-argument} features in pairwise statistics.
-Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the
-index $l$ of column \texttt{X[,$\,l$]} in the data matrix
-whose bivariate statistics need to be computed.
-% The default value means ``use all $X$-columns from the first to the last.''
-\item[{\tt types1}:] % (default:\mbox{ }{\tt " "})
-Location (on HDFS) to read the single-row matrix that lists the \emph{types}
-of the \emph{first-argument} features in pairwise statistics.
-Its $i^{\textrm{th}}$ entry (i.e.\ $i^{\textrm{th}}$ column-cell) contains the type
-of column \texttt{X[,$\,k$]} in the data matrix, where $k$ is the $i^{\textrm{th}}$
-entry in the {\tt index1} matrix.  Feature types must be encoded by
-integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
-% The default value means ``treat all referenced $X$-columns as scale.''
-\item[{\tt types2}:] % (default:\mbox{ }{\tt " "})
-Location (on HDFS) to read the single-row matrix that lists the \emph{types}
-of the \emph{second-argument} features in pairwise statistics.
-Its $j^{\textrm{th}}$ entry (i.e.\ $j^{\textrm{th}}$ column-cell) contains the type
-of column \texttt{X[,$\,l$]} in the data matrix, where $l$ is the $j^{\textrm{th}}$
-entry in the {\tt index2} matrix.  Feature types must be encoded by
-integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
-% The default value means ``treat all referenced $X$-columns as scale.''
-\item[{\tt OUTDIR}:]
-Location path (on HDFS) where the output matrices with computed bivariate
-statistics will be stored.  The matrices' file names and format are defined
-in Table~\ref{table:bivars}.
-% \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-% Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-% see read/write functions in SystemML Language Reference for details.
-\end{Description}
-
-\begin{table}[t]\hfil
-\begin{tabular}{|lll|}
-\hline\rule{0pt}{12pt}%
-Output File / Matrix         & Row$\,$\# & Name of Statistic   \\[2pt]
-\hline\hline\rule{0pt}{12pt}%
-\emph{All Files}            &     1     & 1-st feature column \\
-\rule{1em}{0pt}"            &     2     & 2-nd feature column \\[2pt]
-\hline\rule{0pt}{12pt}%
-bivar.scale.scale.stats     &     3     & \NameStatR          \\[2pt]
-\hline\rule{0pt}{12pt}%
-bivar.nominal.nominal.stats &     3     & \NameStatChi        \\
-\rule{1em}{0pt}"            &     4     & Degrees of freedom  \\
-\rule{1em}{0pt}"            &     5     & \NameStatPChi       \\
-\rule{1em}{0pt}"            &     6     & \NameStatV          \\[2pt]
-\hline\rule{0pt}{12pt}%
-bivar.nominal.scale.stats   &     3     & \NameStatEta        \\
-\rule{1em}{0pt}"            &     4     & \NameStatF          \\[2pt]
-\hline\rule{0pt}{12pt}%
-bivar.ordinal.ordinal.stats &     3     & \NameStatRho        \\[2pt]
-\hline
-\end{tabular}\hfil
-\caption{%
-The output matrices of \BivarScriptName{} have one row per one bivariate
-statistic and one column per one pair of input features.  This table lists
-the meaning of each matrix and each row.%
-% Signs ``+'' show applicability to scale or/and to categorical features.
-}
-\label{table:bivars}
-\end{table}
-
-
-
-\pagebreak[2]
-
-\noindent{\bf Details}
-\smallskip
-
-Script \BivarScriptName{} takes an input matrix \texttt{X} whose columns represent
-the features and whose rows represent the records of a data sample.
-Given \texttt{X}, the script computes certain relevant bivariate statistics
-for specified pairs of feature columns \texttt{X[,$\,i$]} and \texttt{X[,$\,j$]}.
-Command-line parameters \texttt{index1} and \texttt{index2} specify the files with
-column pairs of interest to the user.  Namely, the file given by \texttt{index1}
-contains the vector of the 1st-attribute column indices and the file given
-by \texttt{index2} has the vector of the 2nd-attribute column indices, with
-``1st'' and ``2nd'' referring to their places in bivariate statistics.
-Note that both \texttt{index1} and \texttt{index2} files should contain a 1-row matrix
-of positive integers.
-
-The bivariate statistics to be computed depend on the \emph{types}, or
-\emph{measurement levels}, of the two columns.
-The types for each pair are provided in the files whose locations are specified by
-\texttt{types1} and \texttt{types2} command-line parameters.
-These files are also 1-row matrices, i.e.\ vectors, that list the 1st-attribute and
-the 2nd-attribute column types in the same order as their indices in the
-\texttt{index1} and \texttt{index2} files.  The types must be provided as per
-the following convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
-
-The script organizes its results into (potentially) four output matrices, one per
-each type combination.  The types of bivariate statistics are defined using the types
-of the columns that were used for their arguments, with ``ordinal'' sometimes
-retrogressing to ``nominal.''  Table~\ref{table:bivars} describes what each column
-in each output matrix contains.  In particular, the script includes the following
-statistics:
-\begin{Itemize}
-\item For a pair of scale (quantitative) columns, \NameStatR;
-\item For a pair of nominal columns (with finite-sized, fixed, unordered domains), 
-the \NameStatChi{} and its p-value;
-\item For a pair of one scale column and one nominal column, \NameStatF{};
-\item For a pair of ordinal columns (ordered domains depicting ranks), \NameStatRho.
-\end{Itemize}
-Note that, as shown in Table~\ref{table:bivars}, the output matrices contain the
-column indices of the features involved in each statistic.
-Moreover, if the output matrix does not contain
-a value in a certain cell then it should be interpreted as a~$0$
-(sparse matrix representation).
-
-Below we list all bivariate statistics computed by script \BivarScriptName.
-The statistics are collected into several groups by the type of their input
-features.  We refer to the two input features as $v_1$ and $v_2$ unless
-specified otherwise; the value pairs are $(v_{1,i}, v_{2,i})$ for $i=1,\ldots,n$,
-where $n$ is the number of rows in \texttt{X}, i.e.\ the sample size.
-
-
-\paragraph{Scale-vs-scale statistics.}
-Sample statistics that describe association between two quantitative (scale) features.
-A scale feature has numerical values, with the natural ordering relation.
-\begin{Description}
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it\NameStatR]:
-A measure of linear dependence between two numerical features:
-\begin{equation*}
-r \,\,=\,\, \frac{\Cov(v_1, v_2)}{\sqrt{\Var v_1 \Var v_2}}
-\,\,=\,\, \frac{\sum_{i=1}^n (v_{1,i} - \bar{v}_1) (v_{2,i} - \bar{v}_2)}%
-{\sqrt{\sum_{i=1}^n (v_{1,i} - \bar{v}_1)^{2\mathstrut} \cdot \sum_{i=1}^n (v_{2,i} - \bar{v}_2)^{2\mathstrut}}}
-\end{equation*}
-Commonly denoted by~$r$, correlation ranges between $-1$ and $+1$, reaching ${\pm}1$ when all value
-pairs $(v_{1,i}, v_{2,i})$ lie on the same line.  Correlation near~0 means that a line is not a good
-way to represent the dependence between the two features; however, this does not imply independence.
-The sign indicates direction of the linear association: $r > 0$ ($r < 0$) if one feature tends to
-linearly increase (decrease) when the other feature increases.  Nonlinear association, if present,
-may disobey this sign.
-\NameStatR{} is symmetric: $r(v_1, v_2) = r(v_2, v_1)$; it does not change if we transform $v_1$ and $v_2$
-to $a + b v_1$ and $c + d v_2$ where $a, b, c, d$ are constants and $b, d > 0$.
-
-Suppose that we use simple linear regression to represent one feature given the other, say
-represent $v_{2,i} \approx \alpha + \beta v_{1,i}$ by selecting $\alpha$ and $\beta$
-to minimize the least-squares error $\sum_{i=1}^n (v_{2,i} - \alpha - \beta v_{1,i})^2$.
-Then the best error equals
-\begin{equation*}
-\min_{\alpha, \beta} \,\,\sum_{i=1}^n \big(v_{2,i} - \alpha - \beta v_{1,i}\big)^2 \,\,=\,\,
-(1 - r^2) \,\sum_{i=1}^n \big(v_{2,i} - \bar{v}_2\big)^2
-\end{equation*}
-In other words, $1\,{-}\,r^2$ is the ratio of the residual sum of squares to
-the total sum of squares.  Hence, $r^2$ is an accuracy measure of the linear regression.
-\end{Description}
-
-
-\paragraph{Nominal-vs-nominal statistics.}
-Sample statistics that describe association between two nominal categorical features.
-Both features' value domains are encoded with positive integers in arbitrary order:
-nominal features do not order their value domains.
-\begin{Description}
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it\NameStatChi]:
-A measure of how much the frequencies of value pairs of two categorical features deviate from
-statistical independence.  Under independence, the probability of every value pair must equal
-the product of probabilities of each value in the pair:
-$\Prob[a, b] - \Prob[a]\,\Prob[b] = 0$.  But we do not know these (hypothesized) probabilities;
-we only know the sample frequency counts.  Let $n_{a,b}$ be the frequency count of pair
-$(a, b)$, let $n_a$ and $n_b$ be the frequency counts of $a$~alone and of $b$~alone.  Under
-independence, difference $n_{a,b}{/}n - (n_a{/}n)(n_b{/}n)$ is unlikely to be exactly~0 due
-to sample randomness, yet it is unlikely to be too far from~0.  For some pairs $(a,b)$ it may
-deviate from~0 farther than for other pairs.  \NameStatChi{}~is an aggregate measure that
-combines squares of these differences across all value pairs:
-\begin{equation*}
-\chi^2 \,\,=\,\, \sum_{a,\,b} \Big(\frac{n_a n_b}{n}\Big)^{-1} \Big(n_{a,b} - \frac{n_a n_b}{n}\Big)^2
-\,=\,\, \sum_{a,\,b} \frac{(O_{a,b} - E_{a,b})^2}{E_{a,b}}
-\end{equation*}
-where $O_{a,b} = n_{a,b}$ are the \emph{observed} frequencies and $E_{a,b} = (n_a n_b){/}n$ are
-the \emph{expected} frequencies for all pairs~$(a,b)$.  Under independence (plus other standard
-assumptions) the sample~$\chi^2$ closely follows a well-known distribution, making it a basis for
-statistical tests for independence, see~\emph{\NameStatPChi} for details.  Note that \NameStatChi{}
-does \emph{not} measure the strength of dependence: even very weak dependence may result in a
-significant deviation from independence if the counts are large enough.  Use~\NameStatV{} instead
-to measure the strength of dependence.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Degrees of freedom]:
-An integer parameter required for the interpretation of~\NameStatChi{} measure.  Under independence
-(plus other standard assumptions) the sample~$\chi^2$ statistic is approximately distributed as the
-sum of $d$~squares of independent normal random variables with mean~0 and variance~1, where $d$ is
-this integer parameter.  For a pair of categorical features such that the $1^{\textrm{st}}$~feature
-has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories, the number of degrees
-of freedom is $d = (k_1 - 1)(k_2 - 1)$.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it\NameStatPChi]:
-A measure of how likely we would observe the current frequencies of value pairs of two categorical
-features assuming their statistical independence.  More precisely, it computes the probability that
-the sum of $d$~squares of independent normal random variables with mean~0 and variance~1
-(called the $\chi^2$~distribution with $d$ degrees of freedom) generates a value at least as large
-as the current sample \NameStatChi.  The $d$ parameter is \emph{degrees of freedom}, see above.
-Under independence (plus other standard assumptions) the sample \NameStatChi{} closely follows the
-$\chi^2$~distribution and is unlikely to land very far into its tail.  On the other hand, if the
-two features are dependent, their sample \NameStatChi{} becomes arbitrarily large as $n\to\infty$
-and lands extremely far into the tail of the $\chi^2$~distribution given a large enough data sample.
-\NameStatPChi{} returns the tail ``weight'' on the right-hand side of \NameStatChi:
-\begin{equation*}
-P\,\,=\,\, \Prob\big[r \geq \textrm{\NameStatChi} \,\,\big|\,\, r \sim \textrm{the $\chi^2$ distribution}\big]
-\end{equation*}
-As any probability, $P$ ranges between 0 and~1.  If $P\leq 0.05$, the dependence between the two
-features may be considered statistically significant (i.e.\ their independence is considered
-statistically ruled out).  For highly dependent features, it is not unusual to have $P\leq 10^{-20}$
-or less, in which case our script will simply return $P = 0$.  Independent features should have
-their $P\geq 0.05$ in about 95\% cases.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it\NameStatV]:
-A measure for the strength of association, i.e.\ of statistical dependence, between two categorical
-features, conceptually similar to \NameStatR.  It divides the observed~\NameStatChi{} by the maximum
-possible~$\chi^2_{\textrm{max}}$ given $n$ and the number $k_1, k_2$~of categories in each feature,
-then takes the square root.  Thus, \NameStatV{} ranges from 0 to~1,
-where 0 implies no association and 1 implies the maximum possible association (one-to-one
-correspondence) between the two features.  See \emph{\NameStatChi} for the computation of~$\chi^2$;
-its maximum${} = {}$%
-$n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}$ where the $1^{\textrm{st}}$~feature
-has $k_1$ categories and the $2^{\textrm{nd}}$~feature has $k_2$ categories~\cite{AcockStavig1979:CramersV},
-so
-\begin{equation*}
-\textrm{\NameStatV} \,\,=\,\, \sqrt{\frac{\textrm{\NameStatChi}}{n\cdot\min\{k_1\,{-}\,1, k_2\,{-}\,1\}}}
-\end{equation*}
-As opposed to \NameStatPChi, which goes to~0 (rapidly) as the features' dependence increases,
-\NameStatV{} goes towards~1 (slowly) as the dependence increases.  Both \NameStatChi{} and
-\NameStatPChi{} are very sensitive to~$n$, but in \NameStatV{} this is mitigated by taking the
-ratio.
-\end{Description}
-
-
-\paragraph{Nominal-vs-scale statistics.}
-Sample statistics that describe association between a categorical feature
-(order ignored) and a quantitative (scale) feature.
-The values of the categorical feature must be coded as positive integers.
-\begin{Description}
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it\NameStatEta]:
-A measure for the strength of association (statistical dependence) between a nominal feature
-and a scale feature, conceptually similar to \NameStatR.  Ranges from 0 to~1, approaching 0
-when there is no association and approaching 1 when there is a strong association.  
-The nominal feature, treated as the independent variable, is assumed to have relatively few
-possible values, all with large frequency counts.  The scale feature is treated as the dependent
-variable.  Denoting the nominal feature by~$x$ and the scale feature by~$y$, we have:
-\begin{equation*}
-\eta^2 \,=\, 1 - \frac{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
-\,\,\,\,\textrm{where}\,\,\,\,
-\hat{y}[x] = \frac{1}{\mathop{\mathrm{freq}}(x)}\sum_{i=1}^n  
-\,\left\{\!\!\begin{array}{rl} y_i & \textrm{if $x_i = x$}\\ 0 & \textrm{otherwise}\end{array}\right.\!\!\!
-\end{equation*}
-and $\bar{y} = (1{/}n)\sum_{i=1}^n y_i$ is the mean.  Value $\hat{y}[x]$ is the average 
-of~$y_i$ among all records where $x_i = x$; it can also be viewed as the ``predictor'' 
-of $y$ given~$x$.  Then $\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2$ is the residual error
-sum-of-squares and $\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum-of-squares for~$y$. 
-Hence, $\eta^2$ measures the accuracy of predicting $y$ with~$x$, just like the
-``R-squared'' statistic measures the accuracy of linear regression.  Our output $\eta$
-is the square root of~$\eta^2$.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it\NameStatF]:
-A measure of how much the values of the scale feature, denoted here by~$y$,
-deviate from statistical independence on the nominal feature, denoted by~$x$.
-The same measure appears in the one-way analysis of vari\-ance (ANOVA).
-Like \NameStatChi, \NameStatF{} is used to test the hypothesis that
-$y$~is independent from~$x$, given the following assumptions:
-\begin{Itemize}
-\item The scale feature $y$ has approximately normal distribution whose mean
-may depend only on~$x$ and variance is the same for all~$x$;
-\item The nominal feature $x$ has relatively small value domain with large
-frequency counts, the $x_i$-values are treated as fixed (non-random);
-\item All records are sampled independently of each other.
-\end{Itemize}
-To compute \NameStatF{}, we first compute $\hat{y}[x]$ as the average of~$y_i$
-among all records where $x_i = x$.  These $\hat{y}[x]$ can be viewed as
-``predictors'' of $y$ given~$x$; if $y$ is independent on~$x$, they should
-``predict'' only the global mean~$\bar{y}$.  Then we form two sums-of-squares:
-\begin{Itemize}
-\item \emph{Residual} sum-of-squares of the ``predictor'' accuracy: $y_i - \hat{y}[x_i]$;
-\item \emph{Explained} sum-of-squares of the ``predictor'' variability: $\hat{y}[x_i] - \bar{y}$.
-\end{Itemize}
-\NameStatF{} is the ratio of the explained sum-of-squares to
-the residual sum-of-squares, each divided by their corresponding degrees
-of freedom:
-\begin{equation*}
-F \,\,=\,\, 
-\frac{\sum_{x}\, \mathop{\mathrm{freq}}(x) \, \big(\hat{y}[x] - \bar{y}\big)^2 \,\big/\,\, (k\,{-}\,1)}%
-{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2 \,\big/\,\, (n\,{-}\,k)} \,\,=\,\,
-\frac{n\,{-}\,k}{k\,{-}\,1} \cdot \frac{\eta^2}{1 - \eta^2}
-\end{equation*}
-Here $k$ is the domain size of the nominal feature~$x$.  The $k$ ``predictors''
-lose 1~freedom due to their linear dependence with~$\bar{y}$; similarly,
-the $n$~$y_i$-s lose $k$~freedoms due to the ``predictors''.
-
-The statistic can test if the independence hypothesis of $y$ from $x$ is reasonable;
-more generally (with relaxed normality assumptions) it can test the hypothesis that
-\emph{the mean} of $y$ among records with a given~$x$ is the same for all~$x$.
-Under this hypothesis \NameStatF{} has, or approximates, the $F(k\,{-}\,1, n\,{-}\,k)$-distribution.
-But if the mean of $y$ given $x$ depends on~$x$, \NameStatF{}
-becomes arbitrarily large as $n\to\infty$ (with $k$~fixed) and lands extremely far
-into the tail of the $F(k\,{-}\,1, n\,{-}\,k)$-distribution given a large enough data sample.
-\end{Description}
-
-
-\paragraph{Ordinal-vs-ordinal statistics.}
-Sample statistics that describe association between two ordinal categorical features.
-Both features' value domains are encoded with positive integers, so that the natural
-order of the integers coincides with the order in each value domain.
-\begin{Description}
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it\NameStatRho]:
-A measure for the strength of association (statistical dependence) between
-two ordinal features, conceptually similar to \NameStatR.  Specifically, it is \NameStatR{}
-applied to the feature vectors in which all values are replaced by their ranks, i.e.\ 
-their positions if the vector is sorted.  The ranks of identical (duplicate) values
-are replaced with their average rank.  For example, in vector
-$(15, 11, 26, 15, 8)$ the value ``15'' occurs twice with ranks 3 and~4 per the sorted
-order $(8_1, 11_2, 15_3, 15_4, 26_5)$; so, both values are assigned their average
-rank of $3.5 = (3\,{+}\,4)\,{/}\,2$ and the vector is replaced by~$(3.5,\, 2,\, 5,\, 3.5,\, 1)$.
-
-Our implementation of \NameStatRho{} is geared towards features having small value domains
-and large counts for the values.  Given the two input vectors, we form a contingency table $T$
-of pairwise frequency counts, as well as a vector of frequency counts for each feature: $f_1$
-and~$f_2$.  Here in $T_{i,j}$, $f_{1,i}$, $f_{2,j}$ indices $i$ and~$j$ refer to the
-order-preserving integer encoding of the feature values.
-We use prefix sums over $f_1$ and~$f_2$ to compute the values' average ranks:
-$r_{1,i} = \sum_{j=1}^{i-1} f_{1,j} + (f_{1,i}\,{+}\,1){/}2$, and analogously for~$r_2$.
-Finally, we compute rank variances for $r_1, r_2$ weighted by counts $f_1, f_2$ and their
-covariance weighted by~$T$, before applying the standard formula for \NameStatR:
-\begin{equation*}
-\rho \,\,=\,\, \frac{\Cov_T(r_1, r_2)}{\sqrt{\Var_{f_1}(r_1)\Var_{f_2}(r_2)}}
-\,\,=\,\, \frac{\sum_{i,j} T_{i,j} (r_{1,i} - \bar{r}_1) (r_{2,j} - \bar{r}_2)}%
-{\sqrt{\sum_i f_{1,i} (r_{1,i} - \bar{r}_1)^{2\mathstrut} \cdot \sum_j f_{2,j} (r_{2,j} - \bar{r}_2)^{2\mathstrut}}}
-\end{equation*}
-where $\bar{r}_1 = \sum_i r_{1,i} f_{1,i}{/}n$, analogously for~$\bar{r}_2$.
-The value of $\rho$ lies between $-1$ and $+1$, with sign indicating the prevalent direction
-of the association: $\rho > 0$ ($\rho < 0$) means that one feature tends to increase (decrease)
-when the other feature increases.  The correlation becomes~1 when the two features are
-monotonically related.
-\end{Description}
-
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-A collection of (potentially) 4 matrices.  Each matrix contains bivariate statistics that
-resulted from a different combination of feature types.  There is one matrix for scale-scale
-statistics (which includes \NameStatR), one for nominal-nominal statistics (includes \NameStatChi{}),
-one for nominal-scale statistics (includes \NameStatF) and one for ordinal-ordinal statistics
-(includes \NameStatRho).  If any of these matrices is not produced, then no pair of columns required
-the corresponding type combination.  See Table~\ref{table:bivars} for the matrix naming and
-format details.
-
-
-\smallskip
-\pagebreak[2]
-
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-\hml -f \BivarScriptName{} -nvargs
-X=/user/biadmin/X.mtx 
-index1=/user/biadmin/S1.mtx 
-index2=/user/biadmin/S2.mtx 
-types1=/user/biadmin/K1.mtx 
-types2=/user/biadmin/K2.mtx 
-OUTDIR=/user/biadmin/stats.mtx
-
-}
-

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/DescriptiveStats.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/DescriptiveStats.tex b/Algorithms Reference/DescriptiveStats.tex
deleted file mode 100644
index 5a59ad4..0000000
--- a/Algorithms Reference/DescriptiveStats.tex	
+++ /dev/null
@@ -1,115 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\newcommand{\UnivarScriptName}{\texttt{\tt Univar-Stats.dml}}
-\newcommand{\BivarScriptName}{\texttt{\tt bivar-stats.dml}}
-
-\newcommand{\OutputRowIDMinimum}{1}
-\newcommand{\OutputRowIDMaximum}{2}
-\newcommand{\OutputRowIDRange}{3}
-\newcommand{\OutputRowIDMean}{4}
-\newcommand{\OutputRowIDVariance}{5}
-\newcommand{\OutputRowIDStDeviation}{6}
-\newcommand{\OutputRowIDStErrorMean}{7}
-\newcommand{\OutputRowIDCoeffVar}{8}
-\newcommand{\OutputRowIDQuartiles}{?, 13, ?}
-\newcommand{\OutputRowIDMedian}{13}
-\newcommand{\OutputRowIDIQMean}{14}
-\newcommand{\OutputRowIDSkewness}{9}
-\newcommand{\OutputRowIDKurtosis}{10}
-\newcommand{\OutputRowIDStErrorSkewness}{11}
-\newcommand{\OutputRowIDStErrorCurtosis}{12}
-\newcommand{\OutputRowIDNumCategories}{15}
-\newcommand{\OutputRowIDMode}{16}
-\newcommand{\OutputRowIDNumModes}{17}
-\newcommand{\OutputRowText}[1]{\mbox{(output row~{#1})\hspace{0.5pt}:}}
-
-\newcommand{\NameStatR}{Pearson's correlation coefficient}
-\newcommand{\NameStatChi}{Pearson's~$\chi^2$}
-\newcommand{\NameStatPChi}{$P\textrm{-}$value of Pearson's~$\chi^2$}
-\newcommand{\NameStatV}{Cram\'er's~$V$}
-\newcommand{\NameStatEta}{Eta statistic}
-\newcommand{\NameStatF}{$F$~statistic}
-\newcommand{\NameStatRho}{Spearman's rank correlation coefficient}
-
-Descriptive statistics are used to quantitatively describe the main characteristics of the data.
-They provide meaningful summaries computed over different observations or data records
-collected in a study.  These summaries typically form the basis of the initial data exploration
-as part of a more extensive statistical analysis.  Such a quantitative analysis assumes that
-every variable (also known as, attribute, feature, or column) in the data has a specific
-\emph{level of measurement}~\cite{Stevens1946:scales}.
-
-The measurement level of a variable, often referred to as its {\bf variable type}, can either be
-\emph{scale} or \emph{categorical}.  A \emph{scale} variable represents the data measured on
-an interval scale or ratio scale.  Examples of scale variables include `Height', `Weight',
-`Salary', and `Temperature'.  Scale variables are also referred to as \emph{quantitative}
-or \emph{continuous} variables.  In contrast, a \emph{categorical} variable has a fixed
-limited number of distinct values or categories.  Examples of categorical variables
-include `Gender', `Region', `Hair color', `Zipcode', and `Level of Satisfaction'.
-Categorical variables can further be classified into two types, \emph{nominal} and
-\emph{ordinal}, depending on whether the categories in the variable can be ordered via an
-intrinsic ranking.  For example, there is no meaningful ranking among distinct values in
-`Hair color' variable, while the categories in `Level of Satisfaction' can be ranked from
-highly dissatisfied to highly satisfied.
-
-The input dataset for descriptive statistics is provided in the form of a matrix, whose
-rows are the records (data points) and whose columns are the features (i.e.~variables).
-Some scripts allow this matrix to be vertically split into two or three matrices.  Descriptive
-statistics are computed over the specified features (columns) in the matrix.  Which
-statistics are computed depends on the types of the features.  It is important to keep
-in mind the following caveats and restrictions:
-\begin{Enumerate}
-\item  Given a finite set of data records, i.e.~a \emph{sample}, we take their feature
-values and compute their \emph{sample statistics}.  These statistics
-will vary from sample to sample even if the underlying distribution of feature values
-remains the same.  Sample statistics are accurate for the given sample only.
-If the goal is to estimate the \emph{distribution statistics} that are parameters of
-the (hypothesized) underlying distribution of the features, the corresponding sample
-statistics may sometimes be used as approximations, but their accuracy will vary.
-\item  In particular, the accuracy of the estimated distribution statistics will be low
-if the number of values in the sample is small.  That is, for small samples, the computed
-statistics may depend on the randomness of the individual sample values more than on
-the underlying distribution of the features.
-\item  The accuracy will also be low if the sample records cannot be assumed mutually
-independent and identically distributed (i.i.d.), that is, sampled at random from the
-same underlying distribution.  In practice, feature values in one record often depend
-on other features and other records, including unknown ones.
-\item  Most of the computed statistics will have low estimation accuracy in the presence of
-extreme values (outliers) or if the underlying distribution has heavy tails, for example
-obeys a power law.  However, a few of the computed statistics, such as the median and
-\NameStatRho{}, are \emph{robust} to outliers.
-\item  Some sample statistics are reported with their \emph{sample standard errors}
-in an attempt to quantify their accuracy as distribution parameter estimators.  But these
-sample standard errors, in turn, only estimate the underlying distribution's standard
-errors and will have low accuracy for small or \mbox{non-i.i.d.} samples, outliers in samples,
-or heavy-tailed distributions.
-\item  We assume that the quantitative (scale) feature columns do not contain missing
-values, infinite values, \texttt{NaN}s, or coded non-numeric values, unless otherwise
-specified.  We assume that each categorical feature column contains positive integers
-from 1 to the number of categories; for ordinal features, the natural order on
-the integers should coincide with the order on the categories.
-\end{Enumerate}
-
-\input{DescriptiveUnivarStats}
-
-\input{DescriptiveBivarStats}
-
-\input{DescriptiveStratStats}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/DescriptiveStratStats.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/DescriptiveStratStats.tex b/Algorithms Reference/DescriptiveStratStats.tex
deleted file mode 100644
index be0cffd..0000000
--- a/Algorithms Reference/DescriptiveStratStats.tex	
+++ /dev/null
@@ -1,306 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Stratified Bivariate Statistics}
-
-\noindent{\bf Description}
-\smallskip
-
-The {\tt stratstats.dml} script computes common bivariate statistics, such
-as correlation, slope, and their p-value, in parallel for many pairs of input
-variables in the presence of a confounding categorical variable.  The values
-of this confounding variable group the records into strata (subpopulations),
-in which all bivariate pairs are assumed free of confounding.  The script
-uses the same data model as in one-way analysis of covariance (ANCOVA), with
-strata representing population samples.  It also outputs univariate stratified
-and bivariate unstratified statistics.
-
-\begin{table}[t]\hfil
-\begin{tabular}{|l|ll|ll|ll||ll|}
-\hline
-Month of the year & \multicolumn{2}{l|}{October} & \multicolumn{2}{l|}{November} &
-    \multicolumn{2}{l||}{December} & \multicolumn{2}{c|}{Oct$\,$--$\,$Dec} \\
-Customers, millions    & 0.6 & 1.4 & 1.4 & 0.6 & 3.0 & 1.0 & 5.0 & 3.0 \\
-\hline
-Promotion (0 or 1)     & 0   & 1   & 0   & 1   & 0   & 1   & 0   & 1   \\
-Avg.\ sales per 1000   & 0.4 & 0.5 & 0.9 & 1.0 & 2.5 & 2.6 & 1.8 & 1.3 \\
-\hline
-\end{tabular}\hfil
-\caption{Stratification example: the effect of the promotion on average sales
-becomes reversed and amplified (from $+0.1$ to $-0.5$) if we ignore the months.}
-\label{table:stratexample}
-\end{table}
-
-To see how data stratification mitigates confounding, consider an (artificial)
-example in Table~\ref{table:stratexample}.  A highly seasonal retail item
-was marketed with and without a promotion over the final 3~months of the year.
-In each month a sale was more likely with the promotion than without it.
-But during the peak holiday season, when shoppers came in greater numbers and
-bought the item more often, the promotion was less frequently used.  As a result,
-if the 4-th quarter data is pooled together, the promotion's effect becomes
-reversed and magnified.  Stratifying by month restores the positive correlation.
-
-The script computes its statistics in parallel over all possible pairs from two
-specified sets of covariates.  The 1-st covariate is a column in input matrix~$X$
-and the 2-nd covariate is a column in input matrix~$Y$; matrices $X$ and~$Y$ may
-be the same or different.  The columns of interest are given by their index numbers
-in special matrices.  The stratum column, specified in its own matrix, is the same
-for all covariate pairs.
-
-Both covariates in each pair must be numerical, with the 2-nd covariate normally
-distributed given the 1-st covariate (see~Details).  Missing covariate values or
-strata are represented by~``NaN''.  Records with NaN's are selectively omitted
-wherever their NaN's are material to the output statistic.
-
-\smallskip
-\pagebreak[3]
-
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}stratstats.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} Xcid=}path/file
-{\tt{} Y=}path/file
-{\tt{} Ycid=}path/file
-{\tt{} S=}path/file
-{\tt{} Scid=}int
-{\tt{} O=}path/file
-{\tt{} fmt=}format
-
-}
-
-
-\smallskip
-\noindent{\bf Arguments}
-\begin{Description}
-\item[{\tt X}:]
-Location (on HDFS) to read matrix $X$ whose columns we want to use as
-the 1-st covariate (i.e.~as the feature variable)
-\item[{\tt Xcid}:] (default:\mbox{ }{\tt " "})
-Location to read the single-row matrix that lists all index numbers
-of the $X$-columns used as the 1-st covariate; the default value means
-``use all $X$-columns''
-\item[{\tt Y}:] (default:\mbox{ }{\tt " "})
-Location to read matrix $Y$ whose columns we want to use as the 2-nd
-covariate (i.e.~as the response variable); the default value means
-``use $X$ in place of~$Y$''
-\item[{\tt Ycid}:] (default:\mbox{ }{\tt " "})
-Location to read the single-row matrix that lists all index numbers
-of the $Y$-columns used as the 2-nd covariate; the default value means
-``use all $Y$-columns''
-\item[{\tt S}:] (default:\mbox{ }{\tt " "})
-Location to read matrix $S$ that has the stratum column.
-Note: the stratum column must contain small positive integers; all fractional
-values are rounded; stratum IDs of value ${\leq}\,0$ or NaN are treated as
-missing.  The default value for {\tt S} means ``use $X$ in place of~$S$''
-\item[{\tt Scid}:] (default:\mbox{ }{\tt 1})
-The index number of the stratum column in~$S$
-\item[{\tt O}:]
-Location to store the output matrix defined in Table~\ref{table:stratoutput}
-\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-see read/write functions in SystemML Language Reference for details.
-\end{Description}
-
-
-\begin{table}[t]\small\hfil
-\begin{tabular}{|rcl|rcl|}
-\hline
-& Col.\# & Meaning & & Col.\# & Meaning \\
-\hline
-\multirow{9}{*}{\begin{sideways}1-st covariate\end{sideways}}\hspace{-1em}
-& 01     & $X$-column number                & 
-\multirow{9}{*}{\begin{sideways}2-nd covariate\end{sideways}}\hspace{-1em}
-& 11     & $Y$-column number                \\
-& 02     & presence count for $x$           & 
-& 12     & presence count for $y$           \\
-& 03     & global mean $(x)$                & 
-& 13     & global mean $(y)$                \\
-& 04     & global std.\ dev. $(x)$          & 
-& 14     & global std.\ dev. $(y)$          \\
-& 05     & stratified std.\ dev. $(x)$      & 
-& 15     & stratified std.\ dev. $(y)$      \\
-& 06     & $R^2$ for $x \sim {}$strata      & 
-& 16     & $R^2$ for $y \sim {}$strata      \\
-& 07     & adjusted $R^2$ for $x \sim {}$strata      & 
-& 17     & adjusted $R^2$ for $y \sim {}$strata      \\
-& 08     & p-value, $x \sim {}$strata       & 
-& 18     & p-value, $y \sim {}$strata       \\
-& 09--10 & reserved                         & 
-& 19--20 & reserved                         \\
-\hline
-\multirow{9}{*}{\begin{sideways}$y\sim x$, NO strata\end{sideways}}\hspace{-1.15em}
-& 21     & presence count $(x, y)$          &
-\multirow{10}{*}{\begin{sideways}$y\sim x$ AND strata$\!\!\!\!$\end{sideways}}\hspace{-1.15em}
-& 31     & presence count $(x, y, s)$       \\
-& 22     & regression slope                 &
-& 32     & regression slope                 \\
-& 23     & regres.\ slope std.\ dev.        &
-& 33     & regres.\ slope std.\ dev.        \\
-& 24     & correlation${} = \pm\sqrt{R^2}$  &
-& 34     & correlation${} = \pm\sqrt{R^2}$  \\
-& 25     & residual std.\ dev.              &
-& 35     & residual std.\ dev.              \\
-& 26     & $R^2$ in $y$ due to $x$          &
-& 36     & $R^2$ in $y$ due to $x$          \\
-& 27     & adjusted $R^2$ in $y$ due to $x$ &
-& 37     & adjusted $R^2$ in $y$ due to $x$ \\
-& 28     & p-value for ``slope = 0''        &
-& 38     & p-value for ``slope = 0''        \\
-& 29     & reserved                         &
-& 39     & \# strata with ${\geq}\,2$ count \\
-& 30     & reserved                         &
-& 40     & reserved                         \\
-\hline
-\end{tabular}\hfil
-\caption{The {\tt stratstats.dml} output matrix has one row for each distinct
-pair of 1-st and 2-nd covariates, and 40 columns with the statistics described
-here.}
-\label{table:stratoutput}
-\end{table}
-
-
-
-
-\noindent{\bf Details}
-\smallskip
-
-Suppose we have $n$ records of format $(i, x, y)$, where $i\in\{1,\ldots, k\}$ is
-a stratum number and $(x, y)$ are two numerical covariates.  We want to analyze
-the conditional linear relationship between $y$ and $x$ conditioned on~$i$.
-Note that $x$, but not~$y$, may represent a categorical variable if we assign a
-numerical value to each category, for example 0 and 1 for two categories.
-
-We assume a linear regression model for~$y$:
-\begin{equation}
-y_{i,j} \,=\, \alpha_i + \beta x_{i,j} + \eps_{i,j}\,, \quad\textrm{where}\,\,\,\,
-\eps_{i,j} \sim \Normal(0, \sigma^2)
-\label{eqn:stratlinmodel}
-\end{equation}
-Here $i = 1\ldots k$ is a stratum number and $j = 1\ldots n_i$ is a record number
-in stratum~$i$; by $n_i$ we denote the number of records available in stratum~$i$.
-The noise term~$\eps_{i,j}$ is assumed to have the same variance in all strata.
-When $n_i\,{>}\,0$, we can estimate the means of $x_{i, j}$ and $y_{i, j}$ in
-stratum~$i$ as
-\begin{equation*}
-\bar{x}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,x_{i, j}\Big) / n_i\,;\quad
-\bar{y}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,y_{i, j}\Big) / n_i
-\end{equation*}
-If $\beta$ is known, the best estimate for $\alpha_i$ is $\bar{y}_i - \beta \bar{x}_i$,
-which gives the prediction error sum-of-squares of
-\begin{equation}
-\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \beta x_{i,j} - (\bar{y}_i - \beta \bar{x}_i)\big)^2
-\,\,=\,\, \beta^{2\,}V_x \,-\, 2\beta \,V_{x,y} \,+\, V_y
-\label{eqn:stratsumsq}
-\end{equation}
-where $V_x$, $V_y$, and $V_{x, y}$ are, correspondingly, the ``stratified'' sample
-estimates of variance $\Var(x)$ and $\Var(y)$ and covariance $\Cov(x,y)$ computed as
-\begin{align*}
-V_x     \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)^2; \quad
-V_y     \,=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - \bar{y}_i\big)^2;\\
-V_{x,y} \,&=\, \sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(x_{i,j} - \bar{x}_i\big)\big(y_{i,j} - \bar{y}_i\big)
-\end{align*}
-They are stratified because we compute the sample (co-)variances in each stratum~$i$
-separately, then combine by summation.  The stratified estimates for $\Var(X)$ and $\Var(Y)$
-tend to be smaller than the non-stratified ones (with the global mean instead of $\bar{x}_i$
-and~$\bar{y}_i$) since $\bar{x}_i$ and $\bar{y}_i$ fit closer to $x_{i,j}$ and $y_{i,j}$
-than the global means.  The stratified variance estimates the uncertainty in $x_{i,j}$ 
-and~$y_{i,j}$ given their stratum~$i$.
-
-Minimizing over~$\beta$ the error sum-of-squares~(\ref{eqn:stratsumsq})
-gives us the regression slope estimate \mbox{$\hat{\beta} = V_{x,y} / V_x$},
-with~(\ref{eqn:stratsumsq}) becoming the residual sum-of-squares~(RSS):
-\begin{equation*}
-\mathrm{RSS} \,\,=\, \,
-\sum\nolimits_{i=1}^k \sum\nolimits_{j=1}^{n_i} \big(y_{i,j} - 
-\hat{\beta} x_{i,j} - (\bar{y}_i - \hat{\beta} \bar{x}_i)\big)^2
-\,\,=\,\,  V_y \,\big(1 \,-\, V_{x,y}^2 / (V_x V_y)\big)
-\end{equation*}
-The quantity $\hat{R}^2 = V_{x,y}^2 / (V_x V_y)$, called \emph{$R$-squared}, estimates the fraction
-of stratified variance in~$y_{i,j}$ explained by covariate $x_{i, j}$ in the linear 
-regression model~(\ref{eqn:stratlinmodel}).  We define \emph{stratified correlation} as the
-square root of~$\hat{R}^2$ taken with the sign of~$V_{x,y}$.  We also use RSS to estimate
-the residual standard deviation $\sigma$ in~(\ref{eqn:stratlinmodel}) that models the prediction error
-of $y_{i,j}$ given $x_{i,j}$ and the stratum:
-\begin{equation*}
-\hat{\beta}\, =\, \frac{V_{x,y}}{V_x}; \,\,\,\, \hat{R} \,=\, \frac{V_{x,y}}{\sqrt{V_x V_y}};
-\,\,\,\, \hat{R}^2 \,=\, \frac{V_{x,y}^2}{V_x V_y};
-\,\,\,\, \hat{\sigma} \,=\, \sqrt{\frac{\mathrm{RSS}}{n - k - 1}}\,\,\,\,
-\Big(n = \sum_{i=1}^k n_i\Big)
-\end{equation*}
-
-The $t$-test and the $F$-test for the null-hypothesis of ``$\beta = 0$'' are
-obtained by considering the effect of $\hat{\beta}$ on the residual sum-of-squares,
-measured by the decrease from $V_y$ to~RSS.
-The $F$-statistic is the ratio of the ``explained'' sum-of-squares
-to the residual sum-of-squares, divided by their corresponding degrees of freedom.
-There are $n\,{-}\,k$ degrees of freedom for~$V_y$, parameter $\beta$ reduces that
-to $n\,{-}\,k\,{-}\,1$ for~RSS, and their difference $V_y - {}$RSS has just 1 degree
-of freedom:
-\begin{equation*}
-F \,=\, \frac{(V_y - \mathrm{RSS})/1}{\mathrm{RSS}/(n\,{-}\,k\,{-}\,1)}
-\,=\, \frac{\hat{R}^2\,(n\,{-}\,k\,{-}\,1)}{1-\hat{R}^2}; \quad
-t \,=\, \hat{R}\, \sqrt{\frac{n\,{-}\,k\,{-}\,1}{1-\hat{R}^2}}.
-\end{equation*}
-The $t$-statistic is simply the square root of the $F$-statistic with the appropriate
-choice of sign.  If the null hypothesis and the linear model are both true, the $t$-statistic
-has Student $t$-distribution with $n\,{-}\,k\,{-}\,1$ degrees of freedom.  We can
-also compute it if we divide $\hat{\beta}$ by its estimated standard deviation:
-\begin{equation*}
-\stdev(\hat{\beta})_{\mathrm{est}} \,=\, \hat{\sigma}\,/\sqrt{V_x} \quad\Longrightarrow\quad
-t \,=\, \hat{R}\sqrt{V_y} \,/\, \hat{\sigma} \,=\, \hat{\beta} \,/\, \stdev(\hat{\beta})_{\mathrm{est}}
-\end{equation*}
-The standard deviation estimate for~$\beta$ is included in {\tt stratstats.dml} output.
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-The output matrix format is defined in Table~\ref{table:stratoutput}.
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-\hml -f stratstats.dml -nvargs X=/user/biadmin/X.mtx Xcid=/user/biadmin/Xcid.mtx
-  Y=/user/biadmin/Y.mtx Ycid=/user/biadmin/Ycid.mtx S=/user/biadmin/S.mtx Scid=2
-  O=/user/biadmin/Out.mtx fmt=csv
-
-}
-{\hangindent=\parindent\noindent\tt
-\hml -f stratstats.dml -nvargs X=/user/biadmin/Data.mtx Xcid=/user/biadmin/Xcid.mtx
-  Ycid=/user/biadmin/Ycid.mtx Scid=7 O=/user/biadmin/Out.mtx
-
-}
-
-%\smallskip
-%\noindent{\bf See Also}
-%\smallskip
-%
-%For non-stratified bivariate statistics with a wider variety of input data types
-%and statistical tests, see \ldots.  For general linear regression, see
-%{\tt LinearRegDS.dml} and {\tt LinearRegCG.dml}.  For logistic regression, appropriate
-%when the response variable is categorical, see {\tt MultiLogReg.dml}.
-
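
The stratified estimators derived above ($V_x$, $V_y$, $V_{x,y}$, the slope, correlation, residual standard deviation, and $t$-statistic) can be sanity-checked with a small NumPy sketch. This is not the stratstats.dml implementation; the function name and toy data are illustrative only.

# Hypothetical NumPy sketch of the stratified estimators derived above;
# it is not the stratstats.dml implementation and the names are illustrative.
import numpy as np

def stratified_stats(s, x, y):
    # s: stratum IDs (small positive integers), x/y: numeric covariates, length n
    n = len(x)
    strata = np.unique(s)
    k = len(strata)
    vx = vy = vxy = 0.0
    for stratum in strata:
        m = (s == stratum)
        dx = x[m] - x[m].mean()          # center within the stratum
        dy = y[m] - y[m].mean()          # single-record strata contribute zero
        vx += (dx * dx).sum()
        vy += (dy * dy).sum()
        vxy += (dx * dy).sum()
    beta = vxy / vx                      # stratified regression slope
    r = vxy / np.sqrt(vx * vy)           # stratified correlation
    rss = vy * (1.0 - r * r)             # residual sum of squares
    sigma = np.sqrt(rss / (n - k - 1))   # residual standard deviation
    t = beta / (sigma / np.sqrt(vx))     # t-statistic for the null "slope = 0"
    return beta, r, sigma, t

# Toy data in the spirit of the stratification example: monthly strata,
# promotion flag x, average sales y (the values here are made up).
s = np.array([1, 1, 2, 2, 3, 3])
x = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
y = np.array([0.4, 0.5, 0.9, 1.1, 2.5, 2.6])
print(stratified_stats(s, x, y))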

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/DescriptiveUnivarStats.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/DescriptiveUnivarStats.tex b/Algorithms Reference/DescriptiveUnivarStats.tex
deleted file mode 100644
index 5838e3e..0000000
--- a/Algorithms Reference/DescriptiveUnivarStats.tex	
+++ /dev/null
@@ -1,603 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Univariate Statistics}
-
-\noindent{\bf Description}
-\smallskip
-
-\emph{Univariate statistics} are the simplest form of descriptive statistics in data
-analysis.  They are used to quantitatively describe the main characteristics of each
-feature in the data.  For a given dataset matrix, script \UnivarScriptName{} computes
-certain univariate statistics for each feature column in the
-matrix.  The feature type governs the exact set of statistics computed for that feature.
-For example, the statistic \emph{mean} can only be computed on a quantitative (scale)
-feature such as `Height' or `Temperature'.  It does not make sense to compute the mean
-of a categorical attribute like `Hair Color'.
-
-
-\smallskip
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%\tolerance=0
-{\tt{}-f } \UnivarScriptName{}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} TYPES=}path/file
-{\tt{} STATS=}path/file
-% {\tt{} fmt=}format
-
-}
-
-
-\medskip
-\pagebreak[2]
-\noindent{\bf Arguments}
-\begin{Description}
-\item[{\tt X}:]
-Location (on HDFS) to read the data matrix $X$ whose columns we want to
-analyze as the features.
-\item[{\tt TYPES}:] % (default:\mbox{ }{\tt " "})
-Location (on HDFS) to read the single-row matrix whose $i^{\textrm{th}}$
-column-cell contains the type of the $i^{\textrm{th}}$ feature column
-\texttt{X[,$\,i$]} in the data matrix.  Feature types must be encoded by
-integer numbers: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
-% The default value means ``treat all $X$-columns as scale.''
-\item[{\tt STATS}:]
-Location (on HDFS) where the output matrix of computed statistics
-will be stored.  The format of the output matrix is defined by
-Table~\ref{table:univars}.
-% \item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-% Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-% see read/write functions in SystemML Language Reference for details.
-\end{Description}
-
-\begin{table}[t]\hfil
-\begin{tabular}{|rl|c|c|}
-\hline
-\multirow{2}{*}{Row}& \multirow{2}{*}{Name of Statistic} & \multicolumn{2}{c|}{Applies to:} \\
-                            &                            & Scale & Categ.\\
-\hline
-\OutputRowIDMinimum         & Minimum                    &   +   &       \\
-\OutputRowIDMaximum         & Maximum                    &   +   &       \\
-\OutputRowIDRange           & Range                      &   +   &       \\
-\OutputRowIDMean            & Mean                       &   +   &       \\
-\OutputRowIDVariance        & Variance                   &   +   &       \\
-\OutputRowIDStDeviation     & Standard deviation         &   +   &       \\
-\OutputRowIDStErrorMean     & Standard error of mean     &   +   &       \\
-\OutputRowIDCoeffVar        & Coefficient of variation   &   +   &       \\
-\OutputRowIDSkewness        & Skewness                   &   +   &       \\
-\OutputRowIDKurtosis        & Kurtosis                   &   +   &       \\
-\OutputRowIDStErrorSkewness & Standard error of skewness &   +   &       \\
-\OutputRowIDStErrorCurtosis & Standard error of kurtosis &   +   &       \\
-\OutputRowIDMedian          & Median                     &   +   &       \\
-\OutputRowIDIQMean          & Inter quartile mean        &   +   &       \\
-\OutputRowIDNumCategories   & Number of categories       &       &   +   \\
-\OutputRowIDMode            & Mode                       &       &   +   \\
-\OutputRowIDNumModes        & Number of modes            &       &   +   \\
-\hline
-\end{tabular}\hfil
-\caption{The output matrix of \UnivarScriptName{} has one row for each
-univariate statistic and one column per input feature.  This table lists
-the meaning of each row.  Signs ``+'' show applicability to scale or/and
-to categorical features.}
-\label{table:univars}
-\end{table}
-
-
-\pagebreak[1]
-
-\smallskip
-\noindent{\bf Details}
-\smallskip
-
-Given an input matrix \texttt{X}, this script computes the set of all
-relevant univariate statistics for each feature column \texttt{X[,$\,i$]}
-in~\texttt{X}.  The list of statistics to be computed depends on the
-\emph{type}, or \emph{measurement level}, of each column.
-The \textrm{TYPES} command-line argument points to a vector containing
-the types of all columns.  The types must be provided as per the following
-convention: $1 = {}$scale, $2 = {}$nominal, $3 = {}$ordinal.
-
-Below we list all univariate statistics computed by script \UnivarScriptName.
-The statistics are collected by relevance into several groups, namely: central
-tendency, dispersion, shape, and categorical measures.  The first three groups
-contain statistics computed for a quantitative (also known as: numerical, scale,
-or continuous) feature; the last group contains the statistics for a categorical
-(either nominal or ordinal) feature.  
-
-Let~$n$ be the number of data records (rows) with feature values.
-In what follows we fix a column index \texttt{idx} and consider
-sample statistics of feature column \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}.
-Let $v = (v_1, v_2, \ldots, v_n)$ be the values of \texttt{X[}$\,$\texttt{,}$\,$\texttt{idx]}
-in their original unsorted order: $v_i = \texttt{X[}i\texttt{,}\,\texttt{idx]}$.
-Let $v^s = (v^s_1, v^s_2, \ldots, v^s_n)$ be the same values in the sorted order,
-preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
-
-\paragraph{Central tendency measures.}
-Sample statistics that describe the location of the quantitative (scale) feature distribution
-by representing it with a single value.
-\begin{Description}
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Mean]
-\OutputRowText{\OutputRowIDMean}
-The arithmetic average over a sample of a quantitative feature.
-Computed as the ratio between the sum of values and the number of values:
-$\left(\sum_{i=1}^n v_i\right)\!/n$.
-Example: the mean of sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-equals~5.2.
-
-Note that the mean is significantly affected by extreme values in the sample
-and may be misleading as a central tendency measure if the feature varies on
-exponential scale.  For example, the mean of $\{$0.01, 0.1, 1.0, 10.0, 100.0$\}$
-is 22.222, greater than all the sample values except the~largest.
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\begin{figure}[t]
-\setlength{\unitlength}{10pt}
-\begin{picture}(33,12)
-\put( 6.2, 0.0){\small 2.2}
-\put(10.2, 0.0){\small 3.2}
-\put(12.2, 0.0){\small 3.7}
-\put(15.0, 0.0){\small 4.4}
-\put(18.6, 0.0){\small 5.3}
-\put(20.2, 0.0){\small 5.7}
-\put(21.75,0.0){\small 6.1}
-\put(23.05,0.0){\small 6.4}
-\put(26.2, 0.0){\small 7.2}
-\put(28.6, 0.0){\small 7.8}
-\put( 0.5, 0.7){\small 0.0}
-\put( 0.1, 3.2){\small 0.25}
-\put( 0.5, 5.7){\small 0.5}
-\put( 0.1, 8.2){\small 0.75}
-\put( 0.5,10.7){\small 1.0}
-\linethickness{1.5pt}
-\put( 2.0, 1.0){\line(1,0){4.8}}
-\put( 6.8, 1.0){\line(0,1){1.0}}
-\put( 6.8, 2.0){\line(1,0){4.0}}
-\put(10.8, 2.0){\line(0,1){1.0}}
-\put(10.8, 3.0){\line(1,0){2.0}}
-\put(12.8, 3.0){\line(0,1){1.0}}
-\put(12.8, 4.0){\line(1,0){2.8}}
-\put(15.6, 4.0){\line(0,1){1.0}}
-\put(15.6, 5.0){\line(1,0){3.6}}
-\put(19.2, 5.0){\line(0,1){1.0}}
-\put(19.2, 6.0){\line(1,0){1.6}}
-\put(20.8, 6.0){\line(0,1){1.0}}
-\put(20.8, 7.0){\line(1,0){1.6}}
-\put(22.4, 7.0){\line(0,1){1.0}}
-\put(22.4, 8.0){\line(1,0){1.2}}
-\put(23.6, 8.0){\line(0,1){1.0}}
-\put(23.6, 9.0){\line(1,0){3.2}}
-\put(26.8, 9.0){\line(0,1){1.0}}
-\put(26.8,10.0){\line(1,0){2.4}}
-\put(29.2,10.0){\line(0,1){1.0}}
-\put(29.2,11.0){\line(1,0){4.8}}
-\linethickness{0.3pt}
-\put( 6.8, 1.0){\circle*{0.3}}
-\put(10.8, 1.0){\circle*{0.3}}
-\put(12.8, 1.0){\circle*{0.3}}
-\put(15.6, 1.0){\circle*{0.3}}
-\put(19.2, 1.0){\circle*{0.3}}
-\put(20.8, 1.0){\circle*{0.3}}
-\put(22.4, 1.0){\circle*{0.3}}
-\put(23.6, 1.0){\circle*{0.3}}
-\put(26.8, 1.0){\circle*{0.3}}
-\put(29.2, 1.0){\circle*{0.3}}
-\put( 6.8, 1.0){\vector(1,0){27.2}}
-\put( 2.0, 1.0){\vector(0,1){10.8}}
-\put( 2.0, 3.5){\line(1,0){10.8}}
-\put( 2.0, 6.0){\line(1,0){17.2}}
-\put( 2.0, 8.5){\line(1,0){21.6}}
-\put( 2.0,11.0){\line(1,0){27.2}}
-\put(12.8, 1.0){\line(0,1){2.0}}
-\put(19.2, 1.0){\line(0,1){5.0}}
-\put(20.0, 1.0){\line(0,1){5.0}}
-\put(23.6, 1.0){\line(0,1){7.0}}
-\put( 9.0, 4.0){\line(1,0){3.8}}
-\put( 9.2, 2.7){\vector(0,1){0.8}}
-\put( 9.2, 4.8){\vector(0,-1){0.8}}
-\put(19.4, 8.0){\line(1,0){3.0}}
-\put(19.6, 7.2){\vector(0,1){0.8}}
-\put(19.6, 9.3){\vector(0,-1){0.8}}
-\put(13.0, 2.2){\small $q_{25\%}$}
-\put(17.3, 2.2){\small $q_{50\%}$}
-\put(23.8, 2.2){\small $q_{75\%}$}
-\put(20.15,3.5){\small $\mu$}
-\put( 8.0, 3.75){\small $\phi_1$}
-\put(18.35,7.8){\small $\phi_2$}
-\end{picture}
-\caption{The computation of quartiles, median, and interquartile mean from the
-empirical distribution function of the 10-point
-sample $\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$.  Each vertical step in
-the graph has height~$1{/}n = 0.1$.  Values $q_{25\%}$, $q_{50\%}$, and $q_{75\%}$ denote
-the $1^{\textrm{st}}$, $2^{\textrm{nd}}$, and $3^{\textrm{rd}}$ quartiles respectively;
-value~$\mu$ denotes the median.  Values $\phi_1$ and $\phi_2$ show the partial contribution
-of border points (quartiles) $v_3=3.7$ and $v_8=6.4$ into the interquartile mean.}
-\label{fig:example_quartiles}
-\end{figure}
-
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Median]
-\OutputRowText{\OutputRowIDMedian}
-The ``middle'' value that separates the higher half of the sample values
-(in a sorted order) from the lower half.
-To compute the median, we sort the sample in the increasing order, preserving
-duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
-If $n$ is odd, the median equals $v^s_i$ where $i = (n\,{+}\,1)\,{/}\,2$,
-same as the $50^{\textrm{th}}$~percentile of the sample.
-If $n$ is even, there are two ``middle'' values $v^s_{n/2}$ and $v^s_{n/2\,+\,1}$,
-so we compute the median as the mean of these two values.
-(For even~$n$ we compute the $50^{\textrm{th}}$~percentile as~$v^s_{n/2}$,
-not as the median.)  Example: the median of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-equals $(5.3\,{+}\,5.7)\,{/}\,2$~${=}$~5.5, see Figure~\ref{fig:example_quartiles}.
-
-Unlike the mean, the median is not sensitive to extreme values in the sample,
-i.e.\ it is robust to outliers.  It works better as a measure of central tendency
-for heavy-tailed distributions and features that vary on exponential scale.
-However, the median is sensitive to small sample size.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Interquartile mean]
-\OutputRowText{\OutputRowIDIQMean}
-For a sample of a quantitative feature, this is
-the mean of the values greater than or equal to the $1^{\textrm{st}}$ quartile
-and less than or equal to the $3^{\textrm{rd}}$ quartile.
-In other words, it is a ``truncated mean'' where the lowest 25$\%$ and
-the highest 25$\%$ of the sorted values are omitted in its computation.
-The two ``border values'', i.e.\ the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$
-quartiles themselves, contribute to this mean only partially.
-This measure is occasionally used as the ``robust'' version of the mean
-that is less sensitive to the extreme values.
-
-To compute the measure, we sort the sample in the increasing order,
-preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
-We set $j = \lceil n{/}4 \rceil$ for the $1^{\textrm{st}}$ quartile index
-and $k = \lceil 3n{/}4 \rceil$ for the $3^{\textrm{rd}}$ quartile index,
-then compute the following weighted mean:
-\begin{equation*}
-\frac{1}{3{/}4 - 1{/}4} \left[
-\left(\frac{j}{n} - \frac{1}{4}\right) v^s_j \,\,+ 
-\sum_{j<i<k} \left(\frac{i}{n} - \frac{i\,{-}\,1}{n}\right) v^s_i 
-\,\,+\,\, \left(\frac{3}{4} - \frac{k\,{-}\,1}{n}\right) v^s_k\right]
-\end{equation*}
-In other words, all sample values between the $1^{\textrm{st}}$ and the $3^{\textrm{rd}}$
-quartile enter the sum with weights $2{/}n$, times their number of duplicates, while the
-two quartiles themselves enter the sum with reduced weights.  The weights are proportional
-to the vertical steps in the empirical distribution function of the sample, see
-Figure~\ref{fig:example_quartiles} for an illustration.
-Example: the interquartile mean of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals the sum
-$0.1 (3.7\,{+}\,6.4) + 0.2 (4.4\,{+}\,5.3\,{+}\,5.7\,{+}\,6.1)$,
-which equals~5.31.
-\end{Description}
-
-
-\paragraph{Dispersion measures.}
-Statistics that describe the amount of variation or spread in a quantitative
-(scale) data feature.
-\begin{Description}
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Variance]
-\OutputRowText{\OutputRowIDVariance}
-A measure of dispersion, or spread-out, of sample values around their mean,
-expressed in units that are the square of those of the feature itself.
-Computed as the sum of squared differences between the values
-in the sample and their mean, divided by one less than the number of
-values: $\sum_{i=1}^n (v_i - \bar{v})^2\,/\,(n\,{-}\,1)$ where 
-$\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
-Example: the variance of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~3.24.
-Note that at least two values ($n\geq 2$) are required to avoid division
-by zero.  Sample variance is sensitive to outliers, even more than the mean.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Standard deviation]
-\OutputRowText{\OutputRowIDStDeviation}
-A measure of dispersion around the mean, the square root of variance.
-Computed by taking the square root of the sample variance;
-see \emph{Variance} above on computing the variance.
-Example: the standard deviation of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$ equals~1.8.
-At least two values are required to avoid division by zero.
-Note that standard deviation is sensitive to outliers.  
-
-Standard deviation is used in conjunction with the mean to determine
-an interval containing a given percentage of the feature values,
-assuming the normal distribution.  In a large sample from a normal
-distribution, around 68\% of the cases fall within one standard
-deviation and around 95\% of cases fall within two standard deviations
-of the mean.  For example, if the mean age is 45 with a standard deviation
-of 10, around 95\% of the cases would be between 25 and 65 in a normal
-distribution.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Coefficient of variation]
-\OutputRowText{\OutputRowIDCoeffVar}
-The ratio of the standard deviation to the mean, i.e.\ the
-\emph{relative} standard deviation, of a quantitative feature sample.
-Computed by dividing the sample \emph{standard deviation} by the
-sample \emph{mean}, see above for their computation details.
-Example: the coefficient of variation for sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-equals 1.8$\,{/}\,$5.2~${\approx}$~0.346.
-
-This metric is used primarily with non-negative features such as
-financial or population data.  It is sensitive to outliers.
-Note: zero mean causes division by zero, returning infinity or \texttt{NaN}.
-At least two values (records) are required to compute the standard deviation.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Minimum]
-\OutputRowText{\OutputRowIDMinimum}
-The smallest value of a quantitative sample, computed as $\min v = v^s_1$.
-Example: the minimum of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-equals~2.2.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Maximum]
-\OutputRowText{\OutputRowIDMaximum}
-The largest value of a quantitative sample, computed as $\max v = v^s_n$.
-Example: the maximum of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-equals~7.8.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Range]
-\OutputRowText{\OutputRowIDRange}
-The difference between the largest and the smallest value of a quantitative
-sample, computed as $\max v - \min v = v^s_n - v^s_1$.
-It provides information about the overall spread of the sample values.
-Example: the range of sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-equals 7.8$\,{-}\,$2.2~${=}$~5.6.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Standard error of the mean]
-\OutputRowText{\OutputRowIDStErrorMean}
-A measure of how much the value of the sample mean may vary from sample
-to sample taken from the same (hypothesized) distribution of the feature.
-It helps to roughly bound the distribution mean, i.e.\
-the limit of the sample mean as the sample size tends to infinity.
-Under certain assumptions (e.g.\ normality and large sample), the difference
-between the distribution mean and the sample mean is unlikely to exceed
-2~standard errors.
-
-The measure is computed by dividing the sample standard deviation
-by the square root of the number of values~$n$; see \emph{standard deviation}
-for its computation details.  Ensure $n\,{\geq}\,2$ to avoid division by~0.
-Example: for sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-with the mean of~5.2 the standard error of the mean
-equals 1.8$\,{/}\sqrt{10}$~${\approx}$~0.569.
-
-Note that the standard error itself is subject to sample randomness.
-Its accuracy as an error estimator may be low if the sample size is small
-or \mbox{non-i.i.d.}, if there are outliers, or if the distribution has
-heavy tails.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-% \item[\it Quartiles]
-% \OutputRowText{\OutputRowIDQuartiles}
-% %%% dsDefn %%%%
-% The values of a quantitative feature
-% that divide an ordered/sorted set of data records into four equal-size groups.
-% The $1^{\textrm{st}}$ quartile, or the $25^{\textrm{th}}$ percentile, splits
-% the sorted data into the lowest $25\%$ and the highest~$75\%$.  In other words,
-% it is the middle value between the minimum and the median.  The $2^{\textrm{nd}}$
-% quartile is the median itself, the value that separates the higher half of
-% the data (in the sorted order) from the lower half.  Finally, the $3^{\textrm{rd}}$
-% quartile, or the $75^{\textrm{th}}$ percentile, divides the sorted data into
-% lowest $75\%$ and highest~$25\%$.\par
-% %%% dsComp %%%%
-% To compute the quartiles for a data column \texttt{X[,i]} with $n$ numerical values
-% we sort it in the increasing order, preserving duplicates, then return 
-% \texttt{X}${}^{\textrm{sort}}$\texttt{[}$k$\texttt{,i]}
-% where $k = \lceil pn \rceil$ for $p = 0.25$, $0.5$, and~$0.75$.
-% When $n$ is even, the $2^{\textrm{nd}}$ quartile (the median) is further adjusted
-% to equal the mean of two middle values
-% $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\texttt{,i]}$ and
-% $\texttt{X}^{\textrm{sort}}\texttt{[}n{/}2\,{+}\,1\texttt{,i]}$.
-% %%% dsWarn %%%%
-% We assume that the feature column does not contain \texttt{NaN}s or coded non-numeric values.
-% %%% dsExmpl %%%
-% \textbf{Example(s).}
-\end{Description}
-
-
-\paragraph{Shape measures.}
-Statistics that describe the shape and symmetry of the quantitative (scale)
-feature distribution estimated from a sample of its values.
-\begin{Description}
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Skewness]
-\OutputRowText{\OutputRowIDSkewness}
-It measures how symmetrically the values of a feature are spread out
-around the mean.  A significant positive skewness implies a longer (or fatter)
-right tail, i.e. feature values tend to lie farther away from the mean on the
-right side.  A significant negative skewness implies a longer (or fatter) left
-tail.  The normal distribution is symmetric and has a skewness value of~0;
-however, its sample skewness is likely to be nonzero, just close to zero.
-As a guideline, a skewness value more than twice its standard error is taken
-to indicate a departure from symmetry.
-
-Skewness is computed as the $3^{\textrm{rd}}$~central moment divided by the cube
-of the standard deviation.  We estimate the $3^{\textrm{rd}}$~central moment as
-the sum of cubed differences between the values in the feature column and their
-sample mean, divided by the number of values:  
-$\sum_{i=1}^n (v_i - \bar{v})^3 / n$
-where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
-The standard deviation is computed
-as described above in \emph{standard deviation}.  To avoid division by~0,
-at least two different sample values are required.  Example: for sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-with the mean of~5.2 and the standard deviation of~1.8
-skewness is estimated as $-1.0728\,{/}\,1.8^3 \approx -0.184$.
-Note: skewness is sensitive to outliers.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Standard error in skewness]
-\OutputRowText{\OutputRowIDStErrorSkewness}
-A measure of how much the sample skewness may vary from sample to sample,
-assuming that the feature is normally distributed, which makes its
-distribution skewness equal~0.  
-Given the number~$n$ of sample values, the standard error is computed as
-\begin{equation*}
-\sqrt{\frac{6n\,(n-1)}{(n-2)(n+1)(n+3)}}
-\end{equation*}
-This measure can tell us, for example:
-\begin{Itemize}
-\item If the sample skewness lands within two standard errors from~0, its
-positive or negative sign is not significant and may just be accidental.
-\item If the sample skewness lands outside this interval, the feature
-is unlikely to be normally distributed.
-\end{Itemize}
-At least 3~values ($n\geq 3$) are required to avoid arithmetic failure.
-Note that the standard error is inaccurate if the feature distribution is
-far from normal or if the number of samples is small.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Kurtosis]
-\OutputRowText{\OutputRowIDKurtosis}
-As a distribution parameter, kurtosis is a measure of the extent to which
-feature values cluster around a central point.  In other words, it quantifies
-``peakedness'' of the distribution: how tall and sharp the central peak is
-relative to a standard bell curve.
-
-Positive kurtosis (\emph{leptokurtic} distribution) indicates that, relative
-to a normal distribution:
-\begin{Itemize}
-\item observations cluster more about the center (peak-shaped),
-\item the tails are thinner at non-extreme values, 
-\item the tails are thicker at extreme values.
-\end{Itemize}
-Negative kurtosis (\emph{platykurtic} distribution) indicates that, relative
-to a normal distribution:
-\begin{Itemize}
-\item observations cluster less about the center (box-shaped),
-\item the tails are thicker at non-extreme values, 
-\item the tails are thinner at extreme values.
-\end{Itemize}
-Kurtosis of a normal distribution is zero; however, the sample kurtosis
-(computed here) is likely to deviate from zero.
-
-Sample kurtosis is computed as the $4^{\textrm{th}}$~central moment divided
-by the $4^{\textrm{th}}$~power of the standard deviation, minus~3.
-We estimate the $4^{\textrm{th}}$~central moment as the sum of the
-$4^{\textrm{th}}$~powers of differences between the values in the feature column
-and their sample mean, divided by the number of values:
-$\sum_{i=1}^n (v_i - \bar{v})^4 / n$
-where $\bar{v}=\left(\sum_{i=1}^n v_i\right)\!/n$.
-The standard deviation is computed as described above, see \emph{standard deviation}.
-
-Note that kurtosis is sensitive to outliers, and requires at least two different
-sample values.  Example: for sample
-$\{$2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8$\}$
-with the mean of~5.2 and the standard deviation of~1.8,
-sample kurtosis equals $16.6962\,{/}\,1.8^4 - 3 \approx -1.41$.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Standard error in kurtosis]
-\OutputRowText{\OutputRowIDStErrorCurtosis}
-A measure of how much the sample kurtosis may vary from sample to sample,
-assuming that the feature is normally distributed, which makes its
-distribution kurtosis equal~0.
-Given the number~$n$ of sample values, the standard error is computed as
-\begin{equation*}
-\sqrt{\frac{24n\,(n-1)^2}{(n-3)(n-2)(n+3)(n+5)}}
-\end{equation*}
-This measure can tell us, for example:
-\begin{Itemize}
-\item If the sample kurtosis lands within two standard errors from~0, its
-positive or negative sign is not significant and may just be accidental.
-\item If the sample kurtosis lands outside this interval, the feature
-is unlikely to be normally distributed.
-\end{Itemize}
-At least 4~values ($n\geq 4$) are required to avoid arithmetic failure.
-Note that the standard error is inaccurate if the feature distribution is
-far from normal or if the number of samples is small.
-\end{Description}
-
-
-\paragraph{Categorical measures.}  Statistics that describe the sample of
-a categorical feature, either nominal or ordinal.  We represent all
-categories by integers from~1 to the number of categories; we call
-these integers \emph{category~IDs}.
-\begin{Description}
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Number of categories]
-\OutputRowText{\OutputRowIDNumCategories}
-The maximum category~ID that occurs in the sample.  Note that some
-categories with~IDs \emph{smaller} than this maximum~ID may have
-no~occurrences in the sample, without reducing the number of categories.
-However, categories with~IDs \emph{larger} than this maximum~ID necessarily have
-no occurrences in the sample and are not counted.
-Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
-the number of categories is reported as~8.  Category~IDs 2 and~6, which have
-zero occurrences, are still counted; but if there is a category with
-ID${}=9$ and zero occurrences, it is not counted.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Mode]
-\OutputRowText{\OutputRowIDMode}
-The most frequently occurring category value.
-If several values share the greatest frequency of occurrence, then each
-of them is a mode; but here we report only the smallest of these modes.
-Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
-the modes are 3 and~7, with 3 reported.
-
-Computed by counting the number of occurrences for each category,
-then taking the smallest category~ID that has the maximum count.
-Note that the sample modes may be different from the distribution modes,
-i.e.\ the categories whose (hypothesized) underlying probability is the
-maximum over all categories.
-%%%%%%%%%%%%%%%%%%%% DESCRIPTIVE STATISTIC %%%%%%%%%%%%%%%%%%%%
-\item[\it Number of modes]
-\OutputRowText{\OutputRowIDNumModes}
-The number of category values that each have the largest frequency
-count in the sample.  
-Example: in sample $\{$1, 3, 3, 3, 3, 4, 4, 5, 7, 7, 7, 7, 8, 8, 8$\}$
-there are two category IDs (3 and~7) that occur the maximum count of 4~times;
-hence, we return~2.
-
-Computed by counting the number of occurrences for each category,
-then counting how many categories have the maximum count.
-Note that the sample modes may be different from the distribution modes,
-i.e.\ the categories whose (hypothesized) underlying probability is the
-maximum over all categories.
-\end{Description}
-
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-The output matrix containing all computed statistics is of size $17$~rows and
-as many columns as in the input matrix~\texttt{X}.  Each row corresponds to
-a particular statistic, according to the convention specified in
-Table~\ref{table:univars}.  The first $14$~statistics are applicable for
-\emph{scale} columns, and the last $3$~statistics are applicable for categorical,
-i.e.\ nominal and ordinal, columns.
-
-
-\pagebreak[2]
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-\hml -f \UnivarScriptName{} -nvargs X=/user/biadmin/X.mtx
-  TYPES=/user/biadmin/types.mtx
-  STATS=/user/biadmin/stats.mtx
-
-}
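
Several of the univariate statistics defined above (median, interquartile mean with the partial weights on the border quartiles, skewness, and excess kurtosis) are easy to check with a short NumPy sketch. This is not the Univar-Stats.dml implementation, but on the 10-point example it reproduces the values quoted in the text (mean 5.2, median 5.5, interquartile mean 5.31, standard deviation 1.8, skewness about -0.184, kurtosis about -1.41).

# Hypothetical NumPy sketch of a few univariate statistics defined above;
# not the Univar-Stats.dml implementation, only a check of the formulas.
import math
import numpy as np

def univariate_summary(values):
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    mean = v.mean()
    sd = v.std(ddof=1)                              # sample standard deviation (n-1)
    # Median: mean of the two middle values when n is even
    median = v[(n - 1) // 2] if n % 2 == 1 else 0.5 * (v[n // 2 - 1] + v[n // 2])
    # Interquartile mean: border quartiles v_j, v_k enter with reduced weights
    j, k = math.ceil(n / 4), math.ceil(3 * n / 4)   # 1-based quartile indices
    iq_mean = 2.0 * ((j / n - 0.25) * v[j - 1]
                     + v[j:k - 1].sum() / n
                     + (0.75 - (k - 1) / n) * v[k - 1])
    # Skewness and excess kurtosis via the 3rd and 4th central moments
    skew = np.mean((v - mean) ** 3) / sd ** 3
    kurt = np.mean((v - mean) ** 4) / sd ** 4 - 3.0
    return {"mean": mean, "median": median, "iq_mean": iq_mean,
            "std_dev": sd, "skewness": skew, "kurtosis": kurt}

sample = [2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8]
print(univariate_summary(sample))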


[34/50] [abbrv] incubator-systemml git commit: [SYSTEMML-259] Function with no return value not require lvalue

Posted by de...@apache.org.
[SYSTEMML-259] Function with no return value not require lvalue

If a user-defined function does not return a value, don't require
that the function is assigned to a variable since there is nothing
to assign. Add corresponding MLContext tests.

Closes #411.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/032bc376
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/032bc376
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/032bc376

Branch: refs/heads/gh-pages
Commit: 032bc376e70b2f45bf0b8495e2f1972a9d6d6d63
Parents: be4eaaf
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Mon Mar 6 15:26:37 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Mon Mar 6 15:26:37 2017 -0800

----------------------------------------------------------------------
 beginners-guide-to-dml-and-pydml.md | 1 -
 1 file changed, 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/032bc376/beginners-guide-to-dml-and-pydml.md
----------------------------------------------------------------------
diff --git a/beginners-guide-to-dml-and-pydml.md b/beginners-guide-to-dml-and-pydml.md
index e82909d..9d19cc8 100644
--- a/beginners-guide-to-dml-and-pydml.md
+++ b/beginners-guide-to-dml-and-pydml.md
@@ -641,7 +641,6 @@ parfor(i in 0:nrow(A)-1):
 
 Functions encapsulate useful functionality in SystemML. In addition to built-in functions, users can define their own functions.
 Functions take 0 or more parameters and return 0 or more values.
-Currently, if a function returns nothing, it still needs to be assigned to a variable.
 
 <div class="codetabs2">
 


[38/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1393] Exclude alg ref and lang ref dirs from doc site

Posted by de...@apache.org.
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/LinReg.tex
----------------------------------------------------------------------
diff --git a/alg-ref/LinReg.tex b/alg-ref/LinReg.tex
new file mode 100644
index 0000000..67273c2
--- /dev/null
+++ b/alg-ref/LinReg.tex
@@ -0,0 +1,328 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Linear Regression}
+\label{sec:LinReg}
+
+\noindent{\bf Description}
+\smallskip
+
+Linear Regression scripts are used to model the relationship between one numerical
+response variable and one or more explanatory (feature) variables.
+The scripts are given a dataset $(X, Y) = (x_i, y_i)_{i=1}^n$ where $x_i$ is a
+numerical vector of feature variables and $y_i$ is a numerical response value for
+each training data record.  The feature vectors are provided as a matrix $X$ of size
+$n\,{\times}\,m$, where $n$ is the number of records and $m$ is the number of features.
+The observed response values are provided as a 1-column matrix~$Y$, with a numerical
+value $y_i$ for each~$x_i$ in the corresponding row of matrix~$X$.
+
+In linear regression, we predict the distribution of the response~$y_i$ based on
+a fixed linear combination of the features in~$x_i$.  We assume that
+there exist constant regression coefficients $\beta_0, \beta_1, \ldots, \beta_m$
+and a constant residual variance~$\sigma^2$ such that
+\begin{equation}
+y_i \sim \Normal(\mu_i, \sigma^2) \,\,\,\,\textrm{where}\,\,\,\,
+\mu_i \,=\, \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m}
+\label{eqn:linregdef}
+\end{equation}
+Distribution $y_i \sim \Normal(\mu_i, \sigma^2)$ models the ``unexplained'' residual
+noise and is assumed independent across different records.
+
+The goal is to estimate the regression coefficients and the residual variance.
+Once they are accurately estimated, we can make predictions about $y_i$ given~$x_i$
+in new records.  We can also use the $\beta_j$'s to analyze the influence of individual
+features on the response value, and assess the quality of this model by comparing
+residual variance in the response, left after prediction, with its total variance.
+
+There are two scripts in our library, both doing the same estimation, but using different
+computational methods.  Depending on the size and the sparsity of the feature matrix~$X$,
+one or the other script may be more efficient.  The ``direct solve'' script
+{\tt LinearRegDS} is more efficient when the number of features $m$ is relatively small
+($m \sim 1000$ or less) and matrix~$X$ is either tall or fairly dense
+(has~${\gg}\:m^2$ nonzeros); otherwise, the ``conjugate gradient'' script {\tt LinearRegCG}
+is more efficient.  If $m > 50000$, use only {\tt LinearRegCG}.
+
+\smallskip
+\noindent{\bf Usage}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}LinearRegDS.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} Y=}path/file
+{\tt{} B=}path/file
+{\tt{} O=}path/file
+{\tt{} icpt=}int
+{\tt{} reg=}double
+{\tt{} fmt=}format
+
+}\smallskip
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}LinearRegCG.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} Y=}path/file
+{\tt{} B=}path/file
+{\tt{} O=}path/file
+{\tt{} Log=}path/file
+{\tt{} icpt=}int
+{\tt{} reg=}double
+{\tt{} tol=}double
+{\tt{} maxi=}int
+{\tt{} fmt=}format
+
+}
+
+\smallskip
+\noindent{\bf Arguments}
+\begin{Description}
+\item[{\tt X}:]
+Location (on HDFS) to read the matrix of feature vectors; each row constitutes
+one feature vector
+\item[{\tt Y}:]
+Location to read the 1-column matrix of response values
+\item[{\tt B}:]
+Location to store the estimated regression parameters (the $\beta_j$'s), with the
+intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available
+\item[{\tt O}:] (default:\mbox{ }{\tt " "})
+Location to store the CSV-file of summary statistics defined in
+Table~\ref{table:linreg:stats}, the default is to print it to the standard output
+\item[{\tt Log}:] (default:\mbox{ }{\tt " "}, {\tt LinearRegCG} only)
+Location to store iteration-specific variables for monitoring and debugging purposes,
+see Table~\ref{table:linreg:log} for details.
+\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
+Intercept presence and shifting/rescaling the features in~$X$:\\
+{\tt 0} = no intercept (hence no~$\beta_0$), no shifting or rescaling of the features;\\
+{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
+{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
+\item[{\tt reg}:] (default:\mbox{ }{\tt 0.000001})
+L2-regularization parameter~\mbox{$\lambda\geq 0$}; set to nonzero for highly dependent,
+sparse, or numerous ($m \gtrsim n/10$) features
+\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001}, {\tt LinearRegCG} only)
+Tolerance \mbox{$\eps\geq 0$} used in the convergence criterion: we terminate conjugate
+gradient iterations when the $\beta$-residual reduces in L2-norm by this factor
+\item[{\tt maxi}:] (default:\mbox{ }{\tt 0}, {\tt LinearRegCG} only)
+Maximum number of conjugate gradient iterations, or~0 if no maximum
+limit is provided
+\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+see read/write functions in SystemML Language Reference for details.
+\end{Description}
+
+
+\begin{table}[t]\small\centerline{%
+\begin{tabular}{|ll|}
+\hline
+Name & Meaning \\
+\hline
+{\tt AVG\_TOT\_Y}          & Average of the response value $Y$ \\
+{\tt STDEV\_TOT\_Y}        & Standard Deviation of the response value $Y$ \\
+{\tt AVG\_RES\_Y}          & Average of the residual $Y - \mathop{\mathrm{pred}}(Y|X)$, i.e.\ residual bias \\
+{\tt STDEV\_RES\_Y}        & Standard Deviation of the residual $Y - \mathop{\mathrm{pred}}(Y|X)$ \\
+{\tt DISPERSION}           & GLM-style dispersion, i.e.\ residual sum of squares / \#deg.\ fr. \\
+{\tt PLAIN\_R2}            & Plain $R^2$ of residual with bias included vs.\ total average \\
+{\tt ADJUSTED\_R2}         & Adjusted $R^2$ of residual with bias included vs.\ total average \\
+{\tt PLAIN\_R2\_NOBIAS}    & Plain $R^2$ of residual with bias subtracted vs.\ total average \\
+{\tt ADJUSTED\_R2\_NOBIAS} & Adjusted $R^2$ of residual with bias subtracted vs.\ total average \\
+{\tt PLAIN\_R2\_VS\_0}     & ${}^*$Plain $R^2$ of residual with bias included vs.\ zero constant \\
+{\tt ADJUSTED\_R2\_VS\_0}  & ${}^*$Adjusted $R^2$ of residual with bias included vs.\ zero constant \\
+\hline
+\multicolumn{2}{r}{${}^{*\mathstrut}$ The last two statistics are only printed if there is no intercept ({\tt icpt=0})} \\
+\end{tabular}}
+\caption{Besides~$\beta$, linear regression scripts compute a few summary statistics
+listed above.  The statistics are provided in CSV format, one comma-separated name-value
+pair per each line.}
+\label{table:linreg:stats}
+\end{table}
+
+\begin{table}[t]\small\centerline{%
+\begin{tabular}{|ll|}
+\hline
+Name & Meaning \\
+\hline
+{\tt CG\_RESIDUAL\_NORM}  & L2-norm of conjug.\ grad.\ residual, which is $A \pxp \beta - t(X) \pxp y$ \\
+                          & where $A = t(X) \pxp X + \diag (\lambda)$, or a similar quantity \\
+{\tt CG\_RESIDUAL\_RATIO} & Ratio of current L2-norm of conjug.\ grad.\ residual over the initial \\
+\hline
+\end{tabular}}
+\caption{
+The {\tt Log} file for {\tt{}LinearRegCG} script contains the above \mbox{per-}iteration
+variables in CSV format, each line containing triple (Name, Iteration\#, Value) with
+Iteration\# being~0 for initial values.}
+\label{table:linreg:log}
+\end{table}
+
+
+\noindent{\bf Details}
+\smallskip
+
+To solve a linear regression problem over feature matrix~$X$ and response vector~$Y$,
+we can find coefficients $\beta_0, \beta_1, \ldots, \beta_m$ and $\sigma^2$ that maximize
+the joint likelihood of all $y_i$ for $i=1\ldots n$, defined by the assumed statistical
+model~(\ref{eqn:linregdef}).  Since the joint likelihood of the independent
+$y_i \sim \Normal(\mu_i, \sigma^2)$ is proportional to the product of
+$\exp\big({-}\,(y_i - \mu_i)^2 / (2\sigma^2)\big)$, we can take the logarithm of this
+product, then multiply by $-2\sigma^2 < 0$ to obtain a least squares problem:
+\begin{equation}
+\sum_{i=1}^n \, (y_i - \mu_i)^2 \,\,=\,\, 
+\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2
+\,\,\to\,\,\min
+\label{eqn:linregls}
+\end{equation}
+This may not be enough, however.  The minimum may sometimes be attained over infinitely many
+$\beta$-vectors, for example if $X$ has an all-0 column, or has linearly dependent columns,
+or has fewer rows than columns~\mbox{($n < m$)}.  Even if~(\ref{eqn:linregls}) has a unique
+solution, other $\beta$-vectors may be just a little suboptimal\footnote{Smaller likelihood
+difference between two models suggests less statistical evidence to pick one model over the
+other.}, yet give significantly different predictions for new feature vectors.  This results
+in \emph{overfitting}: prediction error for the training data ($X$ and~$Y$) is much smaller
+than for the test data (new records).
+
+Overfitting and degeneracy in the data is commonly mitigated by adding a regularization penalty
+term to the least squares function:
+\begin{equation}
+\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2
+\,+\,\, \lambda \sum_{j=1}^m \beta_j^2
+\,\,\to\,\,\min
+\label{eqn:linreglsreg}
+\end{equation}
+The choice of $\lambda>0$, the regularization constant, typically involves cross-validation
+where the dataset is repeatedly split into a training part (to estimate the~$\beta_j$'s) and
+a test part (to evaluate prediction accuracy), with the goal of maximizing the test accuracy.
+In our scripts, $\lambda$~is provided as input parameter~{\tt reg}.
+
+The solution to least squares problem~(\ref{eqn:linreglsreg}), through taking the derivative
+and setting it to~0, has the matrix linear equation form
+\begin{equation}
+A\left[\textstyle\beta_{1:m}\atop\textstyle\beta_0\right] \,=\, \big[X,\,1\big]^T Y,\,\,\,
+\textrm{where}\,\,\,
+A \,=\, \big[X,\,1\big]^T \big[X,\,1\big]\,+\,\hspace{0.5pt} \diag(\hspace{0.5pt}
+\underbrace{\raisebox{0pt}[0pt][0.5pt]{$\lambda,\ldots, \lambda$}}_{\raisebox{2pt}{$\scriptstyle m$}}
+\hspace{0.5pt}, 0)
+\label{eqn:linregeq}
+\end{equation}
+where $[X,\,1]$ is $X$~with an extra column of~1s appended on the right, and the
+diagonal matrix of $\lambda$'s has a zero to keep the intercept~$\beta_0$ unregularized.
+If the intercept is disabled by setting {\tt icpt=0}, the equation is simply
+\mbox{$X^T X \beta = X^T Y$}.
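+
+For illustration, the direct-solver equation~(\ref{eqn:linregeq}) can be written as a
+short NumPy sketch (a simplified, hypothetical helper for illustration only, not the
+actual {\tt LinearRegDS.dml} implementation, and assuming dense in-memory data):
+\begin{verbatim}
+import numpy as np
+
+def linreg_ds(X, y, reg=1e-6, icpt=1):
+    n, m = X.shape
+    if icpt >= 1:                       # append a column of ones for the intercept
+        X = np.hstack([X, np.ones((n, 1))])
+    lam = np.full(X.shape[1], reg)
+    if icpt >= 1:
+        lam[-1] = 0.0                   # keep the intercept unregularized
+    A = X.T @ X + np.diag(lam)          # A = [X,1]^T [X,1] + diag(lambda,...,lambda,0)
+    beta = np.linalg.solve(A, X.T @ y)  # last entry is the intercept when icpt >= 1
+    return beta
+\end{verbatim}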
+
+We implemented two scripts for solving equation~(\ref{eqn:linregeq}): one is a ``direct solver''
+that computes $A$ and then solves $A\beta = [X,\,1]^T Y$ by calling an external package,
+the other performs linear conjugate gradient~(CG) iterations without ever materializing~$A$.
+The CG~algorithm closely follows Algorithm~5.2 in Chapter~5 of~\cite{Nocedal2006:Optimization}.
+Each step in the CG~algorithm computes a matrix-vector multiplication $q = Ap$ by first computing
+$[X,\,1]\, p$ and then $[X,\,1]^T [X,\,1]\, p$.  Usually the number of such multiplications,
+one per CG iteration, is much smaller than~$m$.  The user can put a hard bound on it with input 
+parameter~{\tt maxi}, or use the default maximum of~$m+1$ (or~$m$ if no intercept) by
+having {\tt maxi=0}.  The CG~iterations terminate when the L2-norm of vector
+$r = A\beta - [X,\,1]^T Y$ decreases from its initial value (for~$\beta=0$) by the tolerance
+factor specified in input parameter~{\tt tol}.
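+
+A minimal NumPy sketch of such a conjugate gradient loop (shown here for {\tt icpt=0},
+i.e.\ solving $(X^T X + \lambda I)\,\beta = X^T Y$ without ever forming the matrix; a
+simplified illustration with a hypothetical helper name, not the actual
+{\tt LinearRegCG.dml} implementation) is:
+\begin{verbatim}
+import numpy as np
+
+def linreg_cg(X, y, reg=1e-6, tol=1e-6, maxi=0):
+    m = X.shape[1]
+    max_iter = maxi if maxi > 0 else m  # default bound for the no-intercept case
+    b = X.T @ y
+    beta = np.zeros(m)
+    r = b.copy()                        # residual at beta = 0 (up to sign)
+    p = r.copy()
+    norm_r0 = np.linalg.norm(r)
+    rs_old = r @ r
+    for _ in range(max_iter):
+        q = X.T @ (X @ p) + reg * p     # q = A p via two matrix-vector products
+        alpha = rs_old / (p @ q)
+        beta += alpha * p
+        r -= alpha * q
+        rs_new = r @ r
+        if np.sqrt(rs_new) <= tol * norm_r0:
+            break                       # residual norm reduced by the tolerance factor
+        p = r + (rs_new / rs_old) * p
+        rs_old = rs_new
+    return beta
+\end{verbatim}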
+
+The CG algorithm is more efficient if computing
+$[X,\,1]^T \big([X,\,1]\, p\big)$ is much faster than materializing $A$,
+an $(m\,{+}\,1)\times(m\,{+}\,1)$ matrix.  The Direct Solver~(DS) is more efficient if
+$X$ takes up a lot more memory than $A$ (i.e.\ $X$~has a lot more nonzeros than~$m^2$)
+and if $m^2$ is small enough for the external solver ($m \lesssim 50000$).  A more precise
+determination between CG and~DS is subject to further research.
+
+In addition to the $\beta$-vector, the scripts estimate the residual standard
+deviation~$\sigma$ and the~$R^2$, the ratio of ``explained'' variance to the total
+variance of the response variable.  These statistics only make sense if the number
+of degrees of freedom $n\,{-}\,m\,{-}\,1$ is positive and the regularization constant
+$\lambda$ is negligible or zero.  The formulas for $\sigma$ and $R^2$~are:
+\begin{equation*}
+R^2_{\textrm{plain}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},\quad
+\sigma \,=\, \sqrt{\frac{\mathrm{RSS}}{n - m - 1}},\quad
+R^2_{\textrm{adj.}} = 1 - \frac{\sigma^2 (n-1)}{\mathrm{TSS}}
+\end{equation*}
+where
+\begin{equation*}
+\mathrm{RSS} \,=\, \sum_{i=1}^n \Big(y_i - \hat{\mu}_i - 
+\frac{1}{n} \sum_{i'=1}^n \,(y_{i'} - \hat{\mu}_{i'})\Big)^2; \quad
+\mathrm{TSS} \,=\, \sum_{i=1}^n \Big(y_i - \frac{1}{n} \sum_{i'=1}^n y_{i'}\Big)^2
+\end{equation*}
+Here $\hat{\mu}_i$ are the predicted means for $y_i$ based on the estimated
+regression coefficients and the feature vectors.  They may be biased when no
+intercept is present, hence the RSS formula subtracts the bias.
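+
+These statistics follow directly from the residuals; the following NumPy sketch (a
+simplified illustration of the formulas above, with a hypothetical helper name) assumes
+the predicted means $\hat{\mu}_i$ have already been computed:
+\begin{verbatim}
+import numpy as np
+
+def linreg_stats(y, mu_hat, m):
+    # y: response vector; mu_hat: predicted means; m: number of features
+    n = len(y)
+    res = y - mu_hat
+    rss = np.sum((res - res.mean()) ** 2)   # bias-subtracted residual sum of squares
+    tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
+    sigma = np.sqrt(rss / (n - m - 1))
+    r2_plain = 1.0 - rss / tss
+    r2_adjusted = 1.0 - sigma ** 2 * (n - 1) / tss
+    return sigma, r2_plain, r2_adjusted
+\end{verbatim}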
+
+Lastly, note that by choosing the input option {\tt icpt=2} the user can shift
+and rescale the columns of~$X$ to have zero mean and unit variance.
+This is particularly important when using regularization over highly imbalanced
+features, because regularization tends to penalize small-variance columns (which
+need large~$\beta_j$'s) more than large-variance columns (with small~$\beta_j$'s).
+At the end, the estimated regression coefficients are shifted and rescaled to
+apply to the original features.
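+
+One plausible form of this back-transformation is the following NumPy sketch (a
+hypothetical helper for illustration only, assuming each feature column was standardized
+to zero mean and unit variance and that the intercept is stored as the last coefficient):
+\begin{verbatim}
+import numpy as np
+
+def unscale_coefficients(beta_scaled, mu, s):
+    # beta_scaled = [b_1', ..., b_m', b_0'] estimated on standardized features
+    # mu, s: per-column means and standard deviations used for standardization
+    b = beta_scaled[:-1] / s                                  # undo the rescaling
+    b0 = beta_scaled[-1] - np.sum(beta_scaled[:-1] * mu / s)  # undo the shifting
+    return np.append(b, b0)
+\end{verbatim}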
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+The estimated regression coefficients (the $\hat{\beta}_j$'s) are populated into
+a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}''
+input argument.  What this matrix contains, and its size, depends on the input
+argument {\tt icpt}, which specifies the user's intercept and rescaling choice:
+\begin{Description}
+\item[{\tt icpt=0}:] No intercept, matrix~$B$ has size $m\,{\times}\,1$, with
+$B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to~$m = {}$ncol$(X)$.
+\item[{\tt icpt=1}:] There is intercept, but no shifting/rescaling of~$X$; matrix~$B$
+has size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from 1 to~$m$,
+and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept coefficient.
+\item[{\tt icpt=2}:] There is intercept, and the features in~$X$ are shifted to
+mean${} = 0$ and rescaled to variance${} = 1$; then there are two versions of
+the~$\hat{\beta}_j$'s, one for the original features and another for the
+shifted/rescaled features.  Now matrix~$B$ has size $(m\,{+}\,1) \times 2$, with
+$B[\cdot, 1]$ for the original features and $B[\cdot, 2]$ for the shifted/rescaled
+features, in the above format.  Note that $B[\cdot, 2]$ are iteratively estimated
+and $B[\cdot, 1]$ are obtained from $B[\cdot, 2]$ by complementary shifting and
+rescaling.
+\end{Description}
+The estimated summary statistics, including residual standard deviation~$\sigma$ and
+the~$R^2$, are printed out or sent into a file (if specified) in CSV format as
+defined in Table~\ref{table:linreg:stats}.  For conjugate gradient iterations,
+a log file with monitoring variables can also be made available, see
+Table~\ref{table:linreg:log}.
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+\hml -f LinearRegCG.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
+  B=/user/biadmin/B.mtx fmt=csv O=/user/biadmin/stats.csv
+  icpt=2 reg=1.0 tol=0.00000001 maxi=100 Log=/user/biadmin/log.csv
+
+}
+{\hangindent=\parindent\noindent\tt
+\hml -f LinearRegDS.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
+  B=/user/biadmin/B.mtx fmt=csv O=/user/biadmin/stats.csv
+  icpt=2 reg=1.0
+
+}
+
+% \smallskip
+% \noindent{\bf See Also}
+% \smallskip
+% 
+% In case of binary classification problems, please consider using L2-SVM or
+% binary logistic regression; for multiclass classification, use multiclass~SVM
+% or multinomial logistic regression.  For more complex distributions of the
+% response variable use the Generalized Linear Models script.

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/LogReg.tex
----------------------------------------------------------------------
diff --git a/alg-ref/LogReg.tex b/alg-ref/LogReg.tex
new file mode 100644
index 0000000..43d4e15
--- /dev/null
+++ b/alg-ref/LogReg.tex
@@ -0,0 +1,287 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Multinomial Logistic Regression}
+
+\noindent{\bf Description}
+\smallskip
+
+Our logistic regression script performs both binomial and multinomial logistic regression.
+The script is given a dataset $(X, Y)$ where matrix $X$ has $m$~columns and matrix $Y$ has
+one column; both $X$ and~$Y$ have $n$~rows.  The rows of $X$ and~$Y$ are viewed as a collection
+of records: $(X, Y) = (x_i, y_i)_{i=1}^n$ where $x_i$ is a numerical vector of explanatory
+(feature) variables and $y_i$ is a categorical response variable.
+Each row~$x_i$ in~$X$ has size~\mbox{$\dim x_i = m$}, while its corresponding $y_i$ is an
+integer that represents the observed response value for record~$i$.
+
+The goal of logistic regression is to learn a linear model over the feature vector
+$x_i$ that can be used to predict how likely each categorical label is expected to
+be observed as the actual~$y_i$.
+Note that logistic regression predicts more than a label: it predicts the probability
+for every possible label.  The binomial case allows only two possible labels, the
+multinomial case has no such restriction.
+
+Just as linear regression estimates the mean value $\mu_i$ of a numerical response
+variable, logistic regression does the same for category label probabilities.
+In linear regression, the mean of $y_i$ is estimated as a linear combination of the features:
+$\mu_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} = \beta_0 + x_i\beta_{1:m}$.
+In logistic regression, the
+label probability has to lie between 0 and~1, so a link function is applied to connect
+it to $\beta_0 + x_i\beta_{1:m}$.  If there are just two possible category labels, for example
+0~and~1, the logistic link looks as follows:
+\begin{equation*}
+\Prob[y_i\,{=}\,1\mid x_i; \beta] \,=\, 
+\frac{e^{\,\beta_0 + x_i\beta_{1:m}}}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}};
+\quad
+\Prob[y_i\,{=}\,0\mid x_i; \beta] \,=\, 
+\frac{1}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}}
+\end{equation*}
+Here category label~0 serves as the \emph{baseline}, and function
+$\exp(\beta_0 + x_i\beta_{1:m})$
+shows how likely we expect to see ``$y_i = 1$'' in comparison to the baseline.
+Like in a loaded coin, the predicted odds of seeing 1~versus~0 are
+$\exp(\beta_0 + x_i\beta_{1:m})$ to~1,
+with each feature $x_{i,j}$ multiplying its own factor $\exp(\beta_j x_{i,j})$ to the odds.
+Given a large collection of pairs $(x_i, y_i)$, $i=1\ldots n$, logistic regression seeks
+to find the $\beta_j$'s that maximize the product of probabilities
+\hbox{$\Prob[y_i\mid x_i; \beta]$}
+for actually observed $y_i$-labels (assuming no regularization).
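+
+For a single feature vector, these two probabilities can be evaluated with a few lines
+of NumPy (a simplified illustration of the logistic link above, not script code):
+\begin{verbatim}
+import numpy as np
+
+def binomial_probs(x, beta0, beta):
+    z = beta0 + x @ beta
+    p1 = np.exp(z) / (1.0 + np.exp(z))   # Prob[y = 1 | x]
+    return p1, 1.0 - p1                  # (Prob[y = 1], Prob[y = 0])
+\end{verbatim}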
+
+Multinomial logistic regression~\cite{Agresti2002:CDA} extends this link to $k \geq 3$ possible
+categories.  Again we identify one category as the baseline, for example the $k$-th category.
+Instead of a coin, here we have a loaded multisided die, one side per category.  Each non-baseline
+category $l = 1\ldots k\,{-}\,1$ has its own vector $(\beta_{0,l}, \beta_{1,l}, \ldots, \beta_{m,l})$
+of regression parameters with the intercept, making up a matrix $B$ of size
+$(m\,{+}\,1)\times(k\,{-}\,1)$.  The predicted odds of seeing non-baseline category~$l$ versus
+the baseline~$k$ are $\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)$
+to~1, and the predicted probabilities are:
+\begin{align}
+l < k:\quad\Prob[y_i\,{=}\,\makebox[0.5em][c]{$l$}\mid x_i; B] \,\,\,{=}\,\,\,&
+\frac{\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)}%
+{1 \,+\, \sum_{l'=1}^{k-1}\exp\big(\beta_{0,l'} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l'}\big)};
+\label{eqn:mlogreg:nonbaseprob}\\
+\Prob[y_i\,{=}\,\makebox[0.5em][c]{$k$}\mid x_i; B] \,\,\,{=}\,\,\,& \frac{1}%
+{1 \,+\, \sum_{l'=1}^{k-1}\exp\big(\beta_{0,l'} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l'}\big)}.
+\label{eqn:mlogreg:baseprob}
+\end{align}
+The goal of the regression is to estimate the parameter matrix~$B$ from the provided dataset
+$(X, Y) = (x_i, y_i)_{i=1}^n$ by maximizing the product of \hbox{$\Prob[y_i\mid x_i; B]$}
+over the observed labels~$y_i$.  Taking its logarithm, negating, and adding a regularization term
+gives us a minimization objective:
+\begin{equation}
+f(B; X, Y) \,\,=\,\,
+-\sum_{i=1}^n \,\log \Prob[y_i\mid x_i; B] \,+\,
+\frac{\lambda}{2} \sum_{j=1}^m \sum_{l=1}^{k-1} |\beta_{j,l}|^2
+\,\,\to\,\,\min
+\label{eqn:mlogreg:loss}
+\end{equation}
+The optional regularization term is added to mitigate overfitting and degeneracy in the data;
+to reduce bias, the intercepts $\beta_{0,l}$ are not regularized.  Once the~$\beta_{j,l}$'s
+are accurately estimated, we can make predictions about the category label~$y$ for a new
+feature vector~$x$ using Eqs.~(\ref{eqn:mlogreg:nonbaseprob}) and~(\ref{eqn:mlogreg:baseprob}).
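+
+A NumPy sketch of these prediction equations (a simplified illustration with a
+hypothetical helper name; it assumes $B$ is stored with one column per non-baseline
+category and the intercepts in the last row, as described under Returns below) is:
+\begin{verbatim}
+import numpy as np
+
+def multinomial_probs(X, B):
+    n = X.shape[0]
+    X1 = np.hstack([X, np.ones((n, 1))])          # append the intercept column
+    L = X1 @ B                                    # linear terms, non-baseline categories
+    L = np.hstack([L, np.zeros((n, 1))])          # baseline category contributes exp(0)=1
+    E = np.exp(L - L.max(axis=1, keepdims=True))  # subtract row max for stability
+    return E / E.sum(axis=1, keepdims=True)       # row i = Prob[y_i = l | x_i], l = 1..k
+\end{verbatim}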
+
+\smallskip
+\noindent{\bf Usage}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}MultiLogReg.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} Y=}path/file
+{\tt{} B=}path/file
+{\tt{} Log=}path/file
+{\tt{} icpt=}int
+{\tt{} reg=}double
+{\tt{} tol=}double
+{\tt{} moi=}int
+{\tt{} mii=}int
+{\tt{} fmt=}format
+
+}
+
+
+\smallskip
+\noindent{\bf Arguments}
+\begin{Description}
+\item[{\tt X}:]
+Location (on HDFS) to read the input matrix of feature vectors; each row constitutes
+one feature vector.
+\item[{\tt Y}:]
+Location to read the input one-column matrix of category labels that correspond to
+feature vectors in~{\tt X}.  Note the following:\\
+-- Each non-baseline category label must be a positive integer.\\
+-- If all labels are positive, the largest represents the baseline category.\\
+-- If non-positive labels such as $-1$ or~$0$ are present, then they represent the (same)
+baseline category and are converted to label $\max(\texttt{Y})\,{+}\,1$.
+\item[{\tt B}:]
+Location to store the matrix of estimated regression parameters (the $\beta_{j, l}$'s),
+with the intercept parameters~$\beta_{0, l}$ at position {\tt B[}$m\,{+}\,1$, $l${\tt ]}
+if available.  The size of {\tt B} is $(m\,{+}\,1)\times (k\,{-}\,1)$ with the intercepts
+or $m \times (k\,{-}\,1)$ without the intercepts, one column per non-baseline category
+and one row per feature.
+\item[{\tt Log}:] (default:\mbox{ }{\tt " "})
+Location to store iteration-specific variables for monitoring and debugging purposes,
+see Table~\ref{table:mlogreg:log} for details.
+\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
+Intercept and shifting/rescaling of the features in~$X$:\\
+{\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\
+{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
+{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
+\item[{\tt reg}:] (default:\mbox{ }{\tt 0.0})
+L2-regularization parameter (lambda)
+\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
+Tolerance (epsilon) used in the convergence criterion
+\item[{\tt moi}:] (default:\mbox{ }{\tt 100})
+Maximum number of outer (Fisher scoring) iterations
+\item[{\tt mii}:] (default:\mbox{ }{\tt 0})
+Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum
+limit provided
+\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+see read/write functions in SystemML Language Reference for details.
+\end{Description}
+
+
+\begin{table}[t]\small\centerline{%
+\begin{tabular}{|ll|}
+\hline
+Name & Meaning \\
+\hline
+{\tt LINEAR\_TERM\_MIN}  & The minimum value of $X \pxp B$, used to check for overflows \\
+{\tt LINEAR\_TERM\_MAX}  & The maximum value of $X \pxp B$, used to check for overflows \\
+{\tt NUM\_CG\_ITERS}     & Number of inner (Conj.\ Gradient) iterations in this outer iteration \\
+{\tt IS\_TRUST\_REACHED} & $1 = {}$trust region boundary was reached, $0 = {}$otherwise \\
+{\tt POINT\_STEP\_NORM}  & L2-norm of iteration step from old point (matrix $B$) to new point \\
+{\tt OBJECTIVE}          & The loss function we minimize (negative regularized log-likelihood) \\
+{\tt OBJ\_DROP\_REAL}    & Reduction in the objective during this iteration, actual value \\
+{\tt OBJ\_DROP\_PRED}    & Reduction in the objective predicted by a quadratic approximation \\
+{\tt OBJ\_DROP\_RATIO}   & Actual-to-predicted reduction ratio, used to update the trust region \\
+{\tt IS\_POINT\_UPDATED} & $1 = {}$new point accepted; $0 = {}$new point rejected, old point restored \\
+{\tt GRADIENT\_NORM}     & L2-norm of the loss function gradient (omitted if point is rejected) \\
+{\tt TRUST\_DELTA}       & Updated trust region size, the ``delta'' \\
+\hline
+\end{tabular}}
+\caption{
+The {\tt Log} file for multinomial logistic regression contains the above \mbox{per-}iteration
+variables in CSV format, each line containing triple (Name, Iteration\#, Value) with Iteration\#
+being~0 for initial values.}
+\label{table:mlogreg:log}
+\end{table}
+
+
+\noindent{\bf Details}
+\smallskip
+
+We estimate the logistic regression parameters via L2-regularized negative
+log-likelihood minimization~(\ref{eqn:mlogreg:loss}).
+The optimization method used in the script closely follows the trust region
+Newton method for logistic regression described in~\cite{Lin2008:logistic}.
+For convenience, let us make some changes in notation:
+\begin{Itemize}
+\item Convert the input vector of observed category labels into an indicator matrix $Y$
+of size $n \times k$ such that $Y_{i, l} = 1$ if the $i$-th category label is~$l$ and
+$Y_{i, l} = 0$ otherwise;
+\item Append an extra column of all ones, i.e.\ $(1, 1, \ldots, 1)^T$, as the
+$m\,{+}\,1$-st column to the feature matrix $X$ to represent the intercept;
+\item Append an all-zero column as the $k$-th column to $B$, the matrix of regression
+parameters, to represent the baseline category;
+\item Convert the regularization constant $\lambda$ into matrix $\Lambda$ of the same
+size as $B$, placing 0's into the $m\,{+}\,1$-st row to disable intercept regularization,
+and placing $\lambda$'s everywhere else.
+\end{Itemize}
+Now the ($n\,{\times}\,k$)-matrix of predicted probabilities given
+by (\ref{eqn:mlogreg:nonbaseprob}) and~(\ref{eqn:mlogreg:baseprob})
+and the objective function $f$ in~(\ref{eqn:mlogreg:loss}) have the matrix form
+\begin{align*}
+P \,\,&=\,\, \exp(XB) \,\,/\,\, \big(\exp(XB)\,1_{k\times k}\big)\\
+f \,\,&=\,\, - \,\,{\textstyle\sum} \,\,Y \cdot (X B)\, + \,
+{\textstyle\sum}\,\log\big(\exp(XB)\,1_{k\times 1}\big) \,+ \,
+(1/2)\,\, {\textstyle\sum} \,\,\Lambda \cdot B \cdot B
+\end{align*}
+where operations $\cdot\,$, $/$, $\exp$, and $\log$ are applied cellwise,
+and $\textstyle\sum$ denotes the sum of all cells in a matrix.
+The gradient of~$f$ with respect to~$B$ can be represented as a matrix too:
+\begin{equation*}
+\nabla f \,\,=\,\, X^T (P - Y) \,+\, \Lambda \cdot B
+\end{equation*}
+The Hessian $\mathcal{H}$ of~$f$ is a tensor, but, fortunately, the conjugate
+gradient inner loop of the trust region algorithm in~\cite{Lin2008:logistic}
+does not need to instantiate it.  We only need to multiply $\mathcal{H}$ by
+ordinary matrices of the same size as $B$ and $\nabla f$, and this can be done
+in matrix form:
+\begin{equation*}
+\mathcal{H}V \,\,=\,\, X^T \big( Q \,-\, P \cdot (Q\,1_{k\times k}) \big) \,+\,
+\Lambda \cdot V, \,\,\,\,\textrm{where}\,\,\,\,Q \,=\, P \cdot (XV)
+\end{equation*}
+At each Newton iteration (the \emph{outer} iteration) the minimization algorithm
+approximates the difference $\varDelta f(S; B) = f(B + S; X, Y) \,-\, f(B; X, Y)$
+attained in the objective function after a step $B \mapsto B\,{+}\,S$ by a
+second-degree formula
+\begin{equation*}
+\varDelta f(S; B) \,\,\,\approx\,\,\, (1/2)\,\,{\textstyle\sum}\,\,S \cdot \mathcal{H}S
+ \,+\, {\textstyle\sum}\,\,S\cdot \nabla f
+\end{equation*}
+This approximation is then minimized by trust-region conjugate gradient iterations
+(the \emph{inner} iterations) subject to the constraint $\|S\|_2 \leq \delta$.
+The trust region size $\delta$ is initialized as $0.5\sqrt{m}\,/ \max\nolimits_i \|x_i\|_2$
+and updated as described in~\cite{Lin2008:logistic}.
+Users can specify the maximum number of the outer and the inner iterations with
+input parameters {\tt moi} and {\tt mii}, respectively.  The iterative minimizer
+terminates successfully if $\|\nabla f\|_2 < \eps\,\|\nabla f_{B=0}\|_2$,
+where $\eps > 0$ is a tolerance supplied by the user via input parameter~{\tt tol}.
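+
+The two matrix formulas at the heart of this method, the gradient and the Hessian-vector
+product, translate almost literally into NumPy (a simplified illustration in the
+augmented notation above, with the intercept column appended to $X$ and the all-zero
+baseline column appended to $B$; the helper name is hypothetical):
+\begin{verbatim}
+import numpy as np
+
+def mlogreg_grad_and_hessvec(X, Y, B, Lam, V):
+    # X: n x (m+1), Y: n x k indicator matrix, B, Lam, V: (m+1) x k
+    Z = X @ B
+    E = np.exp(Z - Z.max(axis=1, keepdims=True))  # cellwise exp (stabilized)
+    P = E / E.sum(axis=1, keepdims=True)          # predicted probabilities
+    grad = X.T @ (P - Y) + Lam * B                # nabla f = X^T (P - Y) + Lambda . B
+    Q = P * (X @ V)                               # Q = P . (X V), cellwise product
+    hv = X.T @ (Q - P * Q.sum(axis=1, keepdims=True)) + Lam * V
+    return grad, hv
+\end{verbatim}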
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+The estimated regression parameters (the $\hat{\beta}_{j, l}$) are populated into
+a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}''
+input argument.  Only the non-baseline categories ($1\leq l \leq k\,{-}\,1$) have
+their $\hat{\beta}_{j, l}$ in the output; to add the baseline category, just append
+a column of zeros.  If {\tt icpt=0} in the input command line, no intercepts are used
+and {\tt B} has size $m\times (k\,{-}\,1)$; otherwise {\tt B} has size 
+$(m\,{+}\,1)\times (k\,{-}\,1)$
+and the intercepts are in the $m\,{+}\,1$-st row.  If {\tt icpt=2}, then initially
+the feature columns in~$X$ are shifted to mean${} = 0$ and rescaled to variance${} = 1$.
+After the iterations converge, the $\hat{\beta}_{j, l}$'s are rescaled and shifted
+to work with the original features.
+
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+\hml -f MultiLogReg.dml -nvargs X=/user/biadmin/X.mtx 
+  Y=/user/biadmin/Y.mtx B=/user/biadmin/B.mtx fmt=csv
+  icpt=2 reg=1.0 tol=0.0001 moi=100 mii=10 Log=/user/biadmin/log.csv
+
+}
+
+
+\smallskip
+\noindent{\bf References}
+\begin{itemize}
+\item A.~Agresti.
+\newblock {\em Categorical Data Analysis}.
+\newblock Wiley Series in Probability and Statistics. Wiley-Interscience,  second edition, 2002.
+\end{itemize}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/MultiSVM.tex
----------------------------------------------------------------------
diff --git a/alg-ref/MultiSVM.tex b/alg-ref/MultiSVM.tex
new file mode 100644
index 0000000..87880a9
--- /dev/null
+++ b/alg-ref/MultiSVM.tex
@@ -0,0 +1,174 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsubsection{Multi-class Support Vector Machines}
+\label{msvm}
+
+\noindent{\bf Description}
+
+Support Vector Machines are used to model the relationship between a categorical 
+dependent variable y and one or more explanatory variables denoted X. This 
+implementation supports dependent variables whose domain size is greater than
+or equal to 2 and hence is not restricted to binary class labels.
+\\
+
+\noindent{\bf Usage}
+
+\begin{tabbing}
+\texttt{-f} \textit{path}/\texttt{m-svm.dml -nvargs}
+\=\texttt{X=}\textit{path}/\textit{file} 
+  \texttt{Y=}\textit{path}/\textit{file}
+  \texttt{icpt=}\textit{int}\\
+\>\texttt{tol=}\textit{double} 
+  \texttt{reg=}\textit{double}
+  \texttt{maxiter=}\textit{int} 
+  \texttt{model=}\textit{path}/\textit{file}\\
+\>\texttt{Log=}\textit{path}/\textit{file}
+  \texttt{fmt=}\textit{csv}$\vert$\textit{text}
+\end{tabbing}
+
+\begin{tabbing}
+\texttt{-f} \textit{path}/\texttt{m-svm-predict.dml -nvargs}
+\=\texttt{X=}\textit{path}/\textit{file} 
+  \texttt{Y=}\textit{path}/\textit{file}
+  \texttt{icpt=}\textit{int}
+  \texttt{model=}\textit{path}/\textit{file}\\
+\>\texttt{scores=}\textit{path}/\textit{file}
+  \texttt{accuracy=}\textit{path}/\textit{file}\\
+\>\texttt{confusion=}\textit{path}/\textit{file}
+  \texttt{fmt=}\textit{csv}$\vert$\textit{text}
+\end{tabbing}
+
+\noindent{\bf Arguments}
+
+\begin{itemize}
+\item X: Location (on HDFS) containing the explanatory variables 
+in a matrix. Each row constitutes an example.
+\item Y: Location (on HDFS) containing a 1-column matrix specifying 
+the categorical dependent variable (label). Labels are assumed to be 
+contiguously numbered from 1 $\ldots$ \#classes. Note that this
+argument is optional for prediction.
+\item icpt (default: {\tt 0}): If set to 1 then a constant bias column
+is added to X.
+\item tol (default: {\tt 0.001}): Procedure terminates early if the reduction
+in objective function value is less than tolerance times the initial objective
+function value.
+\item reg (default: {\tt 1}): Regularization constant. See details to find 
+out where lambda appears in the objective function. If one were interested 
+in drawing an analogy with C-SVM, then C = 2/lambda. Usually, cross validation 
+is employed to determine the optimum value of lambda.
+\item maxiter (default: {\tt 100}): The maximum number of iterations.
+\item model: Location (on HDFS) that contains the learnt weights.
+\item Log: Location (on HDFS) to collect various metrics (e.g., objective 
+function value etc.) that depict progress across iterations while training.
+\item fmt (default: {\tt text}): Specifies the output format. Choice of 
+comma-separated values (csv) or sparse matrix (text).
+\item scores: Location (on HDFS) to store scores for a held-out test set.
+Note that this is an optional argument.
+\item accuracy: Location (on HDFS) to store the accuracy computed on a
+held-out test set. Note that this is an optional argument.
+\item confusion: Location (on HDFS) to store the confusion matrix
+computed using a held-out test set. Note that this is an optional
+argument.
+\end{itemize}
+
+\noindent{\bf Details}
+
+Support vector machines learn a classification function by solving the
+following optimization problem ($L_2$-SVM):
+\begin{eqnarray*}
+&\textrm{argmin}_w& \frac{\lambda}{2} ||w||_2^2 + \sum_i \xi_i^2\\
+&\textrm{subject to:}& y_i w^{\top} x_i \geq 1 - \xi_i ~ \forall i
+\end{eqnarray*}
+where $x_i$ is an example from the training set with its label given by $y_i$, 
+$w$ is the vector of parameters and $\lambda$ is the regularization constant 
+specified by the user.
+
+To extend the above formulation (binary class SVM) to the multiclass setting,
+one standard approach is to learn one binary class SVM per class that
+separates data belonging to that class from the rest of the training data
+(one-against-the-rest SVM; see B. Scholkopf et al., 1995).
+
+To account for the missing bias term, one may augment the data with a column
+of constants, which is achieved by setting the intercept argument to 1
+(C-J Hsieh et al., 2008).
+
+This implementation optimizes the primal directly (Chapelle, 2007). It uses
+nonlinear conjugate gradient descent to minimize the objective function,
+choosing step sizes by performing one-dimensional Newton minimization in
+the direction of the gradient.
+\\
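+
+A simplified NumPy illustration of the primal objective and its gradient for one binary
+(one-against-the-rest) subproblem, with labels in $\{+1,-1\}$, is shown below. This is a
+hypothetical helper for illustration only, not the actual {\tt m-svm.dml} implementation,
+which couples these quantities with nonlinear conjugate gradient and Newton step sizes.
+\begin{verbatim}
+import numpy as np
+
+def l2svm_objective_and_gradient(w, X, y, lam):
+    # f(w) = lambda/2 * ||w||^2 + sum_i max(0, 1 - y_i * w^T x_i)^2
+    hinge = np.maximum(1.0 - y * (X @ w), 0.0)
+    f = 0.5 * lam * (w @ w) + np.sum(hinge ** 2)
+    grad = lam * w - 2.0 * X.T @ (hinge * y)   # hinge is zero outside the active set
+    return f, grad
+\end{verbatim}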
+
+\noindent{\bf Returns}
+
+The learnt weights produced by m-svm.dml are populated into a matrix that 
+has as many columns as there are classes in the training data, and written
+to the file provided on HDFS (see model in the Arguments section). The number of rows
+in this matrix is ncol(X) if intercept was set to 0 during invocation and ncol(X) + 1
+otherwise. The bias terms, if used, are placed in the last row. Depending on what
+arguments are provided during invocation, m-svm-predict.dml may compute one or more
+of scores, accuracy and confusion matrix in the output format specified.
+\\
+
+%%\noindent{\bf See Also}
+%%
+%%In case of binary classification problems, please consider using a binary class classifier
+%%learning algorithm, e.g., binary class $L_2$-SVM (see Section \ref{l2svm}) or logistic regression
+%%(see Section \ref{logreg}). To model the relationship between a scalar dependent variable 
+%%y and one or more explanatory variables X, consider Linear Regression instead (see Section 
+%%\ref{linreg-solver} or Section \ref{linreg-iterative}).
+%%\\
+%%
+\noindent{\bf Examples}
+\begin{verbatim}
+hadoop jar SystemML.jar -f m-svm.dml -nvargs X=/user/biadmin/X.mtx 
+                                             Y=/user/biadmin/y.mtx 
+                                             icpt=0 tol=0.001
+                                             reg=1.0 maxiter=100 fmt=csv 
+                                             model=/user/biadmin/weights.csv
+                                             Log=/user/biadmin/Log.csv
+\end{verbatim}
+
+\begin{verbatim}
+hadoop jar SystemML.jar -f m-svm-predict.dml -nvargs X=/user/biadmin/X.mtx 
+                                                     Y=/user/biadmin/y.mtx 
+                                                     icpt=0 fmt=csv
+                                                     model=/user/biadmin/weights.csv
+                                                     scores=/user/biadmin/scores.csv
+                                                     accuracy=/user/biadmin/accuracy.csv
+                                                     confusion=/user/biadmin/confusion.csv
+\end{verbatim}
+
+\noindent{\bf References}
+
+\begin{itemize}
+\item W. T. Vetterling and B. P. Flannery. \newblock{\em Conjugate Gradient Methods in Multidimensions in 
+Numerical Recipes in C - The Art of Scientific Computing.} \newblock W. H. Press and S. A. Teukolsky
+(eds.), Cambridge University Press, 1992.
+\item J. Nocedal and  S. J. Wright. \newblock{\em Numerical Optimization.} \newblock Springer-Verlag, 1999.
+\item C-J Hsieh, K-W Chang, C-J Lin, S. S. Keerthi and S. Sundararajan. \newblock {\em A Dual Coordinate 
+Descent Method for Large-scale Linear SVM.} \newblock International Conference of Machine Learning
+(ICML), 2008.
+\item Olivier Chapelle. \newblock{\em Training a Support Vector Machine in the Primal.} \newblock Neural 
+Computation, 2007.
+\item B. Scholkopf, C. Burges and V. Vapnik. \newblock{\em Extracting Support Data for a Given Task.} \newblock International Conference on Knowledge Discovery and Data Mining (KDD), 1995.
+\end{itemize}
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/NaiveBayes.tex
----------------------------------------------------------------------
diff --git a/alg-ref/NaiveBayes.tex b/alg-ref/NaiveBayes.tex
new file mode 100644
index 0000000..b5f721d
--- /dev/null
+++ b/alg-ref/NaiveBayes.tex
@@ -0,0 +1,155 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Naive Bayes}
+\label{naive_bayes}
+
+\noindent{\bf Description}
+
+Naive Bayes is a very simple generative model used for classifying data. 
+This implementation learns a multinomial naive Bayes classifier which
+is applicable when all features are counts of categorical values.
+\\
+
+\noindent{\bf Usage}
+
+\begin{tabbing}
+\texttt{-f} \textit{path}/\texttt{naive-bayes.dml -nvargs} 
+\=\texttt{X=}\textit{path}/\textit{file} 
+  \texttt{Y=}\textit{path}/\textit{file} 
+  \texttt{laplace=}\textit{double}\\
+\>\texttt{prior=}\textit{path}/\textit{file}
+  \texttt{conditionals=}\textit{path}/\textit{file}\\
+\>\texttt{accuracy=}\textit{path}/\textit{file}
+  \texttt{fmt=}\textit{csv}$\vert$\textit{text}
+\end{tabbing}
+
+\begin{tabbing}
+\texttt{-f} \textit{path}/\texttt{naive-bayes-predict.dml -nvargs} 
+\=\texttt{X=}\textit{path}/\textit{file} 
+  \texttt{Y=}\textit{path}/\textit{file} 
+  \texttt{prior=}\textit{path}/\textit{file}\\
+\>\texttt{conditionals=}\textit{path}/\textit{file}
+  \texttt{fmt=}\textit{csv}$\vert$\textit{text}\\
+\>\texttt{accuracy=}\textit{path}/\textit{file}
+  \texttt{confusion=}\textit{path}/\textit{file}\\
+\>\texttt{probabilities=}\textit{path}/\textit{file}
+\end{tabbing}
+
+\noindent{\bf Arguments}
+
+\begin{itemize}
+\item X: Location (on HDFS) to read the matrix of feature vectors; 
+each row constitutes one feature vector.
+\item Y: Location (on HDFS) to read the one-column matrix of (categorical) 
+labels that correspond to feature vectors in X. Classes are assumed to be
+contiguously labeled beginning from 1. Note that this argument is optional
+for prediction.
+\item laplace (default: {\tt 1}): Laplace smoothing specified by the 
+user to avoid creation of 0 probabilities.
+\item prior: Location (on HDFS) that contains the class prior probabilities.
+\item conditionals: Location (on HDFS) that contains the class conditional
+feature distributions.
+\item fmt (default: {\tt text}): Specifies the output format. Choice of 
+comma-separated values (csv) or sparse matrix (text).
+\item probabilities: Location (on HDFS) to store class membership probabilities
+for a held-out test set. Note that this is an optional argument.
+\item accuracy: Location (on HDFS) to store the training accuracy during
+learning and the testing accuracy from a held-out test set during prediction.
+Note that this is an optional argument for prediction.
+\item confusion: Location (on HDFS) to store the confusion matrix
+computed using a held-out test set. Note that this is an optional
+argument.
+\end{itemize}
+
+\noindent{\bf Details}
+
+Naive Bayes is a very simple generative classification model. It posits that 
+given the class label, features can be generated independently of each other.
+More precisely, the (multinomial) naive Bayes model uses the following 
+equation to estimate the joint probability of a feature vector $x$ belonging 
+to class $y$:
+\begin{equation*}
+\text{Prob}(y, x) = \pi_y \prod_{i \in x} \theta_{iy}^{n(i,x)}
+\end{equation*}
+where $\pi_y$ denotes the prior probability of class $y$, $i$ denotes a feature
+present in $x$ with $n(i,x)$ denoting its count and $\theta_{iy}$ denotes the 
+class conditional probability of feature $i$ in class $y$. The usual 
+constraints hold on $\pi$ and $\theta$:
+\begin{eqnarray*}
+&& \pi_y \geq 0, ~ \sum_{y \in \mathcal{C}} \pi_y = 1\\
+\forall y \in \mathcal{C}: && \theta_{iy} \geq 0, ~ \sum_i \theta_{iy} = 1
+\end{eqnarray*}
+where $\mathcal{C}$ is the set of classes.
+
+Given a fully labeled training dataset, it is possible to learn a naive Bayes 
+model using simple counting (group-by aggregates). To compute the class conditional
+probabilities, it is usually advisable to avoid setting $\theta_{iy}$ to 0. One way to
+achieve this is additive (Laplace) smoothing. Some authors have argued that the
+smoothing constant should in fact be one (add-one smoothing). This implementation uses
+add-one smoothing by default but lets the user specify their own constant, if required.
+
+This implementation is sometimes referred to as \emph{multinomial} naive Bayes. Other
+flavours of naive Bayes are also popular.
+\\
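+
+The counting-based estimation with add-one smoothing described above can be sketched in
+a few lines of NumPy (a simplified, hypothetical helper for illustration only, not the
+actual {\tt naive-bayes.dml} implementation):
+\begin{verbatim}
+import numpy as np
+
+def naive_bayes_train(X, y, num_classes, laplace=1.0):
+    # X: n x d matrix of feature counts; y: labels in 1..num_classes
+    n, d = X.shape
+    prior = np.zeros(num_classes)
+    conditionals = np.zeros((num_classes, d))
+    for c in range(1, num_classes + 1):
+        rows = (y == c)
+        prior[c - 1] = rows.sum() / n
+        counts = X[rows].sum(axis=0) + laplace       # Laplace (add-one) smoothing
+        conditionals[c - 1] = counts / counts.sum()  # theta_{i,c}, sums to 1 per class
+    return prior, conditionals
+\end{verbatim}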
+
+\noindent{\bf Returns}
+
+The learnt model produced by naive-bayes.dml is stored in two separate files. 
+The first file stores the class prior (a single-column matrix). The second file 
+stores the class conditional probabilities organized into a matrix with as many 
+rows as there are class labels and as many columns as there are features. 
+Depending on what arguments are provided during invocation, naive-bayes-predict.dml 
+may compute one or more of probabilities, accuracy and confusion matrix in the 
+output format specified. 
+\\
+
+\noindent{\bf Examples}
+
+\begin{verbatim}
+hadoop jar SystemML.jar -f naive-bayes.dml -nvargs 
+                           X=/user/biadmin/X.mtx 
+                           Y=/user/biadmin/y.mtx 
+                           laplace=1 fmt=csv
+                           prior=/user/biadmin/prior.csv
+                           conditionals=/user/biadmin/conditionals.csv
+                           accuracy=/user/biadmin/accuracy.csv
+\end{verbatim}
+
+\begin{verbatim}
+hadoop jar SystemML.jar -f naive-bayes-predict.dml -nvargs 
+                           X=/user/biadmin/X.mtx 
+                           Y=/user/biadmin/y.mtx 
+                           prior=/user/biadmin/prior.csv
+                           conditionals=/user/biadmin/conditionals.csv
+                           fmt=csv
+                           accuracy=/user/biadmin/accuracy.csv
+                           probabilities=/user/biadmin/probabilities.csv
+                           confusion=/user/biadmin/confusion.csv
+\end{verbatim}
+
+\noindent{\bf References}
+
+\begin{itemize}
+\item S. Russell and P. Norvig. \newblock{\em Artificial Intelligence: A Modern Approach.} Prentice Hall, 2009.
+\item A. McCallum and K. Nigam. \newblock{\em A comparison of event models for naive bayes text classification.} 
+\newblock AAAI-98 workshop on learning for text categorization, 1998.
+\end{itemize}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/PCA.tex
----------------------------------------------------------------------
diff --git a/alg-ref/PCA.tex b/alg-ref/PCA.tex
new file mode 100644
index 0000000..cef750e
--- /dev/null
+++ b/alg-ref/PCA.tex
@@ -0,0 +1,142 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Principal Component Analysis}
+\label{pca}
+
+\noindent{\bf Description}
+
+Principal Component Analysis (PCA) is a simple, non-parametric method to transform the given data set with possibly correlated columns into a set of linearly uncorrelated or orthogonal columns, called {\em principal components}. The principal components are ordered in such a way that the first component accounts for the largest possible variance, followed by remaining principal components in the decreasing order of the amount of variance captured from the data. PCA is often used as a dimensionality reduction technique, where the original data is projected or rotated onto a low-dimensional space with basis vectors defined by top-$K$ (for a given value of $K$) principal components.
+\\
+
+\noindent{\bf Usage}
+
+\begin{tabbing}
+\texttt{-f} \textit{path}/\texttt{PCA.dml -nvargs} 
+\=\texttt{INPUT=}\textit{path}/\textit{file} 
+  \texttt{K=}\textit{int} \\
+\>\texttt{CENTER=}\textit{0/1}
+  \texttt{SCALE=}\textit{0/1}\\
+\>\texttt{PROJDATA=}\textit{0/1}
+  \texttt{OFMT=}\textit{csv}/\textit{text}\\
+\>\texttt{MODEL=}\textit{path}$\vert$\textit{file}
+  \texttt{OUTPUT=}\textit{path}/\textit{file}
+\end{tabbing}
+
+\noindent{\bf Arguments}
+
+\begin{itemize}
+\item INPUT: Location (on HDFS) to read the input matrix.
+\item K: Indicates the dimension of the new vector space constructed from the top-$K$ principal components. It must be a value between $1$ and the number of columns in the input data.
+\item CENTER (default: {\tt 0}): Indicates whether or not to {\em center} input data prior to the computation of principal components.
+\item SCALE (default: {\tt 0}): Indicates whether or not to {\em scale} input data prior to the computation of principal components.
+\item PROJDATA: Indicates whether or not the input data must be projected onto the new vector space defined by the principal components.
+\item OFMT (default: {\tt csv}): Specifies the output format. Choice of comma-separated values (csv) or sparse matrix (text).
+\item MODEL: Either the location (on HDFS) where the computed model is stored; or the location of an existing model.
+\item OUTPUT: Location (on HDFS) to store the data rotated on to the new vector space.
+\end{itemize}
+
+\noindent{\bf Details}
+
+Principal Component Analysis (PCA) is a non-parametric procedure for orthogonal linear transformation of the input data to a new coordinate system, such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. In other words, PCA first selects a normalized direction in $m$-dimensional space ($m$ is the number of columns in the input data) along which the variance in the input data is maximized -- this is referred to as the first principal component. It then repeatedly finds other directions (principal components) in which the variance is maximized. At every step, PCA restricts the search to only those directions that are perpendicular to all previously selected directions. By doing so, PCA aims to reduce the redundancy among input variables. To understand the notion of redundancy, consider an extreme scenario with a data set comprising two variables, where the first one denotes some quantity expressed in meters, and the other variable represents the same quantity but in inches. Both these variables evidently capture redundant information, and hence one of them can be removed. In a general scenario, keeping solely the linear combinations of input variables would both express the data more concisely and reduce the number of variables. This is why PCA is often used as a dimensionality reduction technique.
+
+The specific method to compute such a new coordinate system is as follows -- compute a covariance matrix $C$ that measures the strength of correlation among all pairs of variables in the input data; factorize $C$ via eigendecomposition to calculate its eigenvalues and eigenvectors; and finally, order the eigenvectors in the decreasing order of their corresponding eigenvalues. The computed eigenvectors (also known as {\em loadings}) define the new coordinate system, the eigenvalues give the amount of variance in the input data explained by each coordinate or eigenvector, and their square roots give the corresponding standard deviations. 
+\\
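+
+The covariance/eigendecomposition route described above can be sketched in NumPy as
+follows (a simplified, hypothetical helper for illustration only, not the actual
+{\tt PCA.dml} implementation):
+\begin{verbatim}
+import numpy as np
+
+def pca(A, K, center=True, scale=False):
+    if center:
+        A = A - A.mean(axis=0)
+    if scale:
+        A = A / A.std(axis=0, ddof=1)
+    C = np.cov(A, rowvar=False)              # m x m covariance matrix
+    eigvals, eigvecs = np.linalg.eigh(C)     # eigendecomposition (C is symmetric)
+    order = np.argsort(eigvals)[::-1][:K]    # decreasing order of eigenvalue
+    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
+    projected = A @ eigvecs                  # data rotated onto the top-K components
+    return eigvecs, eigvals, np.sqrt(eigvals), projected
+\end{verbatim}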
+
+%As an example, consider the data in Table~\ref{tab:pca_data}. 
+\begin{comment}
+\begin{table}
+\parbox{.35\linewidth}{
+\centering
+\begin{tabular}{cc}
+  \hline
+  x & y \\
+  \hline
+  2.5 & 2.4  \\
+  0.5 & 0.7  \\
+  2.2 & 2.9  \\
+  1.9 & 2.2  \\
+  3.1 & 3.0  \\
+  2.3 & 2.7  \\
+  2 & 1.6  \\
+  1 & 1.1  \\
+  1.5 & 1.6  \\
+  1.1 & 0.9  \\
+	\hline
+\end{tabular}
+\caption{Input Data}
+\label{tab:pca_data}
+}
+\hfill
+\parbox{.55\linewidth}{
+\centering
+\begin{tabular}{cc}
+  \hline
+  x & y \\
+  \hline
+  .69  & .49  \\
+  -1.31  & -1.21  \\
+  .39  & .99  \\
+  .09  & .29  \\
+  1.29  & 1.09  \\
+  .49  & .79  \\
+  .19  & -.31  \\
+  -.81  & -.81  \\
+  -.31  & -.31  \\
+  -.71  & -1.01  \\
+  \hline
+\end{tabular}
+\caption{Data after centering and scaling}
+\label{tab:pca_scaled_data}
+}
+\end{table}
+\end{comment}
+
+\noindent{\bf Returns}
+When MODEL is not provided, the PCA procedure is applied to the INPUT data to generate the MODEL as well as the rotated data OUTPUT (if PROJDATA is set to $1$) in the new coordinate system. 
+The produced model consists of the basis vectors MODEL$/dominant.eigen.vectors$ for the new coordinate system; the eigenvalues MODEL$/dominant.eigen.values$; and the standard deviations MODEL$/dominant.eigen.standard.deviations$ of the principal components.
+When MODEL is provided, the INPUT data is rotated according to the coordinate system defined by MODEL$/dominant.eigen.vectors$. The resulting data is stored at location OUTPUT.
+\\
+
+\noindent{\bf Examples}
+
+\begin{verbatim}
+hadoop jar SystemML.jar -f PCA.dml -nvargs 
+            INPUT=/user/biuser/input.mtx  K=10
+            CENTER=1  SCALE=1
+            OFMT=csv PROJDATA=1
+				    # location to store model and rotated data
+            OUTPUT=/user/biuser/pca_output/   
+\end{verbatim}
+
+\begin{verbatim}
+hadoop jar SystemML.jar -f PCA.dml -nvargs 
+            INPUT=/user/biuser/test_input.mtx  K=10
+            CENTER=1  SCALE=1
+            OFMT=csv PROJDATA=1
+				    # location of an existing model
+            MODEL=/user/biuser/pca_output/       
+				    # location of rotated data
+            OUTPUT=/user/biuser/test_output.mtx  
+\end{verbatim}
+
+
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/RandomForest.tex
----------------------------------------------------------------------
diff --git a/alg-ref/RandomForest.tex b/alg-ref/RandomForest.tex
new file mode 100644
index 0000000..f9b47f3
--- /dev/null
+++ b/alg-ref/RandomForest.tex
@@ -0,0 +1,215 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Random Forests}
+\label{random_forests}
+
+\noindent{\bf Description}
+\smallskip
+
+
+Random forest is one of the most successful machine learning methods for classification and regression. 
+It is an ensemble learning method that creates a model composed of a set of tree models.
+This implementation is well-suited to handle large-scale data and builds a random forest model for classification in parallel.\\
+
+
+\smallskip
+\noindent{\bf Usage}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+	{\tt{}-f }path/\/{\tt{}random-forest.dml}
+	{\tt{} -nvargs}
+	{\tt{} X=}path/file
+	{\tt{} Y=}path/file
+	{\tt{} R=}path/file
+	{\tt{} bins=}integer
+	{\tt{} depth=}integer
+	{\tt{} num\_leaf=}integer
+	{\tt{} num\_samples=}integer
+	{\tt{} num\_trees=}integer
+	{\tt{} subsamp\_rate=}double
+	{\tt{} feature\_subset=}double
+	{\tt{} impurity=}Gini$\mid$entropy
+	{\tt{} M=}path/file
+	{\tt{} C=}path/file
+	{\tt{} S\_map=}path/file
+	{\tt{} C\_map=}path/file
+	{\tt{} fmt=}format
+	
+}
+
+ \smallskip
+ \noindent{\bf Usage: Prediction}
+ \smallskip
+ 
+ {\hangindent=\parindent\noindent\it%
+ 	{\tt{}-f }path/\/{\tt{}random-forest-predict.dml}
+ 	{\tt{} -nvargs}
+ 	{\tt{} X=}path/file
+ 	{\tt{} Y=}path/file
+ 	{\tt{} R=}path/file
+ 	{\tt{} M=}path/file
+ 	{\tt{} C=}path/file
+ 	{\tt{} P=}path/file
+ 	{\tt{} A=}path/file
+ 	{\tt{} OOB=}path/file
+ 	{\tt{} CM=}path/file
+ 	{\tt{} fmt=}format
+ 	
+ }\smallskip
+ 
+ 
+\noindent{\bf Arguments}
+\begin{Description}
+	\item[{\tt X}:]
+	Location (on HDFS) to read the matrix of feature vectors; 
+	each row constitutes one feature vector. Note that categorical features in $X$ need to be both recoded and dummy coded.
+	\item[{\tt Y}:]
+	Location (on HDFS) to read the matrix of (categorical) 
+	labels that correspond to feature vectors in $X$. Note that classes are assumed to be both recoded and dummy coded. 
+	This argument is optional for prediction. 
+	\item[{\tt R}:] (default:\mbox{ }{\tt " "})
+	Location (on HDFS) to read matrix $R$ which for each feature in $X$ contains column-ids (first column), start indices (second column), and end indices (third column).
+	If $R$ is not provided, by default all features are assumed to be continuous-valued.   
+	\item[{\tt bins}:] (default:\mbox{ }{\tt 20})
+	Number of thresholds to choose for each continuous-valued feature (determined by equi-height binning). 
+	\item[{\tt depth}:] (default:\mbox{ }{\tt 25})
+	Maximum depth of the learned trees in the random forest model
+	\item[{\tt num\_leaf}:] (default:\mbox{ }{\tt 10})
+	Parameter that controls pruning. The tree
+	is not expanded if a node receives less than {\tt num\_leaf} training examples.
+	\item[{\tt num\_samples}:] (default:\mbox{ }{\tt 3000})
+	Parameter that decides when to switch to in-memory building of the subtrees in each tree of the random forest model. 
+	If a node $v$ receives less than {\tt num\_samples}
+	training examples then this implementation switches to an in-memory subtree
+	building procedure to build the subtree under $v$ in its entirety.
+	\item[{\tt num\_trees}:] (default:\mbox{ }{\tt 10})
+	Number of trees to be learned in the random forest model
+	\item[{\tt subsamp\_rate}:] (default:\mbox{ }{\tt 1.0})
+	Parameter controlling the size of each tree in the random forest model; samples are selected from a Poisson distribution with parameter {\tt subsamp\_rate}.
+	\item[{\tt feature\_subset}:] (default:\mbox{ }{\tt 0.5})
+	Parameter that controls the number of features used as candidates for splitting at each tree node, specified as a power of the number of features in the data, i.e., assuming the training set has $D$ features, $D^{\tt feature\_subset}$ are used at each tree node.
+	\item[{\tt impurity}:] (default:\mbox{ }{\tt "Gini"})
+	Impurity measure used at internal nodes of the trees in the random forest model for selecting which features to split on. Possible values are entropy or Gini.
+	\item[{\tt M}:] 
+	Location (on HDFS) to write matrix $M$ containing the learned random forest (see Section~\ref{sec:decision_trees} and below for the schema) 
+	\item[{\tt C}:] (default:\mbox{ }{\tt " "})
+	Location (on HDFS) to store the number of counts (generated according to a Poisson distribution with parameter {\tt subsamp\_rate}) for each feature vector. Note that this argument is optional. If Out-Of-Bag (OOB) error estimate needs to be computed this parameter is passed as input to {\tt random-forest-predict.dml}. 
+	\item[{\tt A}:] (default:\mbox{ }{\tt " "})
+	Location (on HDFS) to store the testing accuracy (\%) from a 
+	held-out test set during prediction. Note that this argument is optional.
+	\item[{\tt OOB}:] (default:\mbox{ }{\tt " "})
+	Location (on HDFS) to store the Out-Of-Bag (OOB) error estimate of the training set. Note that the matrix of sample counts (stored at {\tt C}) needs to be provided for computing OOB error estimate. Note that this argument is optional.
+	\item[{\tt P}:] 
+	Location (on HDFS) to store predictions for a held-out test set
+	\item[{\tt CM}:] (default:\mbox{ }{\tt " "})
+	Location (on HDFS) to store the confusion matrix computed using a held-out test set. Note that this argument is optional.
+	\item[{\tt S\_map}:] (default:\mbox{ }{\tt " "})
+	Location (on HDFS) to write the mappings from the continuous-valued feature-ids to the global feature-ids in $X$ (see below for details). Note that this argument is optional.
+	\item[{\tt C\_map}:] (default:\mbox{ }{\tt " "})
+	Location (on HDFS) to write the mappings from the categorical feature-ids to the global feature-ids in $X$ (see below for details). Note that this argument is optional.
+	\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+	Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+	see read/write functions in SystemML Language Reference for details.
+\end{Description}
+
+
+ \noindent{\bf Details}
+ \smallskip
+
+Random forests~\cite{Breiman01:rforest} are learning algorithms for ensembles of decision trees. 
+The main idea is to build a number of decision trees on bootstrapped training samples, i.e., by repeatedly taking samples from a (single) training set. 
+Moreover, instead of considering all the features when building the trees, only a random subset of the features---typically $\approx \sqrt{D}$, where $D$ is the number of features---is chosen each time a split test at a tree node is performed. 
+This procedure {\it decorrelates} the trees and makes the ensemble less prone to overfitting. 
+To build decision trees we utilize the techniques discussed in Section~\ref{sec:decision_trees} proposed in~\cite{PandaHBB09:dtree}; 
+the implementation details are similar to those of the decision trees script.
+Below we review some features of our implementation which differ from {\tt decision-tree.dml}.
+
+
+\textbf{Bootstrapped sampling.} 
+Each decision tree is fitted to a bootstrapped training set sampled with replacement (WR).  
+To improve efficiency, we generate $N$ sample counts according to a Poisson distribution with parameter {\tt subsamp\_rate},
+where $N$ denotes the total number of training points.
+These sample counts approximate WR sampling when $N$ is large enough and are generated upfront for each decision tree.
+
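To make the count-based approximation concrete, here is a minimal NumPy sketch of generating the per-record Poisson sample counts for one tree; the variable names and the rate value of 1.0 are illustrative assumptions and are not part of {\tt random-forest.dml}:

    import numpy as np

    N = 1000                       # total number of training points
    subsamp_rate = 1.0             # assumed Poisson rate (expected fraction of data per tree)

    rng = np.random.default_rng(7)
    # counts[i] = how many times training point i appears in this tree's bootstrapped sample
    counts = rng.poisson(lam=subsamp_rate, size=N)

    oob = (counts == 0)            # points never sampled are Out-Of-Bag for this tree
    print(counts.sum(), oob.mean())   # roughly N sampled records; roughly exp(-1) ~ 0.37 OOB fraction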
+
+\textbf{Bagging.}
+Decision trees suffer from {\it high variance}, resulting in different models whenever trained on random subsets of the data points.  
+{\it Bagging} is a general-purpose method to reduce the variance of a statistical learning method like decision trees.
+In the context of decision trees (for classification), for a given test feature vector 
+the prediction is computed by taking a {\it majority vote}: the overall prediction is the most commonly occurring class among all the tree predictions.
+
+ 
+\textbf{Out-Of-Bag error estimation.} 
+Note that each bagged tree in a random forest model is trained on a subset (around $\frac{2}{3}$) of the observations (i.e., feature vectors).
+The remaining ($\frac{1}{3}$ of the) observations not used for training are called the {\it Out-Of-Bag} (OOB) observations. 
+This gives us a straightforward way to estimate the test error: to predict the class label of each test observation $i$ we use the trees in which $i$ was OOB.
+Our {\tt random-forest-predict.dml} script provides the OOB error estimate for a given training set if requested.  
+
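To make the majority vote and the OOB computation concrete, the following is a minimal NumPy sketch; it assumes per-tree class predictions and the Poisson sample counts from above are available, and the names are illustrative rather than part of the DML scripts:

    import numpy as np

    def majority_vote(preds):
        # most frequently predicted class among the given tree predictions
        values, freq = np.unique(preds, return_counts=True)
        return values[np.argmax(freq)]

    def oob_error(tree_preds, counts, y):
        # tree_preds[t, i]: class predicted by tree t for observation i
        # counts[t, i]:     Poisson sample count of observation i in tree t (0 => i is OOB for t)
        n, errors = len(y), 0
        for i in range(n):
            oob_trees = np.where(counts[:, i] == 0)[0]   # trees that never saw observation i
            if oob_trees.size == 0:
                continue                                  # no OOB vote available for i
            if majority_vote(tree_preds[oob_trees, i]) != y[i]:
                errors += 1
        return errors / n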
+
+\textbf{Description of the model.} 
+Similar to decision trees, the learned random forest model is presented in a matrix $M$  with at least 7 rows.
+The information stored in the model is similar to that of decision trees with the difference that the tree-ids are stored
+in the second row and rows $2,3,\ldots$ from the decision tree model are shifted by one. See Section~\ref{sec:decision_trees} for a description of the model.
+
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+
+The matrix corresponding to the learned model is written to a file in the specified format. See Section~\ref{sec:decision_trees}, where the structure of the model matrix is described in detail.
+Similar to {\tt decision-tree.dml}, $X$ is split into $X_\text{cont}$ and $X_\text{cat}$. 
+If requested, the mappings of the continuous feature-ids in $X_\text{cont}$ (stored at {\tt S\_map}) as well as the categorical feature-ids in $X_\text{cat}$ (stored at {\tt C\_map}) to the global feature-ids in $X$ will be provided. 
+The {\tt random-forest-predict.dml} script may compute one or more of
+predictions, accuracy, confusion matrix, and OOB error estimate in the requested output format depending on the input arguments used. 
+ 
+
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+	\hml -f random-forest.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
+	R=/user/biadmin/R.csv M=/user/biadmin/model.csv
+	bins=20 depth=25 num\_leaf=10 num\_samples=3000 num\_trees=10 impurity=Gini fmt=csv
+	
+}\smallskip
+
+
+\noindent To compute predictions:
+
+{\hangindent=\parindent\noindent\tt
+	\hml -f random-forest-predict.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx R=/user/biadmin/R.csv
+	M=/user/biadmin/model.csv P=/user/biadmin/predictions.csv
+	A=/user/biadmin/accuracy.csv CM=/user/biadmin/confusion.csv fmt=csv
+	
+}\smallskip
+
+
+%\noindent{\bf References}
+%
+%\begin{itemize}
+%\item B. Panda, J. Herbach, S. Basu, and R. Bayardo. \newblock{PLANET: massively parallel learning of tree ensembles with MapReduce}. In Proceedings of the VLDB Endowment, 2009.
+%\item L. Breiman. \newblock{Random Forests}. Machine Learning, 45(1), 5--32, 2001.
+%\end{itemize}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/StepGLM.tex
----------------------------------------------------------------------
diff --git a/alg-ref/StepGLM.tex b/alg-ref/StepGLM.tex
new file mode 100644
index 0000000..3869990
--- /dev/null
+++ b/alg-ref/StepGLM.tex
@@ -0,0 +1,132 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Stepwise Generalized Linear Regression}
+
+\noindent{\bf Description}
+\smallskip
+
+Our stepwise generalized linear regression script selects a model based on the Akaike information criterion (AIC): the model that gives rise to the lowest AIC is provided. Note that currently only the Bernoulli distribution family is supported (see below for details). \\
+
+\smallskip
+\noindent{\bf Usage}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}StepGLM.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} Y=}path/file
+{\tt{} B=}path/file
+{\tt{} S=}path/file
+{\tt{} O=}path/file
+{\tt{} link=}int
+{\tt{} yneg=}double
+{\tt{} icpt=}int
+{\tt{} tol=}double
+{\tt{} disp=}double
+{\tt{} moi=}int
+{\tt{} mii=}int
+{\tt{} thr=}double
+{\tt{} fmt=}format
+
+}
+
+
+\smallskip
+\noindent{\bf Arguments}
+\begin{Description}
+	\item[{\tt X}:]
+	Location (on HDFS) to read the matrix of feature vectors; each row is
+	an example.
+	\item[{\tt Y}:]
+	Location (on HDFS) to read the response matrix, which may have 1 or 2 columns
+	\item[{\tt B}:]
+	Location (on HDFS) to store the estimated regression parameters (the $\beta_j$'s), with the
+	intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available
+	\item[{\tt S}:] (default:\mbox{ }{\tt " "})
+	Location (on HDFS) to store the selected feature-ids in the order as computed by the algorithm;
+	by default the selected feature-ids are forwarded to the standard output.
+	\item[{\tt O}:] (default:\mbox{ }{\tt " "})
+	Location (on HDFS) to write certain summary statistics described in Table~\ref{table:GLM:stats};
+	by default the summary statistics are forwarded to the standard output. 
+	\item[{\tt link}:] (default:\mbox{ }{\tt 2})
+	Link function code to determine the link function~$\eta = g(\mu)$, see Table~\ref{table:commonGLMs}; currently the following link functions are supported: \\
+	{\tt 1} = log,
+	{\tt 2} = logit,
+	{\tt 3} = probit,
+	{\tt 4} = cloglog.
+	\item[{\tt yneg}:] (default:\mbox{ }{\tt 0.0})
+	Response value for Bernoulli ``No'' label, usually 0.0 or -1.0
+	\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
+	Intercept and shifting/rescaling of the features in~$X$:\\
+	{\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\
+	{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
+	{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
+	\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
+	Tolerance (epsilon) used in the convergence criterion: we terminate the outer iterations
+	when the deviance changes by less than this factor; see below for details.
+	\item[{\tt disp}:] (default:\mbox{ }{\tt 0.0})
+	Dispersion parameter, or {\tt 0.0} to estimate it from data
+	\item[{\tt moi}:] (default:\mbox{ }{\tt 200})
+	Maximum number of outer (Fisher scoring) iterations
+	\item[{\tt mii}:] (default:\mbox{ }{\tt 0})
+	Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum
+	limit provided
+	\item[{\tt thr}:] (default:\mbox{ }{\tt 0.01})
+	Threshold to stop the algorithm: if the decrease in the value of the AIC falls below {\tt thr},
+	no further features are checked and the algorithm stops.
+	\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+	Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+	see read/write functions in SystemML Language Reference for details.
+\end{Description}
+
+
+\noindent{\bf Details}
+\smallskip
+
+Similar to {\tt StepLinearRegDS.dml}, our stepwise GLM script builds a model by iteratively selecting predictive variables 
+using a forward selection strategy based on the AIC (\ref{eq:AIC}).
+Note that currently only the Bernoulli distribution family ({\tt fam=2} in Table~\ref{table:commonGLMs}) is supported, together with the following link functions: log, logit, probit, and cloglog ({\tt link $\in\{1,2,3,4\}$} in Table~\ref{table:commonGLMs}).  
+
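For readers who want to check the link functions numerically, the following is a small NumPy/SciPy sketch of the four supported links $\eta = g(\mu)$, keyed by the same codes as the {\tt link} argument above; it is an illustrative sketch and not part of {\tt StepGLM.dml}:

    import numpy as np
    from scipy.stats import norm

    def link(mu, code):
        # eta = g(mu) for the supported Bernoulli link functions
        if code == 1:                                # log
            return np.log(mu)
        if code == 2:                                # logit
            return np.log(mu / (1.0 - mu))
        if code == 3:                                # probit
            return norm.ppf(mu)
        if code == 4:                                # cloglog
            return np.log(-np.log(1.0 - mu))
        raise ValueError("unsupported link code")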
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+Similar to the outputs from {\tt GLM.dml}, the stepwise GLM script computes the estimated regression coefficients and stores them in matrix $B$ on HDFS; matrix $B$ follows the same format as the one produced by {\tt GLM.dml} (see Section~\ref{sec:GLM}).   
+Additionally, {\tt StepGLM.dml} outputs the variable indices (stored in the 1-column matrix $S$) in the order they have been selected by the algorithm, i.e., the $i$th entry in matrix $S$ stores the variable which improves the AIC the most in the $i$th iteration.  
+If the model with the lowest AIC includes no variables, matrix $S$ will be empty. 
+Moreover, the estimated summary statistics as defined in Table~\ref{table:GLM:stats}
+are printed out or stored in a file on HDFS (if requested);
+these statistics will be provided only if the selected model is nonempty, i.e., contains at least one variable.
+
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+	\hml -f StepGLM.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx	B=/user/biadmin/B.mtx S=/user/biadmin/selected.csv O=/user/biadmin/stats.csv link=2 yneg=-1.0 icpt=2 tol=0.000001  moi=100 mii=10 thr=0.05 fmt=csv
+	
+}
+
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/StepLinRegDS.tex
----------------------------------------------------------------------
diff --git a/alg-ref/StepLinRegDS.tex b/alg-ref/StepLinRegDS.tex
new file mode 100644
index 0000000..8c29fb1
--- /dev/null
+++ b/alg-ref/StepLinRegDS.tex
@@ -0,0 +1,122 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Stepwise Linear Regression}
+
+\noindent{\bf Description}
+\smallskip
+
+Our stepwise linear regression script selects a linear model based on the Akaike information criterion (AIC): 
+the model that gives rise to the lowest AIC is computed. \\
+
+\smallskip
+\noindent{\bf Usage}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}StepLinearRegDS.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} Y=}path/file
+{\tt{} B=}path/file
+{\tt{} S=}path/file
+{\tt{} O=}path/file
+{\tt{} icpt=}int
+{\tt{} thr=}double
+{\tt{} fmt=}format
+
+}
+
+\smallskip
+\noindent{\bf Arguments}
+\begin{Description}
+\item[{\tt X}:]
+Location (on HDFS) to read the matrix of feature vectors; each row contains
+one feature vector.
+\item[{\tt Y}:]
+Location (on HDFS) to read the 1-column matrix of response values
+\item[{\tt B}:]
+Location (on HDFS) to store the estimated regression parameters (the $\beta_j$'s), with the
+intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available
+\item[{\tt S}:] (default:\mbox{ }{\tt " "})
+Location (on HDFS) to store the selected feature-ids in the order as computed by the algorithm;
+by default the selected feature-ids are forwarded to the standard output.
+\item[{\tt O}:] (default:\mbox{ }{\tt " "})
+Location (on HDFS) to store the CSV-file of summary statistics defined in
+Table~\ref{table:linreg:stats}; by default the summary statistics are forwarded to the standard output.
+\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
+Intercept presence and shifting/rescaling the features in~$X$:\\
+{\tt 0} = no intercept (hence no~$\beta_0$), no shifting or rescaling of the features;\\
+{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
+{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
+\item[{\tt thr}:] (default:\mbox{ }{\tt 0.01})
+Threshold to stop the algorithm: if the decrease in the value of the AIC falls below {\tt thr},
+no further features are checked and the algorithm stops.
+\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+see read/write functions in SystemML Language Reference for details.
+\end{Description}
+
+
+\noindent{\bf Details}
+\smallskip
+
+Stepwise linear regression iteratively selects predictive variables in an automated procedure.
+Currently, our implementation supports forward selection: starting from an empty model (without any variable), 
+the algorithm examines the addition of each variable based on the AIC as a model comparison criterion. The AIC is defined as  
+\begin{equation}
+AIC = -2 \log{L} + 2 edf,\label{eq:AIC}
+\end{equation}    
+where $L$ denotes the likelihood of the fitted model and $edf$ is the equivalent degrees of freedom, i.e., the number of estimated parameters. 
+This procedure is repeated until no additional variable improves the model by at least the threshold 
+specified in the input parameter {\tt thr}. 
+
+For fitting a model in each iteration we use the ``direct solve'' method as in the script {\tt LinearRegDS.dml} discussed in Section~\ref{sec:LinReg}.  
+
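To illustrate the forward-selection loop driven by (\ref{eq:AIC}), here is a compact NumPy sketch; it uses an ordinary least-squares fit and a Gaussian likelihood in place of the script's ``direct solve'' internals, and all names are illustrative assumptions rather than the actual implementation:

    import numpy as np

    def aic_linreg(X, y, cols):
        # Least-squares fit on an intercept plus the selected columns;
        # AIC = -2 log L + 2 edf with a Gaussian likelihood and edf = number of coefficients.
        Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        rss = float(np.sum((y - Z @ beta) ** 2))
        n, edf = len(y), Z.shape[1]
        log_lik = -0.5 * n * (np.log(2.0 * np.pi * rss / n) + 1.0)
        return -2.0 * log_lik + 2.0 * edf

    def forward_select(X, y, thr=0.01):
        selected, remaining = [], list(range(X.shape[1]))
        best_aic = aic_linreg(X, y, selected)        # AIC of the intercept-only model
        while remaining:
            new_aic, best_j = min((aic_linreg(X, y, selected + [j]), j) for j in remaining)
            if best_aic - new_aic < thr:             # improvement below threshold: stop
                break
            selected.append(best_j)
            remaining.remove(best_j)
            best_aic = new_aic
        return selected, best_aic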
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+Similar to the outputs from {\tt LinearRegDS.dml}, the stepwise linear regression script computes 
+the estimated regression coefficients and stores them in matrix $B$ on HDFS. 
+The format of matrix $B$ is identical to the one produced by the scripts for linear regression (see Section~\ref{sec:LinReg}).   
+Additionally, {\tt StepLinearRegDS.dml} outputs the variable indices (stored in the 1-column matrix $S$) 
+in the order they have been selected by the algorithm, i.e., the $i$th entry in matrix $S$ corresponds to 
+the variable which improves the AIC the most in the $i$th iteration.  
+If the model with the lowest AIC includes no variables, matrix $S$ will be empty (contains one 0). 
+Moreover, the estimated summary statistics as defined in Table~\ref{table:linreg:stats}
+are printed out or stored in a file (if requested). 
+In the case where an empty model achieves the best AIC these statistics will not be produced. 
+
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+	\hml -f StepLinearRegDS.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
+	B=/user/biadmin/B.mtx S=/user/biadmin/selected.csv O=/user/biadmin/stats.csv
+	icpt=2 thr=0.05 fmt=csv
+	
+}
+
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/SystemML_Algorithms_Reference.bib
----------------------------------------------------------------------
diff --git a/alg-ref/SystemML_Algorithms_Reference.bib b/alg-ref/SystemML_Algorithms_Reference.bib
new file mode 100644
index 0000000..878e1dc
--- /dev/null
+++ b/alg-ref/SystemML_Algorithms_Reference.bib
@@ -0,0 +1,215 @@
+
+@article {Lin2008:logistic,
+   author       = {Chih-Jen Lin and Ruby C.\ Weng and S.\ Sathiya Keerthi},
+   title        = {Trust Region {N}ewton Method for Large-Scale Logistic Regression},
+   journal      = {Journal of Machine Learning Research},
+   month        = {April},
+   year         = {2008},
+   volume       = {9},
+   pages        = {627--650}
+}
+
+@book {Agresti2002:CDA,
+   author       = {Alan Agresti},
+   title        = {Categorical Data Analysis},
+   edition      = {Second},
+   series       = {Wiley Series in Probability and Statistics},
+   publisher    = {Wiley-Interscience},
+   year         = {2002},
+   pages        = {710}
+}
+
+@article {Nelder1972:GLM,
+   author       = {John Ashworth Nelder and Robert William Maclagan Wedderburn},
+   title        = {Generalized Linear Models},
+   journal      = {Journal of the Royal Statistical Society, Series~A (General)},
+   year         = {1972},
+   volume       = {135},
+   number       = {3},
+   pages        = {370--384}
+}
+
+@book {McCullagh1989:GLM,
+   author       = {Peter McCullagh and John Ashworth Nelder},
+   title        = {Generalized Linear Models},
+   edition      = {Second},
+   series       = {Monographs on Statistics and Applied Probability},
+   number       = {37},
+   year         = {1989},
+   publisher    = {Chapman~\&~Hall/CRC}, 
+   pages        = {532}
+}
+
+@book {Gill2000:GLM,
+   author       = {Jeff Gill},
+   title        = {Generalized Linear Models: A Unified Approach},
+   series       = {Sage University Papers Series on Quantitative Applications in the Social Sciences},
+   number       = {07-134},
+   year         = {2000},
+   publisher    = {Sage Publications},
+   pages        = {101}
+}
+
+@inproceedings {AgrawalKSX2002:hippocratic,
+   author       = {Rakesh Agrawal and Jerry Kiernan and Ramakrishnan Srikant and Yirong Xu},
+   title        = {Hippocratic Databases},
+   booktitle    = {Proceedings of the 28-th International Conference on Very Large Data Bases ({VLDB} 2002)},
+   address      = {Hong Kong, China},
+   month        = {August 20--23},
+   year         = {2002},
+   pages        = {143--154}
+}
+
+@book {Nocedal2006:Optimization,
+   title        = {Numerical Optimization},
+   author       = {Jorge Nocedal and Stephen Wright},
+   series       = {Springer Series in Operations Research and Financial Engineering},
+   pages        = {664},
+   edition      = {Second},
+   publisher    = {Springer},
+   year         = {2006}
+}
+
+@book {Hartigan1975:clustering,
+   author       = {John A.\ Hartigan},
+   title        = {Clustering Algorithms},
+   publisher    = {John Wiley~\&~Sons Inc.},
+   series       = {Probability and Mathematical Statistics},
+   month        = {April},
+   year         = {1975},
+   pages        = {365}
+}
+
+@inproceedings {ArthurVassilvitskii2007:kmeans,
+   title        = {{\tt k-means++}: The Advantages of Careful Seeding},
+   author       = {David Arthur and Sergei Vassilvitskii},
+   booktitle    = {Proceedings of the 18th Annual {ACM-SIAM} Symposium on Discrete Algorithms ({SODA}~2007)},
+   month        = {January 7--9}, 
+   year         = {2007},
+   address      = {New Orleans~{LA}, {USA}},
+   pages        = {1027--1035}
+}
+
+@article {AloiseDHP2009:kmeans,
+   author       = {Daniel Aloise and Amit Deshpande and Pierre Hansen and Preyas Popat},
+   title        = {{NP}-hardness of {E}uclidean Sum-of-squares Clustering},
+   journal      = {Machine Learning},
+   publisher    = {Kluwer Academic Publishers},
+   volume       = {75},
+   number       = {2}, 
+   month        = {May}, 
+   year         = {2009},
+   pages        = {245--248}
+}
+
+@article {Cochran1954:chisq,
+   author       = {William G.\ Cochran},
+   title        = {Some Methods for Strengthening the Common $\chi^2$ Tests},
+   journal      = {Biometrics},
+   volume       = {10},
+   number       = {4},
+   month        = {December},
+   year         = {1954},
+   pages        = {417--451}
+}
+
+@article {AcockStavig1979:CramersV,
+   author       = {Alan C.\ Acock and Gordon R.\ Stavig},
+   title        = {A Measure of Association for Nonparametric Statistics},
+   journal      = {Social Forces},
+   publisher    = {Oxford University Press},
+   volume       = {57},
+   number       = {4},
+   month        = {June},
+   year         = {1979},
+   pages        = {1381--1386}
+}
+
+@article {Stevens1946:scales,
+   author       = {Stanley Smith Stevens},
+   title        = {On the Theory of Scales of Measurement},
+   journal      = {Science},
+   month        = {June 7},
+   year         = {1946},
+   volume       = {103},
+   number       = {2684},
+   pages        = {677--680}
+}
+
+@book{collett2003:kaplanmeier,
+  title={Modelling Survival Data in Medical Research, Second Edition},
+  author={Collett, D.},
+  isbn={9781584883258},
+  lccn={2003040945},
+  series={Chapman \& Hall/CRC Texts in Statistical Science},
+  year={2003},
+  publisher={Taylor \& Francis}
+}
+
+@article{PetoPABCHMMPS1979:kaplanmeier,
+    title = {{Design and analysis of randomized clinical trials requiring prolonged observation of each patient. II. analysis and examples.}},
+    author = {Peto, R. and Pike, M. C. and Armitage, P. and Breslow, N. E. and Cox, D. R. and Howard, S. V. and Mantel, N. and McPherson, K. and Peto, J. and Smith, P. G.},
+    journal = {British journal of cancer},
+    number = {1},
+    pages = {1--39},
+    volume = {35},
+    year = {1977}
+}
+
+@inproceedings{ZhouWSP08:als,
+  author    = {Yunhong Zhou and
+               Dennis M. Wilkinson and
+               Robert Schreiber and
+               Rong Pan},
+  title     = {Large-Scale Parallel Collaborative Filtering for the Netflix Prize},
+  booktitle = {Algorithmic Aspects in Information and Management, 4th International
+               Conference, {AAIM} 2008, Shanghai, China, June 23-25, 2008. Proceedings},
+  pages     = {337--348},
+  year      = {2008}
+}
+
+@book{BreimanFOS84:dtree,
+  author    = {Leo Breiman and
+               J. H. Friedman and
+               R. A. Olshen and
+               C. J. Stone},
+  title     = {Classification and Regression Trees},
+  publisher = {Wadsworth},
+  year      = {1984},
+  isbn      = {0-534-98053-8},
+  timestamp = {Thu, 03 Jan 2002 11:51:52 +0100},
+  biburl    = {http://dblp.uni-trier.de/rec/bib/books/wa/BreimanFOS84},
+  bibsource = {dblp computer science bibliography, http://dblp.org}
+}
+
+@article{PandaHBB09:dtree,
+  author    = {Biswanath Panda and
+               Joshua Herbach and
+               Sugato Basu and
+               Roberto J. Bayardo},
+  title     = {{PLANET:} Massively Parallel Learning of Tree Ensembles with MapReduce},
+  journal   = {{PVLDB}},
+  volume    = {2},
+  number    = {2},
+  pages     = {1426--1437},
+  year      = {2009},
+  url       = {http://www.vldb.org/pvldb/2/vldb09-537.pdf},
+  timestamp = {Wed, 02 Sep 2009 09:21:18 +0200},
+  biburl    = {http://dblp.uni-trier.de/rec/bib/journals/pvldb/PandaHBB09},
+  bibsource = {dblp computer science bibliography, http://dblp.org}
+}
+
+@article{Breiman01:rforest,
+  author    = {Leo Breiman},
+  title     = {Random Forests},
+  journal   = {Machine Learning},
+  volume    = {45},
+  number    = {1},
+  pages     = {5--32},
+  year      = {2001},
+  url       = {http://dx.doi.org/10.1023/A:1010933404324},
+  doi       = {10.1023/A:1010933404324},
+  timestamp = {Thu, 26 May 2011 15:25:18 +0200},
+  biburl    = {http://dblp.uni-trier.de/rec/bib/journals/ml/Breiman01},
+  bibsource = {dblp computer science bibliography, http://dblp.org}
+}
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/SystemML_Algorithms_Reference.pdf
----------------------------------------------------------------------
diff --git a/alg-ref/SystemML_Algorithms_Reference.pdf b/alg-ref/SystemML_Algorithms_Reference.pdf
new file mode 100644
index 0000000..4087ba5
Binary files /dev/null and b/alg-ref/SystemML_Algorithms_Reference.pdf differ


[16/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1226] Add python validation to release process

Posted by de...@apache.org.
[SYSTEMML-1226] Add python validation to release process

Describe python test execution for release validation.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/b9d878c4
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/b9d878c4
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/b9d878c4

Branch: refs/heads/gh-pages
Commit: b9d878c47329f4d26e173ed8b046a523c6e19d5f
Parents: fe26aab
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Wed Feb 1 18:14:10 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Wed Feb 1 18:14:10 2017 -0800

----------------------------------------------------------------------
 release-process.md | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/b9d878c4/release-process.md
----------------------------------------------------------------------
diff --git a/release-process.md b/release-process.md
index d734f4f..1cc5c9f 100644
--- a/release-process.md
+++ b/release-process.md
@@ -184,6 +184,26 @@ sanity check on OS X after building the artifacts manually.
 	hadoop jar SystemML.jar -s "print('hello world');"
 
 
+## Python Tests
+
+For Spark 1.*, the Python tests (in `src/main/python/tests`) can be executed in the following manner:
+
+	PYSPARK_PYTHON=python3 pyspark --driver-class-path SystemML.jar test_matrix_agg_fn.py
+	PYSPARK_PYTHON=python3 pyspark --driver-class-path SystemML.jar test_matrix_binary_op.py
+	PYSPARK_PYTHON=python3 pyspark --driver-class-path SystemML.jar test_mlcontext.py
+	PYSPARK_PYTHON=python3 pyspark --driver-class-path SystemML.jar test_mllearn_df.py
+	PYSPARK_PYTHON=python3 pyspark --driver-class-path SystemML.jar test_mllearn_numpy.py
+
+For Spark 2.*, pyspark can't be used to run the Python tests, so they can be executed using
+spark-submit:
+
+	spark-submit --driver-class-path SystemML.jar test_matrix_agg_fn.py
+	spark-submit --driver-class-path SystemML.jar test_matrix_binary_op.py
+	spark-submit --driver-class-path SystemML.jar test_mlcontext.py
+	spark-submit --driver-class-path SystemML.jar test_mllearn_df.py
+	spark-submit --driver-class-path SystemML.jar test_mllearn_numpy.py
+
+
 ## Check LICENSE and NOTICE Files
 
 <a href="#release-candidate-checklist">Up to Checklist</a>


[20/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1229] Add Python MLContext example to Engine Dev Guide

Posted by de...@apache.org.
[SYSTEMML-1229] Add Python MLContext example to Engine Dev Guide


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/7283ddc8
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/7283ddc8
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/7283ddc8

Branch: refs/heads/gh-pages
Commit: 7283ddc8f2f4732d0b91fa4f7e5c58e5b87f0309
Parents: f80ab12
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Fri Feb 3 14:11:16 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Fri Feb 3 14:11:16 2017 -0800

----------------------------------------------------------------------
 engine-dev-guide.md | 60 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/7283ddc8/engine-dev-guide.md
----------------------------------------------------------------------
diff --git a/engine-dev-guide.md b/engine-dev-guide.md
index 0d793fa..8dff7f7 100644
--- a/engine-dev-guide.md
+++ b/engine-dev-guide.md
@@ -86,6 +86,66 @@ This SystemML script can be debugged in Eclipse using a Debug Configuration such
 
 * * *
 
+## Python MLContext API
+
+When working with the Python MLContext API (see `src/main/python/systemml/mlcontext.py`) during development,
+it can be useful to install it in editable mode (`-e`). This allows Python updates
+to take effect without requiring the SystemML python artifact to be built and installed.
+
+{% highlight bash %}
+mvn clean
+pip3 install -e src/main/python
+mvn clean package
+PYSPARK_PYTHON=python3 pyspark --driver-class-path target/SystemML.jar
+{% endhighlight %}
+
+<div class="codetabs">
+
+<div data-lang="Python 3" markdown="1">
+{% highlight python %}
+from systemml import MLContext, dml
+ml = MLContext(sc)
+script = dml("print('hello world')")
+ml.execute(script)
+{% endhighlight %}
+</div>
+
+<div data-lang="PySpark" markdown="1">
+{% highlight python %}
+Python 3.5.2 (default, Jul 28 2016, 21:28:07) 
+[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
+Type "help", "copyright", "credits" or "license" for more information.
+Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
+Setting default log level to "WARN".
+To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
+17/02/03 12:33:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
+17/02/03 12:33:56 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
+Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/  '_/
+   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
+      /_/
+
+Using Python version 3.5.2 (default, Jul 28 2016 21:28:07)
+SparkSession available as 'spark'.
+>>> from systemml import MLContext, dml
+>>> ml = MLContext(sc)
+
+Welcome to Apache SystemML!
+
+>>> script = dml("print('hello world')")
+>>> ml.execute(script)
+hello world
+MLResults
+{% endhighlight %}
+</div>
+
+</div>
+
+
+* * *
+
 ## Matrix Multiplication Operators
 
 In the following, we give an overview of backend-specific physical matrix multiplication operators in SystemML as well as their internally used matrix multiplication block operations.


[35/50] [abbrv] incubator-systemml git commit: [MINOR] Update documentation version to 'Latest'

Posted by de...@apache.org.
[MINOR] Update documentation version to 'Latest'


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/4ec1b9f4
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/4ec1b9f4
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/4ec1b9f4

Branch: refs/heads/gh-pages
Commit: 4ec1b9f402a228b6b8cc13cc1c477c237040e744
Parents: 032bc37
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Mon Mar 6 18:51:29 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Mon Mar 6 18:51:29 2017 -0800

----------------------------------------------------------------------
 _config.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/4ec1b9f4/_config.yml
----------------------------------------------------------------------
diff --git a/_config.yml b/_config.yml
index 1d213d7..ba1a808 100644
--- a/_config.yml
+++ b/_config.yml
@@ -11,7 +11,7 @@ include:
   - _modules
 
 # These allow the documentation to be updated with newer releases
-SYSTEMML_VERSION: 0.13.0
+SYSTEMML_VERSION: Latest
 
 # if 'analytics_on' is true, analytics section will be rendered on the HTML pages
 analytics_on: true


[37/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1393] Exclude alg ref and lang ref dirs from doc site

Posted by de...@apache.org.
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/SystemML_Algorithms_Reference.tex
----------------------------------------------------------------------
diff --git a/alg-ref/SystemML_Algorithms_Reference.tex b/alg-ref/SystemML_Algorithms_Reference.tex
new file mode 100644
index 0000000..75308c9
--- /dev/null
+++ b/alg-ref/SystemML_Algorithms_Reference.tex
@@ -0,0 +1,174 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\documentclass[letter]{article}
+\usepackage{graphicx,amsmath,amssymb,amsthm,subfigure,color,url,multirow,rotating,comment}
+\usepackage{tikz}
+\usepackage[normalem]{ulem}
+\usepackage[np,autolanguage]{numprint}
+\usepackage{tabularx}
+
+\usepackage[pdftex]{hyperref}
+\hypersetup{
+    unicode=false,          % non-Latin characters in Acrobat's bookmarks
+    pdftoolbar=true,        % show Acrobat's toolbar?
+    pdfmenubar=true,        % show Acrobat's menu?
+    pdffitwindow=true,      % window fit to page when opened
+    pdfstartview={FitV},    % fits the width of the page to the window
+    pdftitle={SystemML Algorithms Reference},    % title
+    pdfauthor={SystemML Team}, % author
+    pdfsubject={Documentation},   % subject of the document
+    pdfkeywords={},         % list of keywords
+    pdfnewwindow=true,      % links in new window
+    bookmarksnumbered=true, % put section numbers in bookmarks
+    bookmarksopen=true,     % open up bookmark tree
+    bookmarksopenlevel=1,   % \maxdimen level to which bookmarks are open
+    colorlinks=true,        % false: boxed links; true: colored links
+    linkcolor=black,        % color of internal links  
+    citecolor=blue,         % color of links to bibliography
+    filecolor=black,        % color of file links
+    urlcolor=black          % color of external links
+}
+
+
+\newtheorem{definition}{Definition}
+\newtheorem{example}{Example}
+
+\newcommand{\Paragraph}[1]{\vspace*{1ex} \noindent {\bf #1} \hspace*{1ex}}
+\newenvironment{Itemize}{\vspace{-0.5ex}\begin{itemize}\setlength{\itemsep}{-0.2ex}
+}{\end{itemize}\vspace{-0.5ex}}
+\newenvironment{Enumerate}{\vspace{-0.5ex}\begin{enumerate}\setlength{\itemsep}{-0.2ex}
+}{\end{enumerate}\vspace{-0.5ex}}
+\newenvironment{Description}{\vspace{-0.5ex}\begin{description}\setlength{\itemsep}{-0.2ex}
+}{\end{description}\vspace{-0.5ex}}
+
+
+\newcommand{\SystemML}{\texttt{SystemML} }
+\newcommand{\hml}{\texttt{hadoop jar SystemML.jar} }
+\newcommand{\pxp}{\mathbin{\texttt{\%\textasteriskcentered\%}}}
+\newcommand{\todo}[1]{{{\color{red}TODO: #1}}}
+\newcommand{\Normal}{\ensuremath{\mathop{\mathrm{Normal}}\nolimits}}
+\newcommand{\Prob}{\ensuremath{\mathop{\mathrm{Prob}\hspace{0.5pt}}\nolimits}}
+\newcommand{\E}{\ensuremath{\mathop{\mathrm{E}}\nolimits}}
+\newcommand{\mean}{\ensuremath{\mathop{\mathrm{mean}}\nolimits}}
+\newcommand{\Var}{\ensuremath{\mathop{\mathrm{Var}}\nolimits}}
+\newcommand{\Cov}{\ensuremath{\mathop{\mathrm{Cov}}\nolimits}}
+\newcommand{\stdev}{\ensuremath{\mathop{\mathrm{st.dev}}\nolimits}}
+\newcommand{\atan}{\ensuremath{\mathop{\mathrm{arctan}}\nolimits}}
+\newcommand{\diag}{\ensuremath{\mathop{\mathrm{diag}}\nolimits}}
+\newcommand{\const}{\ensuremath{\mathop{\mathrm{const}}\nolimits}}
+\newcommand{\eps}{\varepsilon}
+
+\sloppy
+
+%%%%%%%%%%%%%%%%%%%%% 
+% header
+%%%%%%%%%%%%%%%%%%%%%
+
+\title{\LARGE{{\SystemML Algorithms Reference}}} 
+\date{\today}
+
+%%%%%%%%%%%%%%%%%%%%%
+% document start
+%%%%%%%%%%%%%%%%%%%%%
+\begin{document}	
+
+%\pagenumbering{roman}
+\maketitle
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Descriptive Statistics}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\input{DescriptiveStats}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Classification}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\input{LogReg}
+
+\subsection{Support Vector Machines}
+
+\input{BinarySVM}
+
+\input{MultiSVM}
+
+\input{NaiveBayes}
+
+\input{DecisionTrees}
+
+\input{RandomForest}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Clustering}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\input{Kmeans}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Regression}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\input{LinReg}
+
+\input{StepLinRegDS}
+
+\input{GLM}
+
+\input{StepGLM}
+
+\input{GLMpredict.tex}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Matrix Factorization}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\input{pca}
+
+\input{ALS.tex}
+
+%%{\color{red}\subsection{GNMF}}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%{\color{red}\section{Sequence Mining}}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Survival Analysis}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\input{KaplanMeier}
+
+\input{Cox}
+
+\bibliographystyle{abbrv}
+
+\bibliography{SystemML_Algorithms_Reference}
+
+	
+%%%%%%%%%%%%%%%%%%%%%
+% document end
+%%%%%%%%%%%%%%%%%%%%%
+\end{document}
+
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/lang-ref/PyDML_Language_Reference.doc
----------------------------------------------------------------------
diff --git a/lang-ref/PyDML_Language_Reference.doc b/lang-ref/PyDML_Language_Reference.doc
new file mode 100644
index 0000000..b43b6db
Binary files /dev/null and b/lang-ref/PyDML_Language_Reference.doc differ

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/lang-ref/Python_syntax_for_DML.doc
----------------------------------------------------------------------
diff --git a/lang-ref/Python_syntax_for_DML.doc b/lang-ref/Python_syntax_for_DML.doc
new file mode 100644
index 0000000..ee43a6b
Binary files /dev/null and b/lang-ref/Python_syntax_for_DML.doc differ

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/lang-ref/README_HADOOP_CONFIG.txt
----------------------------------------------------------------------
diff --git a/lang-ref/README_HADOOP_CONFIG.txt b/lang-ref/README_HADOOP_CONFIG.txt
new file mode 100644
index 0000000..e34d4f3
--- /dev/null
+++ b/lang-ref/README_HADOOP_CONFIG.txt
@@ -0,0 +1,83 @@
+Usage
+-----
+The machine learning algorithms described in SystemML_Algorithms_Reference.pdf can be invoked
+from the hadoop command line using the algorithm-specific parameters described there. 
+
+Generic command line arguments are provided by the help command below.
+
+   hadoop jar SystemML.jar -? or -help 
+
+
+Recommended configurations
+--------------------------
+1) JVM Heap Sizes: 
+We recommend an equal-sized JVM configuration for clients, mappers, and reducers. For the client
+process this can be done via
+
+   export HADOOP_CLIENT_OPTS="-Xmx2048m -Xms2048m -Xmn256m" 
+   
+where Xmx specifies the maximum heap size, Xms the initial heap size, and Xmn the size of the young 
+generation. For Xmn values equal to or less than 15% of the max heap size, we guarantee the memory budget.
+
+For mapper or reducer JVM configurations, the following properties can be specified in mapred-site.xml,
+where 'child' refers to both mapper and reducer. If map and reduce are specified individually, they take 
+precedence over the generic property.
+
+  <property>
+    <name>mapreduce.child.java.opts</name> <!-- synonym: mapred.child.java.opts -->
+    <value>-Xmx2048m -Xms2048m -Xmn256m</value>
+  </property>
+  <property>
+    <name>mapreduce.map.java.opts</name> <!-- synonym: mapred.map.java.opts -->
+    <value>-Xmx2048m -Xms2048m -Xmn256m</value>
+  </property>
+  <property>
+    <name>mapreduce.reduce.java.opts</name> <!-- synonym: mapred.reduce.java.opts -->
+    <value>-Xmx2048m -Xms2048m -Xmn256m</value>
+  </property>
+ 
+
+2) CP Memory Limitation:
+There exist size limitations for in-memory matrices. Dense in-memory matrices are limited to 16GB 
+independent of their dimension. Sparse in-memory matrices are limited to 2G rows and 2G columns 
+but the overall matrix can be larger. These limitations apply only to in-memory matrices, 
+NOT to matrices stored in HDFS or involved in MR computations. Setting HADOOP_CLIENT_OPTS below those limitations 
+prevents runtime errors.
+
+3) Transparent Huge Pages (on Red Hat Enterprise Linux 6):
+Hadoop workloads might show very high System CPU utilization if THP is enabled. In case of such 
+behavior, we recommend disabling THP with
+   
+   echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
+   
+4) JVM Reuse:
+Performance benefits from JVM reuse because data sets that fit into the mapper memory budget are 
+reused across tasks per slot. However, Hadoop 1.0.3 JVM Reuse is incompatible with security (when 
+using the LinuxTaskController). The workaround is to use the DefaultTaskController. SystemML provides 
+a configuration property in SystemML-config.xml to enable JVM reuse on a per job level without
+changing the global cluster configuration.
+   
+   <jvmreuse>false</jvmreuse> 
+   
+5) Number of Reducers:
+The number of reducers can have significant impact on performance. SystemML provides a configuration
+property to set the default number of reducers per job without changing the global cluster configuration.
+In general, we recommend a setting of twice the number of nodes. Smaller numbers create fewer intermediate
+files; larger numbers increase the degree of parallelism for compute and parallel write. In
+SystemML-config.xml, set:
+   
+   <!-- default number of reduce tasks per MR job, default: 2 x number of nodes -->
+   <numreducers>12</numreducers> 
+
+6) SystemML temporary directories:
+SystemML uses temporary directories in two different locations: (1) on the local file system for temporary files created by 
+the client process, and (2) on HDFS for intermediate results between different MR jobs and between MR jobs 
+and in-memory operations. Locations of these directories can be configured in SystemML-config.xml with the
+following properties:
+
+   <!-- local fs tmp working directory-->
+   <localtmpdir>/tmp/systemml</localtmpdir>
+
+   <!-- hdfs tmp working directory--> 
+   <scratch>scratch_space</scratch> 
+ 
\ No newline at end of file


[45/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1393] Exclude alg ref and lang ref dirs from doc site

Posted by de...@apache.org.
[SYSTEMML-1393] Exclude alg ref and lang ref dirs from doc site

Rename "docs/Algorithms Reference" to docs/alg-ref.
Rename "docs/Language Reference" to docs/lang-ref.
Exclude docs/alg-ref and docs/lang-ref from docs site.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/358cfc9f
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/358cfc9f
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/358cfc9f

Branch: refs/heads/gh-pages
Commit: 358cfc9f3092aa9b2ab06f09e96f213363009d4d
Parents: 42e86e7
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Fri Mar 10 17:06:49 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Fri Mar 10 17:09:40 2017 -0800

----------------------------------------------------------------------
 Algorithms Reference/ALS.tex                    | 298 ---------
 Algorithms Reference/BinarySVM.tex              | 175 ------
 Algorithms Reference/Cox.tex                    | 340 -----------
 Algorithms Reference/DecisionTrees.tex          | 312 ----------
 Algorithms Reference/DescriptiveBivarStats.tex  | 438 --------------
 Algorithms Reference/DescriptiveStats.tex       | 115 ----
 Algorithms Reference/DescriptiveStratStats.tex  | 306 ----------
 Algorithms Reference/DescriptiveUnivarStats.tex | 603 -------------------
 Algorithms Reference/GLM.tex                    | 431 -------------
 Algorithms Reference/GLMpredict.tex             | 474 ---------------
 Algorithms Reference/KaplanMeier.tex            | 289 ---------
 Algorithms Reference/Kmeans.tex                 | 371 ------------
 Algorithms Reference/LinReg.tex                 | 328 ----------
 Algorithms Reference/LogReg.tex                 | 287 ---------
 Algorithms Reference/MultiSVM.tex               | 174 ------
 Algorithms Reference/NaiveBayes.tex             | 155 -----
 Algorithms Reference/PCA.tex                    | 142 -----
 Algorithms Reference/RandomForest.tex           | 215 -------
 Algorithms Reference/StepGLM.tex                | 132 ----
 Algorithms Reference/StepLinRegDS.tex           | 122 ----
 .../SystemML_Algorithms_Reference.bib           | 215 -------
 .../SystemML_Algorithms_Reference.pdf           | Bin 1266909 -> 0 bytes
 .../SystemML_Algorithms_Reference.tex           | 174 ------
 Language Reference/PyDML Language Reference.doc | Bin 209408 -> 0 bytes
 Language Reference/Python syntax for DML.doc    | Bin 207360 -> 0 bytes
 Language Reference/README_HADOOP_CONFIG.txt     |  83 ---
 _config.yml                                     |   4 +
 alg-ref/ALS.tex                                 | 298 +++++++++
 alg-ref/BinarySVM.tex                           | 175 ++++++
 alg-ref/Cox.tex                                 | 340 +++++++++++
 alg-ref/DecisionTrees.tex                       | 312 ++++++++++
 alg-ref/DescriptiveBivarStats.tex               | 438 ++++++++++++++
 alg-ref/DescriptiveStats.tex                    | 115 ++++
 alg-ref/DescriptiveStratStats.tex               | 306 ++++++++++
 alg-ref/DescriptiveUnivarStats.tex              | 603 +++++++++++++++++++
 alg-ref/GLM.tex                                 | 431 +++++++++++++
 alg-ref/GLMpredict.tex                          | 474 +++++++++++++++
 alg-ref/KaplanMeier.tex                         | 289 +++++++++
 alg-ref/Kmeans.tex                              | 371 ++++++++++++
 alg-ref/LinReg.tex                              | 328 ++++++++++
 alg-ref/LogReg.tex                              | 287 +++++++++
 alg-ref/MultiSVM.tex                            | 174 ++++++
 alg-ref/NaiveBayes.tex                          | 155 +++++
 alg-ref/PCA.tex                                 | 142 +++++
 alg-ref/RandomForest.tex                        | 215 +++++++
 alg-ref/StepGLM.tex                             | 132 ++++
 alg-ref/StepLinRegDS.tex                        | 122 ++++
 alg-ref/SystemML_Algorithms_Reference.bib       | 215 +++++++
 alg-ref/SystemML_Algorithms_Reference.pdf       | Bin 0 -> 1266909 bytes
 alg-ref/SystemML_Algorithms_Reference.tex       | 174 ++++++
 lang-ref/PyDML_Language_Reference.doc           | Bin 0 -> 209408 bytes
 lang-ref/Python_syntax_for_DML.doc              | Bin 0 -> 207360 bytes
 lang-ref/README_HADOOP_CONFIG.txt               |  83 +++
 53 files changed, 6183 insertions(+), 6179 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/ALS.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/ALS.tex b/Algorithms Reference/ALS.tex
deleted file mode 100644
index c2a5e3a..0000000
--- a/Algorithms Reference/ALS.tex	
+++ /dev/null
@@ -1,298 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Matrix Completion via Alternating Minimizations}
-\label{matrix_completion}
-
-\noindent{\bf Description}
-\smallskip
-
-Low-rank matrix completion is an effective technique for statistical data analysis widely used in data mining and machine learning applications.
-Matrix completion is a variant of low-rank matrix factorization with the goal of recovering a partially observed and potentially noisy matrix from a subset of its revealed entries.
-Perhaps the most popular application in which matrix completion has been successfully applied is collaborative filtering in recommender systems. 
-In this setting, the rows in the data matrix correspond to users, 
-the columns to items such as movies, and entries to feedback provided by users for items. 
-The goal is to predict missing entries of the rating matrix. 
-This implementation uses the alternating least-squares (ALS) technique for solving large-scale matrix completion problems.\\ 
-
-
-\smallskip
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-	{\tt{}-f }path/\/{\tt{}ALS.dml}
-	{\tt{} -nvargs}
-	{\tt{} V=}path/file
-	{\tt{} L=}path/file
-	{\tt{} R=}path/file
-%	{\tt{} VO=}path/file
-	{\tt{} rank=}int
-	{\tt{} reg=}L2$\mid$wL2%regularization
-	{\tt{} lambda=}double
-	{\tt{} fmt=}format
-	
-}
-
-
-\smallskip
-\noindent{\bf Arguments}
-\begin{Description}
-	\item[{\tt V}:]
-	Location (on HDFS) to read the input (user-item) matrix $V$ to be factorized
-	\item[{\tt L}:]
-	Location (on HDFS) to write the left (user) factor matrix $L$
-	\item[{\tt R}:]
-	Location (on HDFS) to write the right (item) factor matrix $R$
-%	\item[{\tt VO}:]
-%	Location (on HDFS) to write the input matrix $VO$ with empty rows and columns removed (if there are any)
-	\item[{\tt rank}:] (default:\mbox{ }{\tt 10})
-	Rank of the factorization
-	\item[{\tt reg}] (default:\mbox{ }{\tt L2})
-	Regularization:\\
-	{\tt L2} = L2 regularization;\\
- 	{\tt wL2} = weighted L2 regularization;\\
- 	if {\tt reg} is not provided no regularization will be performed. 
- 	\item[{\tt lambda}:] (default:\mbox{ }{\tt 0.000001})
- 	Regularization parameter
- 	\item[{\tt maxi}:] (default:\mbox{ }{\tt 50})
-	 Maximum number of iterations
-	\item[{\tt check}:] (default:\mbox{ }{\tt FALSE})
-	Check for convergence after every iteration, i.e., updating $L$ and $R$ once
-	\item[{\tt thr}:] (default:\mbox{ }{\tt 0.0001})
-	Assuming {\tt check=TRUE}, the algorithm stops and convergence is declared 
-	if the decrease in loss in any two consecutive iterations falls below threshold {\tt thr}; 
-	if {\tt check=FALSE} parameter {\tt thr} is ignored.
-	\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-	Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}
-\end{Description}
- 
- \smallskip
- \noindent{\bf Usage: ALS Prediction/Top-K Prediction}
- \smallskip
- 
- {\hangindent=\parindent\noindent\it%
- 	{\tt{}-f }path/\/{\tt{}ALS\_predict.dml}
- 	{\tt{} -nvargs}
- 	{\tt{} X=}path/file
- 	{\tt{} Y=}path/file
- 	{\tt{} L=}path/file
- 	{\tt{} R=}path/file
- 	{\tt{} Vrows=}int
- 	{\tt{} Vcols=}int
- 	{\tt{} fmt=}format
- 	
- }\smallskip
- 
- 
-  \smallskip  
-  {\hangindent=\parindent\noindent\it%
-  	{\tt{}-f }path/\/{\tt{}ALS\_topk\_predict.dml}
-  	{\tt{} -nvargs}
-  	{\tt{} X=}path/file
-  	{\tt{} Y=}path/file
-  	{\tt{} L=}path/file
-  	{\tt{} R=}path/file
-  	{\tt{} V=}path/file
-  	{\tt{} K=}int
-  	{\tt{} fmt=}format
-  	
-  }\smallskip
- 
-%   \noindent{\bf Arguments --- Prediction}
-%   \begin{Description}
-%   	\item[{\tt X}:]
-%   	Location (on HDFS) to read the input matrix $X$ containing user-ids (first column) and item-ids (second column) 
-%   	\item[{\tt L}:]
-%   	Location (on HDFS) to read the left (user) factor matrix $L$
-%   	\item[{\tt R}:]
-%   	Location (on HDFS) to read the right (item) factor matrix $R$
-%   	\item[{\tt Y}:]
-%   	Location (on HDFS) to write the output matrix $Y$ containing user-ids (first column), item-ids (second column) and predicted ratings (third column)
-%   	\item[{\tt Vrows}:] 
-%   	Number of rows of the user-item matrix $V$
-%   	\item[{\tt Vcols}] 
-%   	Number of columns of the user-item matrix $V$ 
-%   	\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-%   	Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}
-%   \end{Description}
-   
-
-  \noindent{\bf Arguments --- Prediction/Top-K Prediction}
-  \begin{Description}
-  	\item[{\tt V}:]
-  	Location (on HDFS) to read the user-item matrix $V$ 
-  	\item[{\tt X}:]
-  	Location (on HDFS) to read the input matrix $X$ with following format:
-  	\begin{itemize}
-  		\item for {ALS\_predict.dml}: a 2-column matrix that contains the user-ids (first column) and the item-ids (second column),
-  		\item for {ALS\_topk\_predict.dml}: a 1-column matrix that contains the user-ids.
-  	\end{itemize} 
-  	\item[{\tt Y}:]
-  	Location (on HDFS) to write the output of prediction with the following format:
-  	\begin{itemize}
-  		\item for {ALS\_predict.dml}: a 3-column matrix that contains the user-ids (first column), the item-ids (second column) and the predicted ratings (third column),
-  		\item for {ALS\_topk\_predict.dml}: a ($K+1$)-column matrix that contains the user-ids in the first column and the top-K item-ids in the remaining $K$ columns will be stored at {\tt Y}.
-  		Additionally, a matrix with the same dimensions that contains the corresponding actual top-K ratings will be stored at {\tt Y.ratings}; see below for details. 
-  	\end{itemize}
-%  	Note the following output format in predicting top-K items. 
-%  	For a user with no available ratings in $V$ no 
-%  	top-K items will be provided, i.e., the corresponding row in $Y$ will contains 0s.   
-%  	Moreover, $K'<K$ items with highest predicted ratings will be provided for a user $i$ 
-%  	if the number of missing ratings $K'$ (i.e., those with 0 value in $V$) for $i$ is less than $K$.
-  	\item[{\tt L}:]
-  	Location (on HDFS) to read the left (user) factor matrix $L$
-  	\item[{\tt R}:]
-  	Location (on HDFS) to write the right (item) factor matrix $R$
-   	\item[{\tt Vrows}:] 
-   	Number of rows of $V$ (i.e., number of users)
-   	\item[{\tt Vcols}] 
-   	Number of columns of $V$ (i.e., number of items) 
-  	\item[{\tt K}:] (default:\mbox{ }{\tt 5})
-  	Number of top-K items for top-K prediction
-  	\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-  	Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}
-  \end{Description}
-  
- \noindent{\bf Details}
- \smallskip
- 
- Given an $m \times n$ input matrix $V$ and a rank parameter $r \ll \min{(m,n)}$, low-rank matrix factorization seeks to find an $m \times r$ matrix $L$ and an $r \times n$ matrix $R$ such that $V \approx LR$, i.e., we aim to approximate $V$ by the low-rank matrix $LR$.
- The quality of the approximation is determined by an application-dependent loss function $\mathcal{L}$. We aim at finding the loss-minimizing factor matrices, i.e., 
- \begin{equation}\label{eq:problem}
- (L^*, R^*) = \textrm{argmin}_{L,R}{\mathcal{L}(V,L,R)}.
- \end{equation} 
- In the context of collaborative filtering in recommender systems, it is often the case that the input matrix $V$ contains several missing entries. Such entries are coded with the 0 value and the loss function is computed only based on the nonzero entries in $V$, i.e.,
- \begin{equation*} %\label{eq:loss}
- \mathcal{L}=\sum_{(i,j)\in\Omega}l(V_{ij},L_{i*},R_{*j}),
- \end{equation*} 
- where $L_{i*}$ and $R_{*j}$, respectively, denote the $i$th row of $L$ and the $j$th column of $R$, $\Omega=\{\omega_1,\dots,\omega_N\}$ denotes the training set containing the observed (nonzero) entries in $V$, and $l$ is some local loss function.  
- %for some training set $\Omega$ that contains the observed (nonzero) entries in $V$ and some local loss function $l$. In the above formula, 
- 
- ALS is an optimization technique that can be used to solve quadratic problems. 
- For matrix completion, the algorithm repeatedly keeps one of the unknown matrices ($L$ or $R$) fixed and optimizes the other one. In particular, ALS alternates between recomputing the rows of $L$ in one step and the columns of $R$ in the subsequent step.  
- Our implementation of the ALS algorithm supports the loss functions summarized in Table~\ref{tab:loss_functions} commonly used for matrix completion~\cite{ZhouWSP08:als}. 
- %
- \begin{table}[t]
- 	\centering
- 	\label{tab:loss_functions}
- 	\begin{tabular}{|ll|} \hline
- 		Loss & Definition \\ \hline
-% 		$\mathcal{L}_\text{Sl}$ & $\sum_{i,j} (V_{ij} - [LR]_{ij})^2$ \\
-% 		$\mathcal{L}_\text{Sl+L2}$ & $\mathcal{L}_\text{Sl} + \lambda \Bigl( \sum_{ik} L_{ik}^2 + \sum_{kj} R_{kj}^2 \Bigr)$ \\
- 		$\mathcal{L}_\text{Nzsl}$ & $\sum_{i,j:V_{ij}\neq 0} (V_{ij} - [LR]_{ij})^2$ \\
- 		$\mathcal{L}_\text{Nzsl+L2}$ & $\mathcal{L}_\text{Nzsl} + \lambda \Bigl( \sum_{ik} L_{ik}^2 + \sum_{kj} R_{kj}^2 \Bigr)$ \\
- 		$\mathcal{L}_\text{Nzsl+wL2}$ & $\mathcal{L}_\text{Nzsl} + \lambda \Bigl(\sum_{ik}N_{i*} L_{ik}^2 + \sum_{kj}N_{*j} R_{kj}^2 \Bigr)$ \\ \hline 
- 	\end{tabular}
- 	\caption{Popular loss functions supported by our ALS implementation; $N_{i*}$ and $N_{*j}$, respectively, denote the number of nonzero entries in row $i$ and column $j$ of $V$.}
- \end{table}
- 
- Note that the matrix completion problem as defined in (\ref{eq:problem}) is a non-convex problem for all loss functions from Table~\ref{tab:loss_functions}. 
- However, when fixing one of the matrices $L$ or $R$, we get a least-squares problem with a globally optimal solution.  
- For example, for the case of $\mathcal{L}_\text{Nzsl+wL2}$ we have the following closed form solutions
-  \begin{align*}
-  L^\top_{n+1,i*} &\leftarrow (R^{(i)}_n {[R^{(i)}_n]}^\top + \lambda N_2 I)^{-1} R_n V^\top_{i*}, \\
-  R_{n+1,*j} &\leftarrow ({[L^{(j)}_{n+1}]}^\top L^{(j)}_{n+1} + \lambda N_1 I)^{-1} L^\top_{n+1} V_{*j}, 
-  \end{align*}
- where $L_{n+1,i*}$ (resp. $R_{n+1,*j}$) denotes the $i$th row of $L_{n+1}$ (resp. $j$th column of $R_{n+1}$), $\lambda$ denotes 
- the regularization parameter, $I$ is the identity matrix of appropriate dimensionality, 
- $V_{i*}$ (resp. $V_{*j}$) denotes the revealed entries in row $i$ (column $j$), 
- $R^{(i)}_n$ (resp. $L^{(j)}_{n+1}$) refers to the corresponding columns of $R_n$ (rows of $L_{n+1}$), 
- and $N_1$ (resp. $N_2$) denotes a diagonal matrix that contains the number of nonzero entries in row $i$ (column $j$) of $V$.   
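 For intuition, the row-wise update above can be sketched in Python/NumPy as follows. This is an illustrative sketch of one ALS half-step under the $\mathcal{L}_\text{Nzsl+wL2}$ loss, not the {\tt ALS.dml} implementation; the function name and the dense matrix representation are assumptions made for brevity.
\begin{verbatim}
import numpy as np

# One ALS half-step: recompute every row of L with R held fixed,
# minimizing the Nzsl+wL2 loss over the observed (nonzero) entries of V.
def als_update_L(V, R, lam):
    m, n = V.shape
    r = R.shape[0]                      # R has shape r x n
    L = np.zeros((m, r))
    for i in range(m):
        obs = np.nonzero(V[i, :])[0]    # items rated by user i
        if obs.size == 0:
            continue                    # user without ratings: row stays 0
        Ri = R[:, obs]                  # r x |obs| submatrix R^{(i)}
        A = Ri @ Ri.T + lam * obs.size * np.eye(r)
        b = Ri @ V[i, obs]
        L[i, :] = np.linalg.solve(A, b)
    return L
\end{verbatim}
 The update for the columns of $R$ is symmetric, with the roles of users and items exchanged.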
- 
-% For example, for the case of $\mathcal{L}_\text{Sl-L2}$ we have the following closed form solutions
-% \begin{align*}
-% L^\top_{n+1,i*} &\leftarrow (R_n {[R_n]}^\top + \lambda I)^{-1} R_n V^\top_{i*}, \\
-% R_{n+1,*j} &\leftarrow ({[L_{n+1}]}^\top L_{n+1} + \lambda I)^{-1} L^\top_{n+1} V_{*j}, 
-% \end{align*}
-% where $L_{n+1,i*}$ (resp. $R_{n+1,*j}$) denotes the $i$th row of $L_{n+1}$ (resp. $j$th column of $R_{n+1}$), $\lambda$ denotes 
-% the regularization parameter and $I$ is the identity matrix of appropriate dimensionality. 
-% For the case of $\mathcal{L}_\text{Nzsl}$ we need to remove the equation that correspond to zero entries of $V$ from the least-squares problems. 
-% With wL2 we get the following equations
-% \begin{align*}
-% L^\top_{n+1,i*} &\leftarrow (R^{(i)}_n {[R^{(i)}_n]}^\top + \lambda N_2 I)^{-1} R_n V^\top_{i*}, \\
-% R_{n+1,*j} &\leftarrow ({[L^{(j)}_{n+1}]}^\top L^{(j)}_{n+1} + \lambda N_1 I)^{-1} L^\top_{n+1} V_{*j}, 
-% \end{align*}
-% where $V_{i*}$ (resp. $V_{*j}$) denotes the revealed entries in row $i$ (column $j$), 
-% $R^{(i)}_n$ (resp. $L^{(j)}_{n+1}$) refers to the corresponding columns of $R_n$ (rows of $L_{n+1}$), 
-% and $N_1$ (resp. $N_2$) denotes a diagonal matrix that contains the number of nonzero entries in row $i$ (column $j$) of $V$.
- 
- \textbf{Prediction.} 
- Based on the factor matrices computed by ALS we provide two prediction scripts:   
- \begin{Enumerate}
- 	\item {\tt ALS\_predict.dml} computes the predicted ratings for a given list of users and items;
- 	\item {\tt ALS\_topk\_predict.dml} computes top-K item (where $K$ is given as input) with highest predicted ratings together with their corresponding ratings for a given list of users.
- \end{Enumerate} 
-  
- \smallskip
- \noindent{\bf Returns}
- \smallskip
- 
- We output the factor matrices $L$ and $R$ after the algorithm has converged. The algorithm is declared converged if one of the following two criteria is met: 
- (1) the decrease in the value of the loss function falls below {\tt thr}
- given as an input parameter (if parameter {\tt check=TRUE}), or (2) the maximum number of iterations (defined by parameter {\tt maxi}) is reached. 
- Note that for a given user $i$ prediction is possible only if user $i$ has rated at least one item, i.e., row $i$ of matrix $V$ has at least one nonzero entry. 
- If some users have not rated any items, the corresponding rows of $L$ will contain only 0s.
- Similarly, if some items have not been rated at all, the corresponding columns of $R$ will contain only 0s. 
- Our prediction scripts output (1) the predicted ratings for a given list of users and items, and (2) for a given list of users, the top-K items with the highest predicted ratings together with those ratings. Note that predictions are only provided for users who have rated at least one item, i.e., users whose corresponding rows in $V$ contain at least one nonzero entry. 
-% Moreover in the case of top-K prediction, if the number of predicted ratings---i.e., missing entries--- for some user $i$ is less than the input parameter $K$, all the predicted ratings for user $i$ will be provided.
-
- 
-
- 
- 
-  
- \smallskip
- \noindent{\bf Examples}
- \smallskip
-  
-% {\hangindent=\parindent\noindent\tt
-% 	\hml -f ALS.dml -nvargs V=/user/biadmin/V L=/user/biadmin/L R=/user/biadmin/R rank=10 reg="L2" lambda=0.0001 fmt=csv 
-% 		
-% }
-  
- {\hangindent=\parindent\noindent\tt
- 	\hml -f ALS.dml -nvargs V=/user/biadmin/V L=/user/biadmin/L R=/user/biadmin/R rank=10 reg="wL2" lambda=0.0001 maxi=50 check=TRUE thr=0.001 fmt=csv	
- 	
- }
- 
- \noindent To compute predicted ratings for a given list of users and items:
- 
- {\hangindent=\parindent\noindent\tt
-  	\hml -f ALS-predict.dml -nvargs X=/user/biadmin/X Y=/user/biadmin/Y L=/user/biadmin/L R=/user/biadmin/R  Vrows=100000 Vcols=10000 fmt=csv	
-  	
- }
-  
- \noindent To compute top-K items with highest predicted ratings together with the predicted ratings for a given list of users:
- 
- {\hangindent=\parindent\noindent\tt
-   	\hml -f ALS-top-predict.dml -nvargs X=/user/biadmin/X Y=/user/biadmin/Y L=/user/biadmin/L R=/user/biadmin/R V=/user/biadmin/V K=10 fmt=csv	
-   	
- }
-
-
-%
-%\begin{itemize}
-%	\item Y. Zhou, D. K. Wilkinson, R. Schreiber, and R. Pan. \newblock{Large-scale parallel collaborative flitering for the Netflix prize}. In Proceedings of the International
-%	Conference on Algorithmic Aspects in Information and Management (AAIM), 2008, 337-348.
-%\end{itemize}
- 
- 
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/BinarySVM.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/BinarySVM.tex b/Algorithms Reference/BinarySVM.tex
deleted file mode 100644
index 7ff5b06..0000000
--- a/Algorithms Reference/BinarySVM.tex	
+++ /dev/null
@@ -1,175 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsubsection{Binary-class Support Vector Machines}
-\label{l2svm}
-
-\noindent{\bf Description}
-
-Support Vector Machines are used to model the relationship between a categorical 
-dependent variable y and one or more explanatory variables denoted X. This 
-implementation learns (and predicts with) a binary class support vector machine 
-(y with domain size 2).
-\\
-
-\noindent{\bf Usage}
-
-\begin{tabbing}
-\texttt{-f} \textit{path}/\texttt{l2-svm.dml -nvargs} 
-\=\texttt{X=}\textit{path}/\textit{file} 
-  \texttt{Y=}\textit{path}/\textit{file} 
-  \texttt{icpt=}\textit{int} 
-  \texttt{tol=}\textit{double}\\
-\>\texttt{reg=}\textit{double} 
-  \texttt{maxiter=}\textit{int} 
-  \texttt{model=}\textit{path}/\textit{file}\\
-\>\texttt{Log=}\textit{path}/\textit{file}
-  \texttt{fmt=}\textit{csv}$\vert$\textit{text}
-\end{tabbing}
-
-\begin{tabbing}
-\texttt{-f} \textit{path}/\texttt{l2-svm-predict.dml -nvargs} 
-\=\texttt{X=}\textit{path}/\textit{file} 
-  \texttt{Y=}\textit{path}/\textit{file} 
-  \texttt{icpt=}\textit{int} 
-  \texttt{model=}\textit{path}/\textit{file}\\
-\>\texttt{scores=}\textit{path}/\textit{file}
-  \texttt{accuracy=}\textit{path}/\textit{file}\\
-\>\texttt{confusion=}\textit{path}/\textit{file}
-  \texttt{fmt=}\textit{csv}$\vert$\textit{text}
-\end{tabbing}
-
-%%\begin{verbatim}
-%%-f path/l2-svm.dml -nvargs X=path/file Y=path/file icpt=int tol=double
-%%                      reg=double maxiter=int model=path/file
-%%\end{verbatim}
-
-\noindent{\bf Arguments}
-
-\begin{itemize}
-\item X: Location (on HDFS) to read the matrix of feature vectors; 
-each row constitutes one feature vector.
-\item Y: Location to read the one-column matrix of (categorical) 
-labels that correspond to feature vectors in X. Binary class labels 
-can be expressed in one of two encodings: $\{-1,+1\}$ or $\{1,2\}$. Note that
-this argument is optional for prediction.
-\item icpt (default: {\tt 0}): If set to 1 then a constant bias column is 
-added to X. 
-\item tol (default: {\tt 0.001}): Procedure terminates early if the reduction
-in objective function value is less than tolerance times the initial objective
-function value.
-\item reg (default: {\tt 1}): Regularization constant. See details to find 
-out where lambda appears in the objective function. If one were interested 
-in drawing an analogy with the C parameter in C-SVM, then C = 2/lambda. 
-Usually, cross validation is employed to determine the optimum value of 
-lambda.
-\item maxiter (default: {\tt 100}): The maximum number of iterations.
-\item model: Location (on HDFS) that contains the learnt weights.
-\item Log: Location (on HDFS) to collect various metrics (e.g., objective 
-function value etc.) that depict progress across iterations while training.
-\item fmt (default: {\tt text}): Specifies the output format. Choice of 
-comma-separated values (csv) or as a sparse-matrix (text).
-\item scores: Location (on HDFS) to store scores for a held-out test set.
-Note that this is an optional argument.
-\item accuracy: Location (on HDFS) to store the accuracy computed on a
-held-out test set. Note that this is an optional argument.
-\item confusion: Location (on HDFS) to store the confusion matrix
-computed using a held-out test set. Note that this is an optional 
-argument.
-\end{itemize}
-
-\noindent{\bf Details}
-
-Support vector machines learn a classification function by solving the
-following optimization problem ($L_2$-SVM):
-\begin{eqnarray*}
-&\textrm{argmin}_w& \frac{\lambda}{2} ||w||_2^2 + \sum_i \xi_i^2\\
-&\textrm{subject to:}& y_i w^{\top} x_i \geq 1 - \xi_i ~ \forall i
-\end{eqnarray*}
-where $x_i$ is an example from the training set with its label given by $y_i$, 
-$w$ is the vector of parameters and $\lambda$ is the regularization constant 
-specified by the user.
-
-To account for the missing bias term, one may augment the data with a column
-of constants, which is achieved by setting the intercept argument ({\tt icpt}) to 1 (C-J Hsieh 
-et al, 2008).
-
-This implementation optimizes the primal directly (Chapelle, 2007). It uses 
-nonlinear conjugate gradient descent to minimize the objective function 
-coupled with choosing step-sizes by performing one-dimensional Newton 
-minimization in the direction of the gradient.
-\\
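To make the objective concrete, the squared-hinge (L2-SVM) loss and its gradient can be sketched in Python/NumPy as below. This is only an illustration of the primal objective, assuming labels in $\{-1,+1\}$; the actual {\tt l2-svm.dml} script minimizes it with nonlinear conjugate gradient and a one-dimensional Newton line search rather than the plain gradient steps shown here.
\begin{verbatim}
import numpy as np

# Primal L2-SVM objective: (lambda/2)*||w||^2 + sum_i max(0, 1 - y_i w'x_i)^2
def objective_and_gradient(w, X, y, lam):
    active = np.maximum(1.0 - y * (X @ w), 0.0)   # slack values xi_i
    obj = 0.5 * lam * (w @ w) + np.sum(active ** 2)
    grad = lam * w - 2.0 * X.T @ (y * active)
    return obj, grad

# Plain gradient descent, for illustration only.
def fit(X, y, lam=1.0, step=1e-3, iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        _, g = objective_and_gradient(w, X, y, lam)
        w -= step * g
    return w
\end{verbatim}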
-
-\noindent{\bf Returns}
-
-The learnt weights produced by l2-svm.dml are populated into a single column matrix 
-and written to file on HDFS (see model in section Arguments). The number of rows in 
-this matrix is ncol(X) if intercept was set to 0 during invocation and ncol(X) + 1 
-otherwise. The bias term, if used, is placed in the last row. Depending on what arguments
-are provided during invocation, l2-svm-predict.dml may compute one or more of scores, 
-accuracy and confusion matrix in the output format specified. 
-\\
-
-%%\noindent{\bf See Also}
-%%
-%%In case of multi-class classification problems (y with domain size greater than 2), 
-%%please consider using a multi-class classifier learning algorithm, e.g., multi-class
-%%support vector machines (see Section \ref{msvm}). To model the relationship between 
-%%a scalar dependent variable y and one or more explanatory variables X, consider 
-%%Linear Regression instead (see Section \ref{linreg-solver} or Section 
-%%\ref{linreg-iterative}).
-%%\\
-%%
-\noindent{\bf Examples}
-
-\begin{verbatim}
-hadoop jar SystemML.jar -f l2-svm.dml -nvargs X=/user/biadmin/X.mtx 
-                                              Y=/user/biadmin/y.mtx 
-                                              icpt=0 tol=0.001 fmt=csv
-                                              reg=1.0 maxiter=100 
-                                              model=/user/biadmin/weights.csv
-                                              Log=/user/biadmin/Log.csv
-\end{verbatim}
-
-\begin{verbatim}
-hadoop jar SystemML.jar -f l2-svm-predict.dml -nvargs X=/user/biadmin/X.mtx 
-                                                      Y=/user/biadmin/y.mtx 
-                                                      icpt=0 fmt=csv
-                                                      model=/user/biadmin/weights.csv
-                                                      scores=/user/biadmin/scores.csv
-                                                      accuracy=/user/biadmin/accuracy.csv
-                                                      confusion=/user/biadmin/confusion.csv
-\end{verbatim}
-
-\noindent{\bf References}
-
-\begin{itemize}
-\item W. T. Vetterling and B. P. Flannery. \newblock{\em Conjugate Gradient Methods in Multidimensions in 
-Numerical Recipes in C - The Art in Scientific Computing}. \newblock W. H. Press and S. A. Teukolsky
-(eds.), Cambridge University Press, 1992.
-\item J. Nocedal and  S. J. Wright. Numerical Optimization, Springer-Verlag, 1999.
-\item C-J Hsieh, K-W Chang, C-J Lin, S. S. Keerthi and S. Sundararajan. \newblock{\em A Dual Coordinate 
-Descent Method for Large-scale Linear SVM.} \newblock International Conference of Machine Learning
-(ICML), 2008.
-\item Olivier Chapelle. \newblock{\em Training a Support Vector Machine in the Primal}. \newblock Neural 
-Computation, 2007.
-\end{itemize}
-

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/Cox.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/Cox.tex b/Algorithms Reference/Cox.tex
deleted file mode 100644
index a355df7..0000000
--- a/Algorithms Reference/Cox.tex	
+++ /dev/null
@@ -1,340 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Cox Proportional Hazard Regression Model}
-
-\noindent{\bf Description}
-\smallskip
-
-
-The Cox (proportional hazard or PH) model is a semi-parametric statistical approach commonly used for analyzing survival data.
-Unlike non-parametric approaches, e.g., the Kaplan-Meier estimates (Section \ref{sec:kaplan-meier}), which can be used to analyze a single sample of survival data or to compare between groups of survival times, the Cox PH model captures the dependency of the survival times on the values of {\it explanatory variables} (i.e., covariates) recorded for each individual at the time origin. Our focus is on covariates that do not change value over time, i.e., time-independent covariates, and that may be categorical (ordinal or nominal) as well as continuous-valued. \\  
-
-
-\smallskip
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}Cox.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} TE=}path/file
-{\tt{} F=}path/file
-{\tt{} R=}path/file
-{\tt{} M=}path/file
-{\tt{} S=}path/file
-{\tt{} T=}path/file
-{\tt{} COV=}path/file
-{\tt{} RT=}path/file
-{\tt{} XO=}path/file
-{\tt{} MF=}path/file
-{\tt{} alpha=}double
-{\tt{} fmt=}format
-
-}
-
-\smallskip
-\noindent{\bf Arguments --- Model Fitting/Prediction}
-\begin{Description}
-\item[{\tt X}:]
-Location (on HDFS) to read the input matrix of the survival data containing: 
-\begin{Itemize}
-	\item timestamps,
-	\item whether event occurred (1) or data is censored (0),
-	\item feature vectors
-\end{Itemize}
-\item[{\tt Y}:]
-Location (on HDFS) to the read matrix used for prediction 
-\item[{\tt TE}:]
-Location (on HDFS) to read the 1-column matrix $TE$ that contains the column indices of the input matrix $X$ corresponding to timestamps (first entry) and event information (second entry)
-\item[{\tt F}:]
-Location (on HDFS) to read the 1-column matrix $F$ that contains the column indices of the input matrix $X$ corresponding to the features to be used for fitting the Cox model
-\item[{\tt R}:] (default:\mbox{ }{\tt " "})
-If factors (i.e., categorical features) are available in the input matrix $X$, location (on HDFS) to read matrix $R$ containing the start (first column) and end (second column) indices of each factor in $X$;
-alternatively, the user can specify the indices of the baseline level of each factor, which needs to be removed from $X$. If $R$ is not provided, by default all variables are considered to be continuous-valued.
-\item[{\tt M}:]							
-Location (on HDFS) to store the results of Cox regression analysis including regression coefficients $\beta_j$s, their standard errors, confidence intervals, and $P$-values  
-\item[{\tt S}:] (default:\mbox{ }{\tt " "})
-Location (on HDFS) to store a summary of some statistics of the fitted model including number of records, number of events, log-likelihood, AIC, Rsquare (Cox \& Snell), and maximum possible Rsquare 
-\item[{\tt T}:] (default:\mbox{ }{\tt " "})
-Location (on HDFS) to store the results of Likelihood ratio test, Wald test, and Score (log-rank) test of the fitted model
-\item[{\tt COV}:]
-Location (on HDFS) to store the variance-covariance matrix of $\beta_j$s; note that parameter {\tt COV} needs to be provided as input for prediction.
-\item[{\tt RT}:]
-Location (on HDFS) to store matrix $RT$ containing the order-preserving recoded timestamps from $X$; note that parameter {\tt RT} needs to be provided as input for prediction.
-\item[{\tt XO}:]
-Location (on HDFS) to store the input matrix $X$ ordered by the timestamps; note that parameter {\tt XO} needs to be provided as input for prediction.
-\item[{\tt MF}:]
-Location (on HDFS) to store column indices of $X$ excluding the baseline factors if available; note that parameter {\tt MF} needs to be provided as input for prediction.
-\item[{\tt P}:]
-Location (on HDFS) to store matrix $P$ containing the results of prediction
-\item[{\tt alpha}:] (default:\mbox{ }{\tt 0.05})
-Parameter to compute a $100(1-\alpha)\%$ confidence interval for the $\beta_j$s 
-\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
-Tolerance (epsilon) used in the convergence criterion
-\item[{\tt moi}:] (default:\mbox{ }{\tt 100})
-Maximum number of outer (Fisher scoring) iterations
-\item[{\tt mii}:] (default:\mbox{ }{\tt 0})
-Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum
-limit provided
-\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-see read/write functions in SystemML Language Reference for details.
-\end{Description}
-
-
- \smallskip
- \noindent{\bf Usage: Cox Prediction}
- \smallskip
- 
- {\hangindent=\parindent\noindent\it%
- 	{\tt{}-f }path/\/{\tt{}Cox-predict.dml}
- 	{\tt{} -nvargs}
- 	{\tt{} X=}path/file
- 	{\tt{} RT=}path/file
- 	{\tt{} M=}path/file
- 	{\tt{} Y=}path/file
- 	{\tt{} COV=}path/file
- 	{\tt{} MF=}path/file
- 	{\tt{} P=}path/file
- 	{\tt{} fmt=}format
- 	
- }\smallskip
- 
-% \noindent{\bf Arguments --- Prediction}
-% \begin{Description}
-% 	\item[{\tt X}:]
-%	Location (on HDFS) to read the input matrix of the survival data sorted by the timestamps including: 
-%	\begin{Itemize}
-%		\item timestamps,
-%		\item whether event occurred (1) or data is censored (0),
-%		\item feature vectors
-%	\end{Itemize}
-% 	\item[{\tt RT}:]
-% 	Location to read column matrix $RT$ containing the (order preserving) recoded timestamps from X (output by {\tt Cox.dml})
-% 	\item[{\tt M}:]
-% 	Location to read matrix $M$ containing the fitted Cox model (see below for the schema) 
-% 	\item[{\tt Y}:]
-%	Location to the read matrix used for prediction    
-% 	\item[{\tt COV}:] 
-% 	Location to read the variance-covariance matrix of the regression coefficients (output by {\tt Cox.dml})
-% 	\item[{\tt MF}] 
-% 	Location to store column indices of $X$ excluding the baseline factors if available (output by {\tt Cox.dml})
-% 	\item[{\tt P}] 
-% 	Location to store matrix $P$ containing the results of prediction
-% 	\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-% 	Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv}
-% \end{Description}
- 
-
-
-\noindent{\bf Details}
-\smallskip
-
- 
-In the Cox PH regression model, the relationship between the hazard function---i.e., the probability of event occurrence at a given time---and the covariates is described as
-\begin{equation}
-h_i(t)=h_0(t)\exp\Bigl\{ \sum_{j=1}^{p} \beta_jx_{ij} \Bigr\}, \label{eq:coxph}
-\end{equation} 
-where the hazard function for the $i$th individual ($i\in\{1,2,\ldots,n\}$) depends on a set of $p$ covariates $x_i=(x_{i1},x_{i2},\ldots,x_{ip})$, whose importance is measured by the magnitude of the corresponding coefficients 
-$\beta=(\beta_1,\beta_2,\ldots,\beta_p)$. The term $h_0(t)$ is the baseline hazard and is related to a hazard value if all covariates equal 0. 
-In the Cox PH model the hazard function for the individuals may vary over time, however the baseline hazard is estimated non-parametrically and can take any form.
-Note that re-writing~(\ref{eq:coxph}) we have 
-\begin{equation*}
-\log\biggl\{ \frac{h_i(t)}{h_0(t)} \biggr\} = \sum_{j=1}^{p} \beta_jx_{ij}.
-\end{equation*}
-Thus, the Cox PH model is essentially a linear model for the logarithm of the hazard ratio and the hazard of event for any individual is a constant multiple of the hazard of any other. 
-%Consequently, the Cox model is a proportional hazard model.
-We follow similar notation and methodology as in~\cite[Sec.~3]{collett2003:kaplanmeier}.
-For completeness we briefly discuss the equations used in our implementation.
-
-
-\textbf{Factors in the model.} 
-Note that if some of the feature variables are factors they need to be {\it dummy coded} as follows. 
-Let $\alpha$ be such a variable (i.e., a factor) with $a$ levels. 
-We introduce $a-1$ indicator (or dummy coded) variables $X_2,X_3\ldots,X_a$ with $X_j=1$ if $\alpha=j$ and 0 otherwise, for $j\in\{ 2,3,\ldots,a\}$.
-In particular, one of $a$ levels of $\alpha$ will be considered as the baseline and is not included in the model.
-In our implementation, the user can specify a baseline level for each factor (since selecting the baseline level for each factor is arbitrary). 
-On the other hand, if for a given factor $\alpha$ no baseline is specified by the user, the most frequent level of $\alpha$ will be considered as the baseline.   
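As a small illustration of this convention (not the {\tt Cox.dml} code), dummy coding a single factor column with the baseline level dropped could look as follows in Python/NumPy; the function name and the most-frequent-level default are stated assumptions.
\begin{verbatim}
import numpy as np

# Dummy code one categorical column, dropping the baseline level.
def dummy_code(col, baseline=None):
    levels, counts = np.unique(col, return_counts=True)
    if baseline is None:
        baseline = levels[np.argmax(counts)]   # default: most frequent level
    kept = [l for l in levels if l != baseline]
    indicators = np.column_stack([(col == l).astype(float) for l in kept])
    return indicators, kept
\end{verbatim}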
-
-
-\textbf{Fitting the model.}
-We estimate the coefficients of the Cox model via the negative log-likelihood method.
-In particular the Cox PH model is fitted by using trust region Newton method with conjugate gradient~\cite{Nocedal2006:Optimization}.
-%The likelihood for the PH hazard model is given by
-%\begin{equation*}
-%\prod_{i=1}^{n} {\Bigg\{ \frac{\exp(\vec{\beta}^\top\vec{x_i})}{\sum_{l\in %R(t_i)\exp(\vec{\beta}\vec{x}_l)}} \Biggr\}}^\delta_i,
-%\end{equation*}
-%where $\delta_i$ is an event indicator, which is 0 if the $i$th survival time is censored or 1 otherwise, and $R(t_i)$ is the risk set defined as the set of individuals who die at time $t_i$ or later.
-Define the risk set $R(t_j)$ at time $t_j$ to be the set of individuals who die at time $t_j$ or later. 
-The PH model assumes that survival times are distinct. In order to handle tied observations
-we use the \emph{Breslow} approximation of the likelihood function
-\begin{equation*}
-\mathcal{L}=\prod_{j=1}^{r} \frac{\exp(\beta^\top s_j)}{{\bigg\{ \sum_{l\in R(t_j)} \exp(\beta^\top x_l) \biggr\}}^{d_j}},
-\end{equation*}
-where $d_j$ is the number of individuals who die at time $t_j$ and $s_j$ denotes the element-wise sum of the covariates for those individuals who die at time $t_j$, $j=1,2,\ldots,r$, i.e.,
-the $h$th element of $s_j$ is given by $s_{hj}=\sum_{k=1}^{d_j}x_{hjk}$, where $x_{hjk}$ is the value of the $h$th variable ($h\in \{1,2,\ldots,p\}$) for the $k$th of the $d_j$ individuals ($k\in\{ 1,2,\ldots,d_j \}$) who die at the $j$th death time ($j\in\{ 1,2,\ldots,r \}$).  
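For reference, the negative logarithm of the Breslow-approximated likelihood above can be evaluated directly; the sketch below (Python/NumPy, illustrative only) computes it for given coefficients, whereas {\tt Cox.dml} minimizes it with the trust region Newton method with conjugate gradient mentioned above.
\begin{verbatim}
import numpy as np

# Negative log of the Breslow partial likelihood.
# time: observed times; event: 1 if the death was observed, 0 if censored.
def breslow_neg_log_likelihood(beta, X, time, event):
    eta = X @ beta
    nll = 0.0
    for t in np.unique(time[event == 1]):    # distinct death times t_j
        dead = (time == t) & (event == 1)    # the d_j deaths at t_j
        risk = time >= t                     # risk set R(t_j)
        nll += -eta[dead].sum() + dead.sum() * np.log(np.exp(eta[risk]).sum())
    return nll
\end{verbatim}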
-
-\textbf{Standard error and confidence interval for coefficients.}
-Note that the variance-covariance matrix of the estimated coefficients $\hat{\beta}$ can be approximated by the inverse of the Hessian evaluated at $\hat{\beta}$. The square roots of the diagonal elements of this matrix are the standard errors of the estimated coefficients.  
-Once the standard errors of the coefficients $se(\hat{\beta})$ are obtained, we can compute a $100(1-\alpha)\%$ confidence interval using $\hat{\beta}\pm z_{\alpha/2}se(\hat{\beta})$, where $z_{\alpha/2}$ is the upper $\alpha/2$-point of the standard normal distribution.
-In {\tt Cox.dml}, we utilize the built-in function {\tt inv()} to compute the inverse of the Hessian. Note that this built-in function can be used only if the Hessian fits in the main memory of a single machine.   
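The computation of standard errors and confidence intervals from the inverse Hessian is simple enough to sketch directly (Python/NumPy/SciPy, illustrative only; the Hessian is assumed to be that of the negative log-likelihood at $\hat{\beta}$ and to fit in memory).
\begin{verbatim}
import numpy as np
from scipy.stats import norm

# Standard errors and 100(1-alpha)% confidence intervals for the coefficients.
def coefficient_intervals(beta_hat, hessian, alpha=0.05):
    cov = np.linalg.inv(hessian)       # approximate variance-covariance matrix
    se = np.sqrt(np.diag(cov))
    z = norm.ppf(1.0 - alpha / 2.0)    # upper alpha/2 point of N(0,1)
    return se, beta_hat - z * se, beta_hat + z * se
\end{verbatim}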
-
-
-\textbf{Wald test, likelihood ratio test, and log-rank test.}
-In order to test the {\it null hypothesis} that all of the coefficients $\beta_j$s are 0, our implementation provides three statistical tests: the {\it Wald test}, the {\it likelihood ratio test}, and the {\it log-rank test} (also known as the {\it score test}). 
-Let $p$ be the number of coefficients.
-The Wald test is based on the test statistic ${\hat{\beta}}^2/{se(\hat{\beta})}^2$, which is compared to percentage points of the Chi-squared distribution to obtain the $P$-value.
-The likelihood ratio test relies on the test statistic $-2\log\{ {L}(\textbf{0})/{L}(\hat{\beta}) \}$ ($\textbf{0}$ denotes a zero vector of size $p$ ) which has an approximate Chi-squared distribution with $p$ degrees of freedom under the null hypothesis that all $\beta_j$s are 0.
-The Log-rank test is based on the test statistic 
-$l=\nabla^\top L(\textbf{0}) {\mathcal{H}}^{-1}(\textbf{0}) \nabla L(\textbf{0})$, 
-where $\nabla L(\textbf{0})$ is the gradient of $L$ and $\mathcal{H}(\textbf{0})$ is the Hessian of $L$ evaluated at \textbf{0}. Under the null hypothesis that $\beta=\textbf{0}$, $l$ has a Chi-squared distribution on $p$ degrees of freedom.
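Each of the three statistics is compared against a Chi-squared distribution with the appropriate degrees of freedom to obtain its $P$-value; as a rough sketch (SciPy, illustrative only, with the degrees of freedom supplied by the caller):
\begin{verbatim}
from scipy.stats import chi2

# P-value of a test statistic under a Chi-squared(df) null distribution.
def chi_squared_p_value(statistic, df):
    return chi2.sf(statistic, df)
\end{verbatim}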
-
-
-% Scoring
-\textbf{Prediction.}
-Once the parameters of the model are fitted, we compute the following predictions together with their standard errors
-\begin{itemize}
-	\item linear predictors,
-	\item risk, and
-	\item estimated cumulative hazard. 
-\end{itemize}
-Given feature vector $X_i$ for individual $i$, we obtain the above predictions at time $t$ as follows.
-The linear predictors (denoted as $\mathcal{LP}$) as well as the risk (denoted as $\mathcal{R}$) are computed relative to a baseline whose feature values are the mean of the values in the corresponding features.
-Let $X_i^\text{rel} = X_i - \mu$, where $\mu$ is a row vector that contains the mean values for each feature.  
-We have  $\mathcal{LP}=X_i^\text{rel} \hat{\beta}$ and $\mathcal{R}=\exp\{ X_i^\text{rel}\hat{\beta} \}$.
-The standard errors of the linear predictors $se\{\mathcal{LP} \}$ are computed as the square root of ${(X_i^\text{rel})}^\top V(\hat{\beta}) X_i^\text{rel}$ and the standard error of the risk $se\{ \mathcal{R} \}$ are given by the square root of 
-${(X_i^\text{rel} \odot \mathcal{R})}^\top V(\hat{\beta}) (X_i^\text{rel} \odot \mathcal{R})$, where $V(\hat{\beta})$ is the variance-covariance matrix of the coefficients and $\odot$ is the element-wise multiplication.     
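A direct transcription of these formulas for a single individual might look as follows (Python/NumPy, illustrative only; here {\tt x} is the individual's feature row, {\tt mu} the vector of feature means, and {\tt V} the variance-covariance matrix of the coefficients).
\begin{verbatim}
import numpy as np

# Linear predictor, risk, and their standard errors relative to the
# mean-valued baseline described above.
def linear_predictor_and_risk(x, mu, beta_hat, V):
    x_rel = x - mu
    lp = x_rel @ beta_hat
    risk = np.exp(lp)
    se_lp = np.sqrt(x_rel @ V @ x_rel)
    se_risk = np.sqrt((x_rel * risk) @ V @ (x_rel * risk))
    return lp, risk, se_lp, se_risk
\end{verbatim}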
-
-We estimate the cumulative hazard function for individual $i$ by
-\begin{equation*}
-\hat{H}_i(t) = \exp(\hat{\beta}^\top X_i) \hat{H}_0(t), 
-\end{equation*}
-where $\hat{H}_0(t)$ is the \emph{Breslow estimate} of the cumulative baseline hazard given by
-\begin{equation*}
-\hat{H}_0(t) = \sum_{j=1}^{k} \frac{d_j}{\sum_{l\in R(t_{(j)})} \exp(\hat{\beta}^\top X_l)}.
-\end{equation*}
-In the equation above, as before, $d_j$ is the number of deaths, and $R(t_{(j)})$ is the risk set at time $t_{(j)}$, for $t_{(k)} \leq t \leq t_{(k+1)}$, $k=1,2,\ldots,r-1$.
-The standard error of $\hat{H}_i(t)$ is obtained using the estimation
-\begin{equation*}
-se\{ \hat{H}_i(t) \} = \sum_{j=1}^{k} \frac{d_j}{ {\left[ \sum_{l\in R(t_{(j)})} \exp(X_l\hat{\beta}) \right]}^2 } + J_i^\top(t) V(\hat{\beta}) J_i(t),
-\end{equation*}
-where 
-\begin{equation*}
-J_i(t) = \sum_{j-1}^{k} d_j \frac{\sum_{l\in R(t_{(j)})} (X_l-X_i)\exp \{ (X_l-X_i)\hat{\beta} \}}{ {\left[ \sum_{l\in R(t_{(j)})} \exp\{(X_l-X_i)\hat{\beta}\} \right]}^2  },
-\end{equation*}
-for $t_{(k)} \leq t \leq t_{(k+1)}$, $k=1,2,\ldots,r-1$. 
-
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-  
-Below we list the results of fitting a Cox regression model, stored in matrix {\tt M} with the following schema:
-\begin{itemize}
-	\item Column 1: estimated regression coefficients $\hat{\beta}$
-	\item Column 2: $\exp(\hat{\beta})$
-	\item Column 3: standard error of the estimated coefficients $se\{\hat{\beta}\}$
-	\item Column 4: ratio of $\hat{\beta}$ to $se\{\hat{\beta}\}$ denoted by $Z$  
-	\item Column 5: $P$-value of $Z$ 
-	\item Column 6: lower bound of $100(1-\alpha)\%$ confidence interval for $\hat{\beta}$
-	\item Column 7: upper bound of $100(1-\alpha)\%$ confidence interval for $\hat{\beta}$.
-\end{itemize}
-Note that above $Z$ is the Wald test statistic which is asymptotically standard normal under the hypothesis that $\beta=\textbf{0}$.
-
-Moreover, {\tt Cox.dml} outputs two log files {\tt S} and {\tt T} containing summary statistics of the fitted model, as follows.
-File {\tt S} stores the following information 
-\begin{itemize}
-	\item Line 1: total number of observations
-	\item Line 2: total number of events
-	\item Line 3: log-likelihood (of the fitted model)
-	\item Line 4: AIC
-	\item Line 5: Cox \& Snell Rsquare
-	\item Line 6: maximum possible Rsquare. 
-\end{itemize}
-Above, the AIC is computed as in (\ref{eq:AIC}),
-the Cox \& Snell Rsquare is equal to $1-\exp\{ -l/n \}$, where $l$ is the log-rank test statistic as discussed above and $n$ is the total number of observations,
-and the maximum possible Rsquare is computed as $1-\exp\{ -2 L(\textbf{0})/n \}$, where $L(\textbf{0})$ denotes the initial likelihood. 
-
-
-File {\tt T} contains the following information
-\begin{itemize}
-	\item Line 1: Likelihood ratio test statistic, degree of freedom of the corresponding Chi-squared distribution, $P$-value
-	\item Line 2: Wald test statistic, degree of freedom of the corresponding Chi-squared distribution, $P$-value
-	\item Line 3: Score (log-rank) test statistic, degree of freedom of the corresponding Chi-squared distribution, $P$-value.
-\end{itemize}
-
-Additionally, the following matrices will be stored. Note that these matrices are required for prediction.
-\begin{itemize}
-	 \item Order-preserving recoded timestamps $RT$, i.e., contiguously numbered from 1 $\ldots$ \#timestamps
-	 \item Feature matrix ordered by the timestamps $XO$
-	 \item Variance-covariance matrix of the coefficients $COV$
-	 \item Column indices of the feature matrix with baseline factors removed (if available) $MF$.  
-\end{itemize}
-
-
-\textbf{Prediction.}
-Finally, the results of prediction are stored in matrix $P$ with the following schema:
-\begin{itemize}
-	\item Column 1: linear predictors
-	\item Column 2: standard error of the linear predictors
-	\item Column 3: risk
-	\item Column 4: standard error of the risk
-	\item Column 5: estimated cumulative hazard
-	\item Column 6: standard error of the estimated cumulative hazard.
-\end{itemize}
-
-
-
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-	\hml -f Cox.dml -nvargs X=/user/biadmin/X.mtx TE=/user/biadmin/TE
-	F=/user/biadmin/F R=/user/biadmin/R M=/user/biadmin/model.csv
-	T=/user/biadmin/test.csv COV=/user/biadmin/var-covar.csv XO=/user/biadmin/X-sorted.mtx fmt=csv
-	
-}\smallskip
-
-{\hangindent=\parindent\noindent\tt
-	\hml -f Cox.dml -nvargs X=/user/biadmin/X.mtx TE=/user/biadmin/TE
-	F=/user/biadmin/F R=/user/biadmin/R M=/user/biadmin/model.csv
-	T=/user/biadmin/test.csv COV=/user/biadmin/var-covar.csv 
-	RT=/user/biadmin/recoded-timestamps.csv XO=/user/biadmin/X-sorted.csv 
-	MF=/user/biadmin/baseline.csv alpha=0.01 tol=0.000001 moi=100 mii=20 fmt=csv
-	
-}\smallskip
-
-\noindent To compute predictions:
-
-{\hangindent=\parindent\noindent\tt
-	\hml -f Cox-predict.dml -nvargs X=/user/biadmin/X-sorted.mtx 
-	RT=/user/biadmin/recoded-timestamps.csv
-	M=/user/biadmin/model.csv Y=/user/biadmin/Y.mtx COV=/user/biadmin/var-covar.csv 
-	MF=/user/biadmin/baseline.csv P=/user/biadmin/predictions.csv fmt=csv
-	
-}
-
-

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/DecisionTrees.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/DecisionTrees.tex b/Algorithms Reference/DecisionTrees.tex
deleted file mode 100644
index cea26a4..0000000
--- a/Algorithms Reference/DecisionTrees.tex	
+++ /dev/null
@@ -1,312 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Decision Trees}
-\label{sec:decision_trees}
-
-\noindent{\bf Description}
-\smallskip
-
-
-A decision tree (for classification) is a classifier that is considered
-more interpretable than other statistical classifiers. This implementation
-is well-suited to handle large-scale data and builds a (binary) decision 
-tree in parallel.\\
-
-\smallskip
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-	{\tt{}-f }path/\/{\tt{}decision-tree.dml}
-	{\tt{} -nvargs}
-	{\tt{} X=}path/file
-	{\tt{} Y=}path/file
-	{\tt{} R=}path/file
-	{\tt{} bins=}integer
-	{\tt{} depth=}integer
-	{\tt{} num\_leaf=}integer
-	{\tt{} num\_samples=}integer
-	{\tt{} impurity=}Gini$\mid$entropy
-	{\tt{} M=}path/file
-	{\tt{} O=}path/file
-	{\tt{} S\_map=}path/file
-	{\tt{} C\_map=}path/file
-	{\tt{} fmt=}format
-	
-}
-
- \smallskip
- \noindent{\bf Usage: Prediction}
- \smallskip
- 
- {\hangindent=\parindent\noindent\it%
- 	{\tt{}-f }path/\/{\tt{}decision-tree-predict.dml}
- 	{\tt{} -nvargs}
- 	{\tt{} X=}path/file
- 	{\tt{} Y=}path/file
- 	{\tt{} R=}path/file
- 	{\tt{} M=}path/file
- 	{\tt{} P=}path/file
- 	{\tt{} A=}path/file
- 	{\tt{} CM=}path/file
- 	{\tt{} fmt=}format
- 	
- }\smallskip
- 
- 
-\noindent{\bf Arguments}
-\begin{Description}
-	\item[{\tt X}:]
-	Location (on HDFS) to read the matrix of feature vectors; 
-	each row constitutes one feature vector. Note that categorical features in $X$ need to be both recoded and dummy coded.
-	\item[{\tt Y}:]
-	Location (on HDFS) to read the matrix of (categorical) 
-	labels that correspond to feature vectors in $X$. Note that class labels are assumed to be both recoded and dummy coded. 
-	This argument is optional for prediction. 
-	\item[{\tt R}:] (default:\mbox{ }{\tt " "})
-	Location (on HDFS) to read matrix $R$ which for each feature in $X$ contains column-ids (first column), start indices (second column), and end indices (third column).
-	If $R$ is not provided, by default all features are assumed to be continuous-valued.   
-	\item[{\tt bins}:] (default:\mbox{ }{\tt 20})
-	Number of thresholds to choose for each continuous-valued feature (determined by equi-height binning). 
-	\item[{\tt depth}:] (default:\mbox{ }{\tt 25})
-	Maximum depth of the learned tree
-	\item[{\tt num\_leaf}:] (default:\mbox{ }{\tt 10})
-	Parameter that controls pruning. The tree
-	is not expanded if a node receives less than {\tt num\_leaf} training examples.
-	\item[{\tt num\_samples}:] (default:\mbox{ }{\tt 3000})
-	Parameter that decides when to switch to in-memory building of subtrees. If a node $v$ receives less than {\tt num\_samples}
-	training examples then this implementation switches to an in-memory subtree
-	building procedure to build the subtree under $v$ in its entirety.
-	\item[{\tt impurity}:] (default:\mbox{ }{\tt "Gini"})
-	Impurity measure used at internal nodes of the tree for selecting which features to split on. Possible values are entropy or Gini.
-	\item[{\tt M}:] 
-	Location (on HDFS) to write matrix $M$ containing the learned decision tree (see below for the schema) 
-	\item[{\tt O}:] (default:\mbox{ }{\tt " "})
-	Location (on HDFS) to store the training accuracy (\%). Note that this argument is optional.
-	\item[{\tt A}:] (default:\mbox{ }{\tt " "})
-	Location (on HDFS) to store the testing accuracy (\%) from a 
-	held-out test set during prediction. Note that this argument is optional.
-	\item[{\tt P}:] 
-	Location (on HDFS) to store predictions for a held-out test set
-	\item[{\tt CM}:] (default:\mbox{ }{\tt " "})
-	Location (on HDFS) to store the confusion matrix computed using a held-out test set. Note that this argument is optional.
-	\item[{\tt S\_map}:] (default:\mbox{ }{\tt " "})
-	Location (on HDFS) to write the mappings from the continuous-valued feature-ids to the global feature-ids in $X$ (see below for details). Note that this argument is optional.
-	\item[{\tt C\_map}:] (default:\mbox{ }{\tt " "})
-	Location (on HDFS) to write the mappings from the categorical feature-ids to the global feature-ids in $X$ (see below for details). Note that this argument is optional.
-	\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-	Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-	see read/write functions in SystemML Language Reference for details.
-\end{Description}
-
-
- \noindent{\bf Details}
- \smallskip
-
- 
-Decision trees~\cite{BreimanFOS84:dtree} are simple models of
-classification that,  due to their structure,  are easy to
-interpret. Given an example feature vector, each node in the learned
-tree runs a simple test on it. Based on the result of the test, the
-example is either diverted to the left subtree or to the right
-subtree. Once the example reaches a leaf, then the label stored at the
-leaf is returned as the prediction for the example.
-
-
-Building a decision tree from a fully labeled training set entails
-choosing appropriate splitting tests for each internal node in the tree and this is usually performed in a top-down manner. 
-The splitting test (denoted by $s$) requires
-first choosing a feature $j$ and depending on the type of $j$, either
-a threshold $\sigma$, in case $j$ is continuous-valued, or a subset of
-values $S \subseteq \text{Dom}(j)$ where $\text{Dom}(j)$ denotes
-domain of $j$, in case it is categorical. For continuous-valued
-features the test is thus of form $x_j < \sigma$ and for categorical
-features it is of form $x_j \in S$, where $x_j$ denotes the $j$th
-feature value of feature vector $x$. One way to determine which test
-to include, is to compare impurities of the tree nodes induced by the test.
-The {\it node impurity} measures the homogeneity of the labels at the node. This implementation supports two commonly used impurity measures (denoted by $\mathcal{I}$): {\it Entropy} $\mathcal{E}=\sum_{i=1}^{C}-f_i \log f_i$, as well as {\it Gini impurity} $\mathcal{G}=\sum_{i=1}^{C}f_i (1-f_i)$, where $C$ denotes the number of unique labels and $f_i$ is the frequency of label $i$.
-Once the impurity at the tree nodes has been obtained, the {\it best split} is chosen from a set of possible splits that maximizes the {\it information gain} at the node, i.e., $\arg\max_{s}\mathcal{IG}(X,s)$, where $\mathcal{IG}(X,s)$ denotes the information gain when the splitting test $s$ partitions the feature matrix $X$. 
-Assuming that $s$ partitions $X$ that contains $N$ feature vectors into $X_\text{left}$ and $X_\text{right}$ each including $N_\text{left}$ and $N_\text{right}$ feature vectors, respectively, $\mathcal{IG}(X,s)$ is given by 
-\begin{equation*}
-\mathcal{IG}(X,s)=\mathcal{I}(X)-\frac{N_\text{left}}{N}\mathcal{I}(X_\text{left})-\frac{N_\text{right}}{N}\mathcal{I}(X_\text{right}),
-\end{equation*}
-where $\mathcal{I}\in\{\mathcal{E},\mathcal{G}\}$.
-In the following we discuss the implementation details specific to {\tt decision-tree.dml}. 
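The two impurity measures and the information gain criterion are easy to state in code; the sketch below (Python/NumPy, with labels and the split mask as NumPy arrays) is purely illustrative and is not the {\tt decision-tree.dml} implementation.
\begin{verbatim}
import numpy as np

# Entropy or Gini impurity of a vector of class labels.
def impurity(labels, measure="Gini"):
    _, counts = np.unique(labels, return_counts=True)
    f = counts / counts.sum()
    if measure == "entropy":
        return float(-np.sum(f * np.log(f)))
    return float(np.sum(f * (1.0 - f)))

# Information gain of a split sending `left` (boolean mask) to the left child.
def information_gain(labels, left, measure="Gini"):
    n, n_left = len(labels), int(left.sum())
    return (impurity(labels, measure)
            - n_left / n * impurity(labels[left], measure)
            - (n - n_left) / n * impurity(labels[~left], measure))
\end{verbatim}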
-
-
-\textbf{Input format.} 
-In general, implementations of the decision tree algorithm do not require categorical features to be dummy coded. For improved efficiency and reduced training time, our implementation however assumes dummy coded categorical features and dummy coded class labels.  
-
-
-\textbf{Tree construction.}
-Learning a decision tree on large-scale data has received some
-attention in the literature. The current implementation includes logic
-for choosing tests for multiple nodes that belong to the same level in
-the decision tree in parallel (breadth-first expansion) and for
-building entire subtrees under multiple nodes in parallel (depth-first
-subtree building). Empirically it has been demonstrated that it is
-advantageous to perform breadth-first expansion for the nodes
-belonging to the top levels of the tree and to perform depth-first
-subtree building for nodes belonging to the lower levels of the tree~\cite{PandaHBB09:dtree}. The parameter {\tt num\_samples} controls when we
-switch to depth-first subtree building. For any node in the decision tree
-that receives $\leq$ {\tt num\_samples} training examples, the subtree
-under it is built in its entirety in one shot.
-
-
-\textbf{Stopping rule and pruning.} 
-The splitting of data at the internal nodes stops when at least one of the following criteria is satisfied:
-\begin{itemize}
-	\item the depth of the internal node reaches the input parameter {\tt depth} controlling the maximum depth of the learned tree, or
-	\item no candidate split achieves information gain.
-\end{itemize}
-This implementation also allows for some automated pruning via the argument {\tt num\_leaf}. If
-a node receives $\leq$ {\tt num\_leaf} training examples, then a leaf
-is built in its place.
-
-
-\textbf{Continuous-valued features.}
-For a continuous-valued feature
-$j$ the number of candidate thresholds $\sigma$ to choose from is of
-the order of the number of examples present in the training set. Since
-for large-scale data this can result in a large number of candidate
-thresholds, the user can limit this number via the argument {\tt bins}, which controls the number of candidate thresholds considered
-for each continuous-valued feature. For each continuous-valued
-feature, the implementation computes an equi-height histogram to
-generate one candidate threshold per equi-height bin.
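For example, the candidate thresholds for one continuous-valued feature can be generated from quantile (equi-height) bin boundaries roughly as follows (Python/NumPy, illustrative only; the exact binning in {\tt decision-tree.dml} may differ in detail).
\begin{verbatim}
import numpy as np

# One candidate threshold per interior equi-height bin boundary.
def candidate_thresholds(values, bins=20):
    qs = np.linspace(0.0, 1.0, bins + 1)[1:-1]   # interior boundaries
    return np.unique(np.quantile(values, qs))
\end{verbatim}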
-
-
-\textbf{Categorical features.}
-In order to determine the best value subset to split on in the case of categorical features, this implementation greedily includes values from the feature's domain until the information gain stops improving.
-In particular, for a categorical feature $j$ the $|Dom(j)|$ feature values are sorted by impurity and the resulting $|Dom(j)|-1$ split candidates are examined; the sequence of feature values that results in the maximum information gain is then selected.
-
-
-\textbf{Description of the model.} 
-The learned decision tree is represented in a matrix $M$ that
-contains at least 6 rows. Each column in the matrix contains the parameters relevant to a single node in the tree. 
-Note that for building the tree model, our implementation splits the feature matrix $X$ into $X_\text{cont}$ containing continuous-valued features and $X_\text{cat}$ containing categorical features. In the following, the continuous-valued (resp. categorical) feature-ids correspond to the indices of the features in $X_\text{cont}$ (resp. $X_\text{cat}$). 
-Moreover, we refer to an internal node as a continuous-valued (categorical) node if the feature that this node looks at is continuous-valued (categorical).
-Below is a description of what each row in the matrix contains.
-\begin{itemize}
-\item Row 1: stores the node-ids. These ids correspond to the node-ids in a complete binary tree.
-\item Row 2: for internal nodes stores the offsets (the number of columns) in $M$ to the left child, and otherwise 0.
-\item Row 3: stores the feature index of the feature (id of a continuous-valued feature in $X_\text{cont}$ if the feature is continuous-valued or id of a categorical feature in $X_\text{cat}$ if the feature is categorical) that this node looks at if the node is an internal node, otherwise 0. 
-\item Row 4: stores the type of the feature that this node looks at if the node is an internal node: 1 for continuous-valued and 2 for categorical features, 
-otherwise the label this leaf node is supposed to predict.
-\item Row 5: for the internal nodes contains 1 if the feature chosen for the node is continuous-valued, or the size of the subset of values used for splitting at the node stored in rows 6,7,$\ldots$ if the feature chosen for the node is categorical. For the leaf nodes, Row 5 contains the number of misclassified training examples reaching this node. 
-\item Row 6,7,$\ldots$: for the internal nodes, row 6 stores the threshold to which the example's feature value is compared if the feature chosen for this node is continuous-valued, otherwise if the feature chosen for this node is categorical rows 6,7,$\ldots$ store the value subset chosen for the node.
-For the leaf nodes, row 6 contains 1 if the node is impure and the number of training examples at the node is greater than {\tt num\_leaf}, otherwise 0. 	
-\end{itemize}
-As an example, Figure~\ref{dtree} shows a decision tree with $5$ nodes and its matrix
-representation.
-
-\begin{figure}
-\begin{minipage}{0.3\linewidth}
-\begin{center}
-\begin{tikzpicture}
-\node (labelleft) [draw,shape=circle,minimum size=16pt] at (2,0) {$2$};
-\node (labelright) [draw,shape=circle,minimum size=16pt] at (3.25,0) {$1$};
-
-\node (rootleft) [draw,shape=rectangle,minimum size=16pt] at (2.5,1) {$x_5 \in \{2,3\}$};
-\node (rootlabel) [draw,shape=circle,minimum size=16pt] at (0.9,1) {$1$};
-\node (root) [draw,shape=rectangle,minimum size=16pt] at (1.75,2) {$x_3 < 0.45$};
-
-\draw[-latex] (root) -- (rootleft);
-\draw[-latex] (root) -- (rootlabel);
-\draw[-latex] (rootleft) -- (labelleft);
-\draw[-latex] (rootleft) -- (labelright);
-
-\end{tikzpicture}
-\end{center}
-\begin{center}
-(a)
-\end{center}
-\end{minipage}
-\hfill
-\begin{minipage}{0.65\linewidth}
-\begin{center}
-\begin{tabular}{c|c|c|c|c|c|}
-& Col 1 & Col 2 & Col 3 & Col 4 & Col 5\\
-\hline
-Row 1 & 1 & 2 & 3 & 6 & 7 \\
-\hline
-Row 2 & 1 & 0 & 1 & 0 & 0 \\
-\hline
-Row 3 & 3 & 5 & 0 & 0 & 0 \\
-\hline
-Row 4 & 1 & 1 & 2 & 2 & 1 \\
-\hline
-Row 5 & 1 & 0 & 2 & 0 & 0 \\
-\hline
-Row 6 & 0.45 & 0 & 2 & 0 & 0 \\
-\hline
-Row 7 &  &  & 3 &  & \\
-\hline
-\end{tabular}
-\end{center}
-\begin{center}
-(b)
-\end{center}
-\end{minipage}
-\caption{(a) An example tree and its (b) matrix representation. $x$ denotes an example and $x_j$ is the value of the $j$th continuous-valued (resp. categorical) feature in $X_\text{cont}$ (resp. $X_\text{cat}$). In this example all leaf nodes are pure and no training example is misclassified.}
-\label{dtree}
-\end{figure}
-
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-
-The matrix corresponding to the learned model as well as the training accuracy (if requested) is written to a file in the format specified. See
-details where the structure of the model matrix is described.
-Recall that in our implementation $X$ is split into $X_\text{cont}$ and $X_\text{cat}$. If requested, the mappings of the continuous-valued feature-ids in $X_\text{cont}$ (stored at {\tt S\_map}) and the categorical feature-ids in $X_\text{cat}$ (stored at {\tt C\_map}) to the global feature-ids in $X$ will be provided. 
-Depending on what arguments are provided during
-invocation, the {\tt decision-tree-predict.dml} script may compute one or more of predictions, accuracy and confusion matrix in the requested output format. 
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-	\hml -f decision-tree.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
-	R=/user/biadmin/R.csv M=/user/biadmin/model.csv
-	bins=20 depth=25 num\_leaf=10 num\_samples=3000 impurity=Gini fmt=csv
-	
-}\smallskip
-
-
-\noindent To compute predictions:
-
-{\hangindent=\parindent\noindent\tt
-	\hml -f decision-tree-predict.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx R=/user/biadmin/R.csv
-	M=/user/biadmin/model.csv  P=/user/biadmin/predictions.csv
-	A=/user/biadmin/accuracy.csv CM=/user/biadmin/confusion.csv fmt=csv
-	
-}\smallskip
-
-
-%\noindent{\bf References}
-%
-%\begin{itemize}
-%\item B. Panda, J. Herbach, S. Basu, and R. Bayardo. \newblock{PLANET: massively parallel learning of tree ensembles with MapReduce}. In Proceedings of the VLDB Endowment, 2009.
-%\item L. Breiman, J. Friedman, R. Olshen, and C. Stone. \newblock{Classification and Regression Trees}. Wadsworth and Brooks, 1984.
-%\end{itemize}


[50/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1347] Accept SparkSession in Java/Scala MLContext API

Posted by de...@apache.org.
[SYSTEMML-1347] Accept SparkSession in Java/Scala MLContext API

Add MLContext constructor for SparkSession.
In MLContext, store SparkSession reference instead of SparkContext.
Remove unused monitoring parameter in MLContext.
Simplifications in MLContextUtil and MLContextConversionUtil.
Method for creating SparkSession in AutomatedTestBase.
Update tests for SparkSession.
Add MLContext SparkSession constructor to MLContext guide.

Closes #405.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/b91f9bfe
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/b91f9bfe
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/b91f9bfe

Branch: refs/heads/gh-pages
Commit: b91f9bfec2a4c7e009ff84e540fab150398078ee
Parents: d1fa154
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Fri Apr 7 10:35:55 2017 -0700
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Fri Apr 7 10:35:55 2017 -0700

----------------------------------------------------------------------
 spark-mlcontext-programming-guide.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/b91f9bfe/spark-mlcontext-programming-guide.md
----------------------------------------------------------------------
diff --git a/spark-mlcontext-programming-guide.md b/spark-mlcontext-programming-guide.md
index c28eaf5..3b7bfc8 100644
--- a/spark-mlcontext-programming-guide.md
+++ b/spark-mlcontext-programming-guide.md
@@ -47,10 +47,10 @@ spark-shell --executor-memory 4G --driver-memory 4G --jars SystemML.jar
 
 ## Create MLContext
 
-All primary classes that a user interacts with are located in the `org.apache.sysml.api.mlcontext package`.
-For convenience, we can additionally add a static import of ScriptFactory to shorten the syntax for creating Script objects.
-An `MLContext` object can be created by passing its constructor a reference to the `SparkContext`. If successful, you
-should see a "`Welcome to Apache SystemML!`" message.
+All primary classes that a user interacts with are located in the `org.apache.sysml.api.mlcontext` package.
+For convenience, we can additionally add a static import of `ScriptFactory` to shorten the syntax for creating `Script` objects.
+An `MLContext` object can be created by passing its constructor a reference to the `SparkSession` (`spark`) or `SparkContext` (`sc`).
+If successful, you should see a "`Welcome to Apache SystemML!`" message.
 
 <div class="codetabs">
 
@@ -58,7 +58,7 @@ should see a "`Welcome to Apache SystemML!`" message.
 {% highlight scala %}
 import org.apache.sysml.api.mlcontext._
 import org.apache.sysml.api.mlcontext.ScriptFactory._
-val ml = new MLContext(sc)
+val ml = new MLContext(spark)
 {% endhighlight %}
 </div>
 
@@ -70,7 +70,7 @@ import org.apache.sysml.api.mlcontext._
 scala> import org.apache.sysml.api.mlcontext.ScriptFactory._
 import org.apache.sysml.api.mlcontext.ScriptFactory._
 
-scala> val ml = new MLContext(sc)
+scala> val ml = new MLContext(spark)
 
 Welcome to Apache SystemML!
 
@@ -1753,7 +1753,7 @@ Archiver-Version: Plexus Archiver
 Artifact-Id: systemml
 Build-Jdk: 1.8.0_60
 Build-Time: 2017-02-03 22:32:43 UTC
-Built-By: deroneriksson
+Built-By: sparkuser
 Created-By: Apache Maven 3.3.9
 Group-Id: org.apache.systemml
 Main-Class: org.apache.sysml.api.DMLScript


[39/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1393] Exclude alg ref and lang ref dirs from doc site

Posted by de...@apache.org.
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/GLM.tex
----------------------------------------------------------------------
diff --git a/alg-ref/GLM.tex b/alg-ref/GLM.tex
new file mode 100644
index 0000000..8555a5b
--- /dev/null
+++ b/alg-ref/GLM.tex
@@ -0,0 +1,431 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Generalized Linear Models (GLM)}
+\label{sec:GLM}
+
+\noindent{\bf Description}
+\smallskip
+
+Generalized Linear Models~\cite{Gill2000:GLM,McCullagh1989:GLM,Nelder1972:GLM}
+extend the methodology of linear and logistic regression to a variety of
+distributions commonly assumed as noise effects in the response variable.
+As before, we are given a collection
+of records $(x_1, y_1)$, \ldots, $(x_n, y_n)$ where $x_i$ is a numerical vector of
+explanatory (feature) variables of size~\mbox{$\dim x_i = m$}, and $y_i$ is the
+response (dependent) variable observed for this vector.  GLMs assume that some
+linear combination of the features in~$x_i$ determines the \emph{mean}~$\mu_i$
+of~$y_i$, while the observed $y_i$ is a random outcome of a noise distribution
+$\Prob[y\mid \mu_i]\,$\footnote{$\Prob[y\mid \mu_i]$ is given by a density function
+if $y$ is continuous.}
+with that mean~$\mu_i$:
+\begin{equation*}
+x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j} 
+\,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim \Prob[y\mid \mu_i]
+\end{equation*}
+
+In linear regression the response mean $\mu_i$ \emph{equals} some linear combination
+over~$x_i$, denoted above by~$\eta_i$.
+In logistic regression with $y\in\{0, 1\}$ (Bernoulli) the mean of~$y$ is the same
+as $\Prob[y=1]$ and equals $1/(1+e^{-\eta_i})$, the logistic function of~$\eta_i$.
+In GLM, $\mu_i$ and $\eta_i$ can be related via any given smooth monotone function
+called the \emph{link function}: $\eta_i = g(\mu_i)$.  The unknown linear combination
+parameters $\beta_j$ are assumed to be the same for all records.
+
+The goal of the regression is to estimate the parameters~$\beta_j$ from the observed
+data.  Once the~$\beta_j$'s are accurately estimated, we can make predictions
+about~$y$ for a new feature vector~$x$.  To do so, compute $\eta$ from~$x$ and use
+the inverted link function $\mu = g^{-1}(\eta)$ to compute the mean $\mu$ of~$y$;
+then use the distribution $\Prob[y\mid \mu]$ to make predictions about~$y$.
+Both $g(\mu)$ and $\Prob[y\mid \mu]$ are user-provided.  Our GLM script supports
+a standard set of distributions and link functions; see below for details.
+
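+As a small, purely illustrative example of this pipeline, consider Poisson regression with
+the log link: the linear combination $\eta_i$ is mapped to the mean via
+$\mu_i = g^{-1}(\eta_i) = e^{\eta_i}$, and the observed response is modeled as
+\begin{equation*}
+x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}
+\,\,\,\,\mapsto\,\,\,\, \mu_i = e^{\eta_i}
+\,\,\,\,\mapsto\,\,\,\, y_i \sim \mathrm{Poisson}(\mu_i).
+\end{equation*}
+In terms of the input parameters described below, this choice corresponds to
+{\tt dfam=1}, {\tt vpow=1.0}, {\tt link=1}, {\tt lpow=0.0}.
+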
+\smallskip
+\noindent{\bf Usage}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}GLM.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} Y=}path/file
+{\tt{} B=}path/file
+{\tt{} fmt=}format
+{\tt{} O=}path/file
+{\tt{} Log=}path/file
+{\tt{} dfam=}int
+{\tt{} vpow=}double
+{\tt{} link=}int
+{\tt{} lpow=}double
+{\tt{} yneg=}double
+{\tt{} icpt=}int
+{\tt{} reg=}double
+{\tt{} tol=}double
+{\tt{} disp=}double
+{\tt{} moi=}int
+{\tt{} mii=}int
+
+}
+
+\smallskip
+\noindent{\bf Arguments}
+\begin{Description}
+\item[{\tt X}:]
+Location (on HDFS) to read the matrix of feature vectors; each row constitutes
+an example.
+\item[{\tt Y}:]
+Location to read the response matrix, which may have 1 or 2 columns
+\item[{\tt B}:]
+Location to store the estimated regression parameters (the $\beta_j$'s), with the
+intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available
+\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+see read/write functions in SystemML Language Reference for details.
+\item[{\tt O}:] (default:\mbox{ }{\tt " "})
+Location to write certain summary statistics described in Table~\ref{table:GLM:stats};
+by default they are printed to the standard output.
+\item[{\tt Log}:] (default:\mbox{ }{\tt " "})
+Location to store iteration-specific variables for monitoring and debugging purposes,
+see Table~\ref{table:GLM:log} for details.
+\item[{\tt dfam}:] (default:\mbox{ }{\tt 1})
+Distribution family code to specify $\Prob[y\mid \mu]$, see Table~\ref{table:commonGLMs}:\\
+{\tt 1} = power distributions with $\Var(y) = \mu^{\alpha}$;
+{\tt 2} = binomial or Bernoulli
+\item[{\tt vpow}:] (default:\mbox{ }{\tt 0.0})
+When {\tt dfam=1}, this provides the~$q$ in $\Var(y) = a\mu^q$, the power
+dependence of the variance of~$y$ on its mean.  In particular, use:\\
+{\tt 0.0} = Gaussian,
+{\tt 1.0} = Poisson,
+{\tt 2.0} = Gamma,
+{\tt 3.0} = inverse Gaussian
+\item[{\tt link}:] (default:\mbox{ }{\tt 0})
+Link function code to determine the link function~$\eta = g(\mu)$:\\
+{\tt 0} = canonical link (depends on the distribution family), see Table~\ref{table:commonGLMs};\\
+{\tt 1} = power functions,
+{\tt 2} = logit,
+{\tt 3} = probit,
+{\tt 4} = cloglog,
+{\tt 5} = cauchit
+\item[{\tt lpow}:] (default:\mbox{ }{\tt 1.0})
+When {\tt link=1}, this provides the~$s$ in $\eta = \mu^s$, the power link
+function; {\tt lpow=0.0} gives the log link $\eta = \log\mu$.  Common power links:\\
+{\tt -2.0} = $1/\mu^2$,
+{\tt -1.0} = reciprocal,
+{\tt 0.0} = log,
+{\tt 0.5} = sqrt,
+{\tt 1.0} = identity
+\item[{\tt yneg}:] (default:\mbox{ }{\tt 0.0})
+When {\tt dfam=2} and the response matrix $Y$ has 1~column,
+this specifies the $y$-value used for Bernoulli ``No'' label.
+All other $y$-values are treated as the ``Yes'' label.
+For example, {\tt yneg=-1.0} may be used when $y\in\{-1, 1\}$;
+either {\tt yneg=1.0} or {\tt yneg=2.0} may be used when $y\in\{1, 2\}$.
+\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
+Intercept and shifting/rescaling of the features in~$X$:\\
+{\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\
+{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
+{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
+\item[{\tt reg}:] (default:\mbox{ }{\tt 0.0})
+L2-regularization parameter (lambda)
+\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
+Tolerance (epsilon) used in the convergence criterion: we terminate the outer iterations
+when the deviance changes by less than this factor; see below for details
+\item[{\tt disp}:] (default:\mbox{ }{\tt 0.0})
+Dispersion parameter, or {\tt 0.0} to estimate it from data
+\item[{\tt moi}:] (default:\mbox{ }{\tt 200})
+Maximum number of outer (Fisher scoring) iterations
+\item[{\tt mii}:] (default:\mbox{ }{\tt 0})
+Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum
+limit provided
+\end{Description}
+
+
+\begin{table}[t]\small\centerline{%
+\begin{tabular}{|ll|}
+\hline
+Name & Meaning \\
+\hline
+{\tt TERMINATION\_CODE}  & A positive integer indicating success/failure as follows: \\
+                         & $1 = {}$Converged successfully;
+                           $2 = {}$Maximum \# of iterations reached; \\
+                         & $3 = {}$Input ({\tt X}, {\tt Y}) out of range;
+                           $4 = {}$Distribution/link not supported \\
+{\tt BETA\_MIN}          & Smallest beta value (regression coefficient), excluding the intercept \\
+{\tt BETA\_MIN\_INDEX}   & Column index for the smallest beta value \\
+{\tt BETA\_MAX}          & Largest beta value (regression coefficient), excluding the intercept \\
+{\tt BETA\_MAX\_INDEX}   & Column index for the largest beta value \\
+{\tt INTERCEPT}          & Intercept value, or NaN if there is no intercept (if {\tt icpt=0}) \\
+{\tt DISPERSION}         & Dispersion used to scale deviance, provided in {\tt disp} input argument \\
+                         & or estimated (same as {\tt DISPERSION\_EST}) if {\tt disp} argument is${} \leq 0$ \\
+{\tt DISPERSION\_EST}    & Dispersion estimated from the dataset \\
+{\tt DEVIANCE\_UNSCALED} & Deviance from the saturated model, assuming dispersion${} = 1.0$ \\
+{\tt DEVIANCE\_SCALED}   & Deviance from the saturated model, scaled by {\tt DISPERSION} value \\
+\hline
+\end{tabular}}
+\caption{Besides~$\beta$, GLM regression script computes a few summary statistics listed above.
+They are provided in CSV format, one comma-separated name-value pair per line.}
+\label{table:GLM:stats}
+\end{table}
+
+
+
+
+
+
+\begin{table}[t]\small\centerline{%
+\begin{tabular}{|ll|}
+\hline
+Name & Meaning \\
+\hline
+{\tt NUM\_CG\_ITERS}     & Number of inner (Conj.\ Gradient) iterations in this outer iteration \\
+{\tt IS\_TRUST\_REACHED} & $1 = {}$trust region boundary was reached, $0 = {}$otherwise \\
+{\tt POINT\_STEP\_NORM}  & L2-norm of iteration step from old point ($\beta$-vector) to new point \\
+{\tt OBJECTIVE}          & The loss function we minimize (negative partial log-likelihood) \\
+{\tt OBJ\_DROP\_REAL}    & Reduction in the objective during this iteration, actual value \\
+{\tt OBJ\_DROP\_PRED}    & Reduction in the objective predicted by a quadratic approximation \\
+{\tt OBJ\_DROP\_RATIO}   & Actual-to-predicted reduction ratio, used to update the trust region \\
+{\tt GRADIENT\_NORM}     & L2-norm of the loss function gradient (omitted if point is rejected) \\
+{\tt LINEAR\_TERM\_MIN}  & The minimum value of $X \pxp \beta$, used to check for overflows \\
+{\tt LINEAR\_TERM\_MAX}  & The maximum value of $X \pxp \beta$, used to check for overflows \\
+{\tt IS\_POINT\_UPDATED} & $1 = {}$new point accepted; $0 = {}$new point rejected, old point restored \\
+{\tt TRUST\_DELTA}       & Updated trust region size, the ``delta'' \\
+\hline
+\end{tabular}}
+\caption{
+The {\tt Log} file for GLM regression contains the above \mbox{per-}iteration
+variables in CSV format, each line containing the triple (Name, Iteration\#, Value), with Iteration\#
+being~0 for initial values.}
+\label{table:GLM:log}
+\end{table}
+
+\begin{table}[t]\hfil
+\begin{tabular}{|ccccccc|}
+\hline
+\multicolumn{4}{|c}{INPUT PARAMETERS}              & Distribution  & Link      & Cano- \\
+{\tt dfam} & {\tt vpow} & {\tt link} & {\tt\ lpow} & family        & function  & nical?\\
+\hline
+{\tt 1}    & {\tt 0.0}  & {\tt 1}    & {\tt -1.0}  & Gaussian      & inverse   &       \\
+{\tt 1}    & {\tt 0.0}  & {\tt 1}    & {\tt\ 0.0}  & Gaussian      & log       &       \\
+{\tt 1}    & {\tt 0.0}  & {\tt 1}    & {\tt\ 1.0}  & Gaussian      & identity  & Yes   \\
+{\tt 1}    & {\tt 1.0}  & {\tt 1}    & {\tt\ 0.0}  & Poisson       & log       & Yes   \\
+{\tt 1}    & {\tt 1.0}  & {\tt 1}    & {\tt\ 0.5}  & Poisson       & sq.root   &       \\
+{\tt 1}    & {\tt 1.0}  & {\tt 1}    & {\tt\ 1.0}  & Poisson       & identity  &       \\
+{\tt 1}    & {\tt 2.0}  & {\tt 1}    & {\tt -1.0}  & Gamma         & inverse   & Yes   \\
+{\tt 1}    & {\tt 2.0}  & {\tt 1}    & {\tt\ 0.0}  & Gamma         & log       &       \\
+{\tt 1}    & {\tt 2.0}  & {\tt 1}    & {\tt\ 1.0}  & Gamma         & identity  &       \\
+{\tt 1}    & {\tt 3.0}  & {\tt 1}    & {\tt -2.0}  & Inverse Gauss & $1/\mu^2$ & Yes   \\
+{\tt 1}    & {\tt 3.0}  & {\tt 1}    & {\tt -1.0}  & Inverse Gauss & inverse   &       \\
+{\tt 1}    & {\tt 3.0}  & {\tt 1}    & {\tt\ 0.0}  & Inverse Gauss & log       &       \\
+{\tt 1}    & {\tt 3.0}  & {\tt 1}    & {\tt\ 1.0}  & Inverse Gauss & identity  &       \\
+\hline
+{\tt 2}    & {\tt  *}   & {\tt 1}    & {\tt\ 0.0}  & Binomial      & log       &       \\
+{\tt 2}    & {\tt  *}   & {\tt 1}    & {\tt\ 0.5}  & Binomial      & sq.root   &       \\
+{\tt 2}    & {\tt  *}   & {\tt 2}    & {\tt\  *}   & Binomial      & logit     & Yes   \\
+{\tt 2}    & {\tt  *}   & {\tt 3}    & {\tt\  *}   & Binomial      & probit    &       \\
+{\tt 2}    & {\tt  *}   & {\tt 4}    & {\tt\  *}   & Binomial      & cloglog   &       \\
+{\tt 2}    & {\tt  *}   & {\tt 5}    & {\tt\  *}   & Binomial      & cauchit   &       \\
+\hline
+\end{tabular}\hfil
+\caption{Common GLM distribution families and link functions.
+(Here ``{\tt *}'' stands for ``any value.'')}
+\label{table:commonGLMs}
+\end{table}
+
+\noindent{\bf Details}
+\smallskip
+
+In GLM, the noise distribution $\Prob[y\mid \mu]$ of the response variable~$y$
+given its mean~$\mu$ is restricted to have the \emph{exponential family} form
+\begin{equation}
+Y \sim\, \Prob[y\mid \mu] \,=\, \exp\left(\frac{y\theta - b(\theta)}{a}
++ c(y, a)\right),\,\,\textrm{where}\,\,\,\mu = \E(Y) = b'(\theta).
+\label{eqn:GLM}
+\end{equation}
+Changing the mean in such a distribution simply multiplies all \mbox{$\Prob[y\mid \mu]$}
+by~$e^{\,y\hspace{0.2pt}\theta/a}$ and rescales them so that they again integrate to~1.
+Parameter $\theta$ is called \emph{canonical}, and the function $\theta = b'^{\,-1}(\mu)$
+that relates it to the mean is called the~\emph{canonical link}; constant~$a$ is called
+\emph{dispersion} and rescales the variance of~$y$.  Many common distributions can be put
+into this form, see Table~\ref{table:commonGLMs}.  The canonical parameter~$\theta$
+is often chosen to coincide with~$\eta$, the linear combination of the regression features;
+other choices for~$\eta$ are possible too.
+
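+For example, the Poisson distribution with mean~$\mu$ fits the form of Eq.~(\ref{eqn:GLM})
+with $\theta = \log\mu$, $b(\theta) = e^{\theta}$, $a = 1$, and $c(y, a) = -\log(y!)$:
+\begin{equation*}
+\Prob[y\mid \mu] \,=\, \frac{e^{-\mu}\mu^{\,y}}{y!} \,=\,
+\exp\big(y\hspace{0.5pt}\theta - e^{\theta} - \log(y!)\big),
+\end{equation*}
+so that $b'(\theta) = e^{\theta} = \mu$, i.e.\ the canonical link of the Poisson
+distribution is $\theta = \log\mu$.
+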
+Rather than specifying the canonical link, GLM distributions are commonly defined
+by their variance $\Var(y)$ as the function of the mean~$\mu$.  It can be shown
+from Eq.~(\ref{eqn:GLM}) that $\Var(y) = a\,b''(\theta) = a\,b''(b'^{\,-1}(\mu))$.
+For example, for the Bernoulli distribution $\Var(y) = \mu(1-\mu)$, for the Poisson
+distribution \mbox{$\Var(y) = \mu$}, and for the Gaussian distribution
+$\Var(y) = a\cdot 1 = \sigma^2$.
+It turns out that for many common distributions $\Var(y) = a\mu^q$, a power function.
+We support all distributions where $\Var(y) = a\mu^q$, as well as the Bernoulli and
+the binomial distributions.
+
+For distributions with $\Var(y) = a\mu^q$ the canonical link is also a power function,
+namely $\theta = \mu^{1-q}/(1-q)$, except for the Poisson ($q = 1$) whose canonical link is
+$\theta = \log\mu$.  We support all power link functions in the form $\eta = \mu^s$,
+dropping any constant factor, with $\eta = \log\mu$ for $s=0$.  The binomial distribution
+has its own family of link functions, which includes logit (the canonical link),
+probit, cloglog, and cauchit (see Table~\ref{table:binomial_links}); we support these
+only for the binomial and Bernoulli distributions.  Links and distributions are specified
+via four input parameters: {\tt dfam}, {\tt vpow}, {\tt link}, and {\tt lpow} (see
+Table~\ref{table:commonGLMs}).
+
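+As a purely illustrative sketch (in Python/NumPy, not part of {\tt GLM.dml} itself), the
+parameters {\tt vpow} and {\tt lpow} translate into a variance function and a power link
+roughly as follows; the function names are ours and only serve to make the mapping concrete:
+\begin{verbatim}
+import numpy as np
+
+def variance_function(mu, vpow):
+    # v(mu) = mu^q, i.e. Var(y) = a * mu^q up to the dispersion a
+    return np.power(mu, vpow)
+
+def power_link(mu, lpow):
+    # eta = g(mu) = mu^s for s != 0, and eta = log(mu) for s = 0 (log link)
+    return np.log(mu) if lpow == 0.0 else np.power(mu, lpow)
+
+def power_link_inverse(eta, lpow):
+    # mu = g^{-1}(eta): exp(eta) for the log link, eta^(1/s) otherwise
+    return np.exp(eta) if lpow == 0.0 else np.power(eta, 1.0 / lpow)
+\end{verbatim}
+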
+\begin{table}[t]\hfil
+\begin{tabular}{|cc|cc|}
+\hline
+Name & Link function & Name & Link function \\
+\hline
+Logit   & $\displaystyle \eta = \log \big(\mu / (1 - \mu)\big)^{\mathstrut}$ &
+Cloglog & $\displaystyle \eta = \log \big(\!- \log(1 - \mu)\big)^{\mathstrut}$ \\
+Probit  & $\displaystyle \mu  = \frac{1}{\sqrt{2\pi}}\int\nolimits_{-\infty_{\mathstrut}}^{\,\eta\mathstrut}
+          \!\!\!\!\! e^{-\frac{t^2}{2}} dt$ & 
+Cauchit & $\displaystyle \eta = \tan\pi(\mu - 1/2)$ \\
+\hline
+\end{tabular}\hfil
+\caption{The supported non-power link functions for the Bernoulli and the binomial
+distributions.  (Here $\mu$~is the Bernoulli mean.)}
+\label{table:binomial_links}
+\end{table}
+
+The observed response values are provided to the regression script as matrix~$Y$
+having 1 or 2 columns.  If a power distribution family is selected ({\tt dfam=1}),
+matrix $Y$ must have 1~column that provides $y_i$ for each~$x_i$ in the corresponding
+row of matrix~$X$.  When {\tt dfam=2} and $Y$ has 1~column, we assume the Bernoulli
+distribution for $y_i\in\{y_{\mathrm{neg}}, y_{\mathrm{pos}}\}$ with $y_{\mathrm{neg}}$
+from the input parameter {\tt yneg} and with $y_{\mathrm{pos}} \neq y_{\mathrm{neg}}$.  
+When {\tt dfam=2} and $Y$ has 2~columns, we assume the
+binomial distribution; for each row~$i$ in~$X$, cells $Y[i, 1]$ and $Y[i, 2]$ provide
+the positive and the negative binomial counts respectively.  Internally we convert
+the 1-column Bernoulli into the 2-column binomial with 0-versus-1 counts.
+
+We estimate the regression parameters via L2-regularized negative log-likelihood
+minimization:
+\begin{equation*}
+f(\beta; X, Y) \,\,=\,\, -\sum\nolimits_{i=1}^n \big(y_i\theta_i - b(\theta_i)\big)
+\,+\,(\lambda/2) \sum\nolimits_{j=1}^m \beta_j^2\,\,\to\,\,\min
+\end{equation*}
+where $\theta_i$ and $b(\theta_i)$ are from~(\ref{eqn:GLM}); note that $a$
+and $c(y, a)$ are constant w.r.t.~$\beta$ and can be ignored here.
+The canonical parameter $\theta_i$ depends on both $\beta$ and~$x_i$:
+\begin{equation*}
+\theta_i \,\,=\,\, b'^{\,-1}(\mu_i) \,\,=\,\, b'^{\,-1}\big(g^{-1}(\eta_i)\big) \,\,=\,\,
+\big(b'^{\,-1}\circ g^{-1}\big)\left(\beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}\right)
+\end{equation*}
+The user-provided (via {\tt reg}) regularization coefficient $\lambda\geq 0$ can be used
+to mitigate overfitting and degeneracy in the data.  Note that the intercept is never
+regularized.
+
+Our iterative minimizer for $f(\beta; X, Y)$ uses the Fisher scoring approximation
+to the difference $\varDelta f(z; \beta) = f(\beta + z; X, Y) \,-\, f(\beta; X, Y)$,
+recomputed at each iteration:
+\begin{gather*}
+\varDelta f(z; \beta) \,\,\,\approx\,\,\, 1/2 \cdot z^T A z \,+\, G^T z,
+\,\,\,\,\textrm{where}\,\,\,\, A \,=\, X^T\!\diag(w) X \,+\, \lambda I\\
+\textrm{and}\,\,\,\,G \,=\, - X^T u \,+\, \lambda\beta,
+\,\,\,\textrm{with $n\,{\times}\,1$ vectors $w$ and $u$ given by}\\
+\forall\,i = 1\ldots n: \,\,\,\,
+w_i = \big[v(\mu_i)\,g'(\mu_i)^2\big]^{-1}
+\!\!\!\!\!\!,\,\,\,\,\,\,\,\,\,
+u_i = (y_i - \mu_i)\big[v(\mu_i)\,g'(\mu_i)\big]^{-1}
+\!\!\!\!\!\!.\,\,\,\,
+\end{gather*}
+Here $v(\mu_i)=\Var(y_i)/a$, the variance of $y_i$ as the function of the mean, and
+$g'(\mu_i) = d \eta_i/d \mu_i$ is the link function derivative.  The Fisher scoring
+approximation is minimized by trust-region conjugate gradient iterations (called the
+\emph{inner} iterations, with the Fisher scoring iterations as the \emph{outer}
+iterations), which approximately solve the following problem:
+\begin{equation*}
+1/2 \cdot z^T A z \,+\, G^T z \,\,\to\,\,\min\,\,\,\,\textrm{subject to}\,\,\,\,
+\|z\|_2 \leq \delta
+\end{equation*}
+The conjugate gradient algorithm closely follows Algorithm~7.2 on page~171
+of~\cite{Nocedal2006:Optimization}.
+The trust region size $\delta$ is initialized as $0.5\sqrt{m}\,/ \max\nolimits_i \|x_i\|_2$
+and updated as described in~\cite{Nocedal2006:Optimization}.
+The user can specify the maximum number of the outer and the inner iterations with
+input parameters {\tt moi} and {\tt mii}, respectively.  The Fisher scoring algorithm
+terminates successfully if $2|\varDelta f(z; \beta)| < (D_1(\beta) + 0.1)\hspace{0.5pt}\eps$
+where $\eps > 0$ is a tolerance supplied by the user via {\tt tol}, and $D_1(\beta)$ is
+the unit-dispersion deviance estimated as
+\begin{equation*}
+D_1(\beta) \,\,=\,\, 2 \cdot \big(\log\Prob[Y \mid \!
+\begin{smallmatrix}\textrm{saturated}\\\textrm{model}\end{smallmatrix}, a\,{=}\,1]
+\,\,-\,\,\log\Prob[Y \mid X, \beta, a\,{=}\,1]\,\big)
+\end{equation*}
+The deviance estimate is also produced as part of the output.  Once the Fisher scoring
+algorithm terminates, if requested by the user, we estimate the dispersion~$a$ from
+Eq.~\ref{eqn:GLM} using Pearson residuals
+\begin{equation}
+\hat{a} \,\,=\,\, \frac{1}{n-m}\cdot \sum_{i=1}^n \frac{(y_i - \mu_i)^2}{v(\mu_i)}
+\label{eqn:dispersion}
+\end{equation}
+and use it to adjust our deviance estimate: $D_{\hat{a}}(\beta) = D_1(\beta)/\hat{a}$.
+If input argument {\tt disp} is {\tt 0.0} we estimate $\hat{a}$, otherwise we use its
+value as~$a$.  Note that in~(\ref{eqn:dispersion}) $m$~counts the intercept
+($m \leftarrow m+1$) if it is present.
+
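+For illustration only, the dispersion estimate of Eq.~(\ref{eqn:dispersion}) could be
+computed in Python/NumPy as sketched below.  This is not the {\tt GLM.dml} implementation;
+the names {\tt y}, {\tt mu}, and {\tt v\_mu} are assumed to hold the responses, the fitted
+means, and the variance-function values $v(\mu_i)$, respectively:
+\begin{verbatim}
+import numpy as np
+
+def estimate_dispersion(y, mu, v_mu, m, has_intercept=True):
+    # Pearson-residual estimate: a_hat = 1/(n - m) * sum((y - mu)^2 / v(mu)),
+    # where m additionally counts the intercept if one is present
+    n = y.shape[0]
+    dof = n - (m + 1 if has_intercept else m)
+    return np.sum((y - mu) ** 2 / v_mu) / dof
+\end{verbatim}
+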
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+The estimated regression parameters (the $\hat{\beta}_j$'s) are populated into
+a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}''
+input argument.  What this matrix contains, and its size, depends on the input
+argument {\tt icpt}, which specifies the user's intercept and rescaling choice:
+\begin{Description}
+\item[{\tt icpt=0}:] No intercept, matrix~$B$ has size $m\,{\times}\,1$, with
+$B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to~$m = {}$ncol$(X)$.
+\item[{\tt icpt=1}:] There is intercept, but no shifting/rescaling of~$X$; matrix~$B$
+has size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from 1 to~$m$,
+and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept coefficient.
+\item[{\tt icpt=2}:] There is intercept, and the features in~$X$ are shifted to
+mean${} = 0$ and rescaled to variance${} = 1$; then there are two versions of
+the~$\hat{\beta}_j$'s, one for the original features and another for the
+shifted/rescaled features.  Now matrix~$B$ has size $(m\,{+}\,1) \times 2$, with
+$B[\cdot, 1]$ for the original features and $B[\cdot, 2]$ for the shifted/rescaled
+features, in the above format.  Note that $B[\cdot, 2]$ are iteratively estimated
+and $B[\cdot, 1]$ are obtained from $B[\cdot, 2]$ by complementary shifting and
+rescaling.
+\end{Description}
+Our script also estimates the dispersion $\hat{a}$ (or takes it from the user's input)
+and the deviances $D_1(\hat{\beta})$ and $D_{\hat{a}}(\hat{\beta})$, see
+Table~\ref{table:GLM:stats} for details.  A log file with variables monitoring
+progress through the iterations can also be made available, see Table~\ref{table:GLM:log}.
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+\hml -f GLM.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
+  B=/user/biadmin/B.mtx fmt=csv dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.01 tol=0.00000001
+  disp=1.0 moi=100 mii=10 O=/user/biadmin/stats.csv Log=/user/biadmin/log.csv
+
+}
+
+\smallskip
+\noindent{\bf See Also}
+\smallskip
+
+In case of binary classification problems, consider using L2-SVM or binary logistic
+regression; for multiclass classification, use multiclass~SVM or multinomial logistic
+regression.  For the special cases of linear regression and logistic regression, it
+may be more efficient to use the corresponding specialized scripts instead of~GLM.

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/GLMpredict.tex
----------------------------------------------------------------------
diff --git a/alg-ref/GLMpredict.tex b/alg-ref/GLMpredict.tex
new file mode 100644
index 0000000..ceb249d
--- /dev/null
+++ b/alg-ref/GLMpredict.tex
@@ -0,0 +1,474 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Regression Scoring and Prediction}
+
+\noindent{\bf Description}
+\smallskip
+
+Script {\tt GLM-predict.dml} is intended to cover all linear model based regressions,
+including linear regression, binomial and multinomial logistic regression, and GLM
+regressions (Poisson, gamma, binomial with probit link~etc.).  Having just one scoring
+script for all these regressions simplifies maintenance and enhancement while ensuring
+compatible interpretations for output statistics.
+
+The script performs two functions, prediction and scoring.  To perform prediction,
+the script takes two matrix inputs: a collection of records $X$ (without the response
+attribute) and the estimated regression parameters~$B$, also known as~$\beta$.  
+To perform scoring, in addition to $X$ and~$B$, the script takes the matrix of actual
+response values~$Y$ that are compared to the predictions made with $X$ and~$B$.  Of course
+there are other, non-matrix, input arguments that specify the model and the output
+format, see below for the full list.
+
+We assume that our test/scoring dataset is given by $n\,{\times}\,m$-matrix $X$ of
+numerical feature vectors, where each row~$x_i$ represents one feature vector of one
+record; we have \mbox{$\dim x_i = m$}.  Each record also includes the response
+variable~$y_i$ that may be numerical, single-label categorical, or multi-label categorical.
+A single-label categorical $y_i$ is an integer category label, one label per record;
+a multi-label $y_i$ is a vector of integer counts, one count for each possible label,
+which represents multiple single-label events (observations) for the same~$x_i$.  Internally
+we convert single-label categoricals into multi-label categoricals by replacing each
+label~$l$ with an indicator vector~$(0,\ldots,0,1_l,0,\ldots,0)$.  In prediction-only
+tasks the actual $y_i$'s are not needed by the script, but they are needed for scoring.
+
+To perform prediction, the script matrix-multiplies $X$ and $B$, adding the intercept
+if available, then applies the inverse of the model's link function.  
+All GLMs assume that the linear combination of the features in~$x_i$ and the betas
+in~$B$ determines the means~$\mu_i$ of the~$y_i$'s (in numerical or multi-label
+categorical form) with $\dim \mu_i = \dim y_i$.  The observed $y_i$ is assumed to follow
+a specified GLM family distribution $\Prob[y\mid \mu_i]$ with mean(s)~$\mu_i$:
+\begin{equation*}
+x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j} 
+\,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim \Prob[y\mid \mu_i]
+\end{equation*}
+If $y_i$ is numerical, the predicted mean $\mu_i$ is a real number.  Then our script's
+output matrix $M$ is the $n\,{\times}\,1$-vector of these means~$\mu_i$.
+Note that $\mu_i$ predicts the mean of $y_i$, not the actual~$y_i$.  For example,
+in Poisson distribution, the mean is usually fractional, but the actual~$y_i$ is
+always integer.
+
+If $y_i$ is categorical, i.e.\ a vector of label counts for record~$i$, then $\mu_i$
+is a vector of non-negative real numbers, one number $\mu_{i,l}$ per each label~$l$.
+In this case we divide the $\mu_{i,l}$ by their sum $\sum_l \mu_{i,l}$ to obtain
+predicted label probabilities~\mbox{$p_{i,l}\in [0, 1]$}.  The output matrix $M$ is
+the $n \times (k\,{+}\,1)$-matrix of these probabilities, where $n$ is the number of
+records and $k\,{+}\,1$ is the number of categories\footnote{We use $k+1$ because
+there are $k$ non-baseline categories and one baseline category, with regression
+parameters $B$ having $k$~columns.}.  Note again that we do not predict the labels
+themselves, nor their actual counts per record, but we predict the labels' probabilities. 
+
+Going from predicted probabilities to predicted labels, in the single-label categorical
+case, requires extra information such as the cost of false positive versus
+false negative errors.  For example, if there are 5 categories and we \emph{accurately}
+predicted their probabilities as $(0.1, 0.3, 0.15, 0.2, 0.25)$, just picking the
+highest-probability label would be wrong 70\% of the time, whereas picking the
+lowest-probability label might be right if, say, it represents a diagnosis of cancer
+or another rare and serious outcome.  Hence, we keep this step outside the scope of
+{\tt GLM-predict.dml} for now.
+
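+For the binomial/logit case ({\tt dfam=2}, {\tt link=2}), the prediction step can be
+sketched in Python/NumPy as follows.  This is illustrative only, not the
+{\tt GLM-predict.dml} implementation; it assumes a single-column matrix of betas laid out
+as described under input argument {\tt B} below:
+\begin{verbatim}
+import numpy as np
+
+def predict_binomial_logit(X, B):
+    # X: n x m features; B: m x 1 (no intercept) or (m+1) x 1 with the
+    # intercept stored in the last row
+    m = X.shape[1]
+    eta = X @ B[:m, 0]
+    if B.shape[0] == m + 1:
+        eta = eta + B[m, 0]                 # add the intercept
+    p_yes = 1.0 / (1.0 + np.exp(-eta))      # inverse of the logit link
+    return np.column_stack([p_yes, 1.0 - p_yes])  # column 2 = "No" probability
+\end{verbatim}
+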
+\smallskip
+\noindent{\bf Usage}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}GLM-predict.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} Y=}path/file
+{\tt{} B=}path/file
+{\tt{} M=}path/file
+{\tt{} O=}path/file
+{\tt{} dfam=}int
+{\tt{} vpow=}double
+{\tt{} link=}int
+{\tt{} lpow=}double
+{\tt{} disp=}double
+{\tt{} fmt=}format
+
+}
+
+\smallskip
+\noindent{\bf Arguments}
+\begin{Description}
+\item[{\tt X}:]
+Location (on HDFS) to read the $n\,{\times}\,m$-matrix $X$ of feature vectors, each row
+constitutes one feature vector (one record)
+\item[{\tt Y}:] (default:\mbox{ }{\tt " "})
+Location to read the response matrix $Y$ needed for scoring (but optional for prediction),
+with the following dimensions: \\
+    $n \:{\times}\: 1$: acceptable for all distributions ({\tt dfam=1} or {\tt 2} or {\tt 3}) \\
+    $n \:{\times}\: 2$: for binomial ({\tt dfam=2}) if given by (\#pos, \#neg) counts \\
+    $n \:{\times}\: k\,{+}\,1$: for multinomial ({\tt dfam=3}) if given by category counts
+\item[{\tt M}:] (default:\mbox{ }{\tt " "})
+Location to write, if requested, the matrix of predicted response means (for {\tt dfam=1}) or
+probabilities (for {\tt dfam=2} or {\tt 3}):\\
+    $n \:{\times}\: 1$: for power-type distributions ({\tt dfam=1}) \\
+    $n \:{\times}\: 2$: for binomial distribution ({\tt dfam=2}), col\#~2 is the ``No'' probability \\
+    $n \:{\times}\: k\,{+}\,1$: for multinomial logit ({\tt dfam=3}), col\#~$k\,{+}\,1$ is for the baseline
+\item[{\tt B}:]
+Location to read matrix $B$ of the \mbox{betas}, i.e.\ estimated GLM regression parameters,
+with the intercept at row\#~$m\,{+}\,1$ if available:\\
+    $\dim(B) \,=\, m \:{\times}\: k$: do not add intercept \\
+    $\dim(B) \,=\, (m\,{+}\,1) \:{\times}\: k$: add intercept as given by the last $B$-row \\
+    if $k > 1$, use only $B${\tt [, 1]} unless it is Multinomial Logit ({\tt dfam=3})
+\item[{\tt O}:] (default:\mbox{ }{\tt " "})
+Location to store the CSV-file with goodness-of-fit statistics defined in
+Table~\ref{table:GLMpred:stats}, the default is to print them to the standard output
+\item[{\tt dfam}:] (default:\mbox{ }{\tt 1})
+GLM distribution family code to specify the type of distribution $\Prob[y\,|\,\mu]$
+that we assume: \\
+{\tt 1} = power distributions with $\Var(y) = \mu^{\alpha}$, see Table~\ref{table:commonGLMs};\\
+{\tt 2} = binomial; 
+{\tt 3} = multinomial logit
+\item[{\tt vpow}:] (default:\mbox{ }{\tt 0.0})
+Power for variance defined as (mean)${}^{\textrm{power}}$ (ignored if {\tt dfam}$\,{\neq}\,1$):
+when {\tt dfam=1}, this provides the~$q$ in $\Var(y) = a\mu^q$, the power
+dependence of the variance of~$y$ on its mean.  In particular, use:\\
+{\tt 0.0} = Gaussian,
+{\tt 1.0} = Poisson,
+{\tt 2.0} = Gamma,
+{\tt 3.0} = inverse Gaussian
+\item[{\tt link}:] (default:\mbox{ }{\tt 0})
+Link function code to determine the link function~$\eta = g(\mu)$, ignored for
+multinomial logit ({\tt dfam=3}):\\
+{\tt 0} = canonical link (depends on the distribution family), see Table~\ref{table:commonGLMs};\\
+{\tt 1} = power functions,
+{\tt 2} = logit,
+{\tt 3} = probit,
+{\tt 4} = cloglog,
+{\tt 5} = cauchit
+\item[{\tt lpow}:] (default:\mbox{ }{\tt 1.0})
+Power for link function defined as (mean)${}^{\textrm{power}}$ (ignored if {\tt link}$\,{\neq}\,1$):
+when {\tt link=1}, this provides the~$s$ in $\eta = \mu^s$, the power link
+function; {\tt lpow=0.0} gives the log link $\eta = \log\mu$.  Common power links:\\
+{\tt -2.0} = $1/\mu^2$,
+{\tt -1.0} = reciprocal,
+{\tt 0.0} = log,
+{\tt 0.5} = sqrt,
+{\tt 1.0} = identity
+\item[{\tt disp}:] (default:\mbox{ }{\tt 1.0})
+Dispersion value, when available; must be positive
+\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+Matrix {\tt M} file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+see read/write functions in SystemML Language Reference for details.
+\end{Description}
+
+
+\begin{table}[t]\small\centerline{%
+\begin{tabular}{|lccl|}
+\hline
+Name & \hspace{-0.6em}CID\hspace{-0.5em} & \hspace{-0.3em}Disp?\hspace{-0.6em} & Meaning \\
+\hline
+{\tt LOGLHOOD\_Z}          &   & + & Log-likelihood $Z$-score (in st.\ dev.'s from the mean) \\
+{\tt LOGLHOOD\_Z\_PVAL}    &   & + & Log-likelihood $Z$-score p-value, two-sided \\
+{\tt PEARSON\_X2}          &   & + & Pearson residual $X^2$-statistic \\
+{\tt PEARSON\_X2\_BY\_DF}  &   & + & Pearson $X^2$ divided by degrees of freedom \\
+{\tt PEARSON\_X2\_PVAL}    &   & + & Pearson $X^2$ p-value \\
+{\tt DEVIANCE\_G2}         &   & + & Deviance from the saturated model $G^2$-statistic \\
+{\tt DEVIANCE\_G2\_BY\_DF} &   & + & Deviance $G^2$ divided by degrees of freedom \\
+{\tt DEVIANCE\_G2\_PVAL}   &   & + & Deviance $G^2$ p-value \\
+{\tt AVG\_TOT\_Y}          & + &   & $Y$-column average for an individual response value \\
+{\tt STDEV\_TOT\_Y}        & + &   & $Y$-column st.\ dev.\ for an individual response value \\
+{\tt AVG\_RES\_Y}          & + &   & $Y$-column residual average of $Y - \mathop{\mathrm{pred.}\,\mathrm{mean}}(Y|X)$ \\
+{\tt STDEV\_RES\_Y}        & + &   & $Y$-column residual st.\ dev.\ of $Y - \mathop{\mathrm{pred.}\,\mathrm{mean}}(Y|X)$ \\
+{\tt PRED\_STDEV\_RES}     & + & + & Model-predicted $Y$-column residual st.\ deviation\\
+{\tt PLAIN\_R2}            & + &   & Plain $R^2$ of $Y$-column residual with bias included \\
+{\tt ADJUSTED\_R2}         & + &   & Adjusted $R^2$ of $Y$-column residual w.\ bias included \\
+{\tt PLAIN\_R2\_NOBIAS}    & + &   & Plain $R^2$ of $Y$-column residual, bias subtracted \\
+{\tt ADJUSTED\_R2\_NOBIAS} & + &   & Adjusted $R^2$ of $Y$-column residual, bias subtracted \\
+\hline
+\end{tabular}}
+\caption{The above goodness-of-fit statistics are provided in CSV format, one per line, with four
+columns: (Name, [CID], [Disp?], Value).  The columns are: 
+``Name'' is the string identifier for the statistic, see the table;
+``CID'' is an optional integer value that specifies the $Y$-column index for \mbox{per-}column statistics
+(note that a bi-/multinomial one-column {\tt Y}-input is converted into multi-column);
+``Disp?'' is an optional Boolean value ({\tt TRUE} or {\tt FALSE}) that tells us
+whether or not scaling by the input dispersion parameter {\tt disp} has been applied to this
+statistic;
+``Value''  is the value of the statistic.}
+\label{table:GLMpred:stats}
+\end{table}
+
+\noindent{\bf Details}
+\smallskip
+
+The output matrix $M$ of predicted means (or probabilities) is computed by matrix-multiplying $X$
+with the first column of~$B$ or with the whole~$B$ in the multinomial case, adding the intercept
+if available (conceptually, appending an extra column of ones to~$X$); then applying the inverse
+of the model's link function.  The difference between ``means'' and ``probabilities'' in the
+categorical case becomes significant when there are ${\geq}\,2$ observations per record
+(with the multi-label records) or when the labels such as $-1$ and~$1$ are viewed and averaged
+as numerical response values (with the single-label records).  To avoid any \mbox{mix-up} or
+information loss, we separately return the predicted probability of each category label for each
+record.
+
+When the ``actual'' response values $Y$ are available, the summary statistics are computed
+and written out as described in Table~\ref{table:GLMpred:stats}.  Below we discuss each of
+these statistics in detail.  Note that in the categorical case (binomial and multinomial)
+$Y$ is internally represented as the matrix of observation counts for each label in each record,
+rather than just the label~ID for each record.  The input~$Y$ may already be a matrix of counts,
+in which case it is used as-is.  But if $Y$ is given as a vector of response labels, each
+response label is converted into an indicator vector $(0,\ldots,0,1_l,0,\ldots,0)$ where~$l$
+is the label~ID for this record.  All negative (e.g.~$-1$) or zero label~IDs are converted to
+the $1 + {}$maximum label~ID.  The largest label~ID is viewed as the ``baseline'' as explained
+in the section on Multinomial Logistic Regression.  We assume that there are $k\geq 1$
+non-baseline categories and one (last) baseline category.
+
+We also estimate residual variances for each response value, although we do not output them,
+but use them only inside the summary statistics, scaled and unscaled by the input dispersion
+parameter {\tt disp}, as described below.
+
+\smallskip
+{\tt LOGLHOOD\_Z} and {\tt LOGLHOOD\_Z\_PVAL} statistics measure how far the log-likelihood
+of~$Y$ deviates from its expected value according to the model.  The script implements them
+only for the binomial and the multinomial distributions, returning NaN for all other distributions.
+Pearson's~$X^2$ and deviance~$G^2$ often perform poorly with bi- and multinomial distributions
+due to low cell counts, hence we need this extra goodness-of-fit measure.  To compute these
+statistics, we use:
+\begin{Itemize}
+\item the $n\times (k\,{+}\,1)$-matrix~$Y$ of multi-label response counts, in which $y_{i,j}$
+is the number of times label~$j$ was observed in record~$i$;
+\item the model-estimated probability matrix~$P$ of the same dimensions that satisfies
+$\sum_{j=1}^{k+1} p_{i,j} = 1$ for all~$i=1,\ldots,n$ and where $p_{i,j}$ is the model
+probability of observing label~$j$ in record~$i$;
+\item the $n\,{\times}\,1$-vector $N$ where $N_i$ is the aggregated count of observations
+in record~$i$ (all $N_i = 1$ if each record has only one response label).
+\end{Itemize}
+We start by computing the multinomial log-likelihood of $Y$ given~$P$ and~$N$, as well as
+the expected log-likelihood given a random~$Y$ and the variance of this log-likelihood if
+$Y$ indeed follows the proposed distribution:
+\begin{align*}
+\ell (Y) \,\,&=\,\, \log \Prob[Y \,|\, P, N] \,\,=\,\, \sum_{i=1}^{n} \,\sum_{j=1}^{k+1}  \,y_{i,j}\log p_{i,j} \\
+\E_Y \ell (Y)  \,\,&=\,\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\mu_{i,j} \log p_{i,j} 
+    \,\,=\,\, \sum_{i=1}^{n}\, N_i \,\sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j} \\
+\Var_Y \ell (Y) \,&=\, \sum_{i=1}^{n} \,N_i \left(\sum_{j=1}^{k+1} \,p_{i,j} \big(\log p_{i,j}\big)^2
+    - \Bigg( \sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j}\Bigg) ^ {\!\!2\,} \right)
+\end{align*}
+Then we compute the $Z$-score as the difference between the actual and the expected
+log-likelihood $\ell(Y)$ divided by its expected standard deviation, and its two-sided
+p-value in the Normal distribution assumption ($\ell(Y)$ should approach normality due
+to the Central Limit Theorem):
+\begin{equation*}
+Z   \,=\, \frac {\ell(Y) - \E_Y \ell(Y)}{\sqrt{\Var_Y \ell(Y)}};\quad
+\mathop{\textrm{p-value}}(Z) \,=\, \Prob \Big[\,\big|\mathop{\textrm{Normal}}(0,1)\big| \, > \, |Z|\,\Big]
+\end{equation*}
+A low p-value would indicate ``underfitting'' if $Z\ll 0$ or ``overfitting'' if $Z\gg 0$.  Here
+``overfitting'' means that higher-probability labels occur more often than their probabilities
+suggest.
+
+We also apply the dispersion input ({\tt disp}) to compute the ``scaled'' version of the $Z$-score
+and its p-value.  Since $\ell(Y)$ is a linear function of~$Y$, multiplying the GLM-predicted
+variance of~$Y$ by {\tt disp} results in multiplying $\Var_Y \ell(Y)$ by the same {\tt disp}.  This, in turn,
+translates into dividing the $Z$-score by the square root of the dispersion:
+\begin{equation*}
+Z_{\texttt{disp}}  \,=\, \big(\ell(Y) \,-\, \E_Y \ell(Y)\big) \,\big/\, \sqrt{\texttt{disp}\cdot\Var_Y \ell(Y)}
+\,=\, Z / \sqrt{\texttt{disp}}
+\end{equation*}
+Finally, we recalculate the p-value with this new $Z$-score.
+
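+As a non-authoritative illustration, the $Z$-score and its two-sided p-value can be
+computed from the count matrix $Y$, the probability matrix $P$, and the dispersion in
+Python/NumPy and SciPy roughly as follows:
+\begin{verbatim}
+import numpy as np
+from scipy.stats import norm
+
+def loglhood_z(Y, P, disp=1.0):
+    # Y: n x (k+1) observed label counts; P: model probabilities (rows sum to 1)
+    N = Y.sum(axis=1)                       # observations per record
+    logP = np.log(P)
+    ll      = np.sum(Y * logP)              # actual log-likelihood
+    mean_ll = np.sum(N * np.sum(P * logP, axis=1))
+    var_ll  = np.sum(N * (np.sum(P * logP ** 2, axis=1)
+                          - np.sum(P * logP, axis=1) ** 2))
+    z = (ll - mean_ll) / np.sqrt(disp * var_ll)
+    return z, 2.0 * norm.sf(abs(z))         # two-sided p-value
+\end{verbatim}
+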
+\smallskip
+{\tt PEARSON\_X2}, {\tt PEARSON\_X2\_BY\_DF}, and {\tt PEARSON\_X2\_PVAL}:
+Pearson's residual $X^2$-statistic is a commonly used goodness-of-fit measure for linear models~\cite{McCullagh1989:GLM}.
+The idea is to measure how well the model-predicted means and variances match the actual behavior
+of response values.  For each record $i$, we estimate the mean $\mu_i$ and the variance $v_i$
+(or $\texttt{disp}\cdot v_i$) and use them to normalize the residual: 
+$r_i = (y_i - \mu_i) / \sqrt{v_i}$.  These normalized residuals are then squared, aggregated
+by summation, and tested against an appropriate $\chi^2$~distribution.  The computation of~$X^2$
+is slightly different for categorical data (bi- and multinomial) than it is for numerical data,
+since $y_i$ has multiple correlated dimensions~\cite{McCullagh1989:GLM}:
+\begin{equation*}
+X^2\,\textrm{(numer.)} \,=\,  \sum_{i=1}^{n}\, \frac{(y_i - \mu_i)^2}{v_i};\quad
+X^2\,\textrm{(categ.)} \,=\,  \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\frac{(y_{i,j} - N_i 
+\hspace{0.5pt} p_{i,j})^2}{N_i \hspace{0.5pt} p_{i,j}}
+\end{equation*}
+The number of degrees of freedom~\#d.f.\ for the $\chi^2$~distribution is $n - m$ for numerical data and
+$(n - m)k$ for categorical data, where $k = \mathop{\texttt{ncol}}(Y) - 1$.  Given the dispersion
+parameter {\tt disp}, the $X^2$ statistic is scaled by division: \mbox{$X^2_{\texttt{disp}} = X^2 / \texttt{disp}$}.
+If the dispersion is accurate, $X^2 / \texttt{disp}$ should be close to~\#d.f.  In fact, $X^2 / \textrm{\#d.f.}$
+over the \emph{training} data is the dispersion estimator used in our {\tt GLM.dml} script, 
+see~(\ref{eqn:dispersion}).  Here we provide $X^2 / \textrm{\#d.f.}$ and $X^2_{\texttt{disp}} / \textrm{\#d.f.}$
+as {\tt PEARSON\_X2\_BY\_DF} to enable dispersion comparison between the training data and
+the test data.
+
+NOTE: For categorical data, both Pearson's $X^2$ and the deviance $G^2$ are unreliable (i.e.\ do not
+approach the $\chi^2$~distribution) unless the predicted means of multi-label counts
+$\mu_{i,j} = N_i \hspace{0.5pt} p_{i,j}$ are fairly large: all ${\geq}\,1$ and 80\% are
+at least~$5$~\cite{Cochran1954:chisq}.  They should not be used for ``one label per record'' categoricals.
+
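+For the categorical case, a minimal Python/NumPy sketch of Pearson's $X^2$, its scaling by
+{\tt disp}, and the corresponding p-value (illustrative only; {\tt m} denotes the number of
+features) might look as follows:
+\begin{verbatim}
+import numpy as np
+from scipy.stats import chi2
+
+def pearson_x2_categorical(Y, P, m, disp=1.0):
+    # Y: n x (k+1) observed counts; P: predicted probabilities per record
+    N = Y.sum(axis=1, keepdims=True)
+    expected = N * P                        # N_i * p_ij
+    x2 = np.sum((Y - expected) ** 2 / expected)
+    df = (Y.shape[0] - m) * (Y.shape[1] - 1)   # (n - m) * k degrees of freedom
+    x2_disp = x2 / disp
+    return x2_disp / df, chi2.sf(x2_disp, df)  # X2_BY_DF and its p-value
+\end{verbatim}
+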
+\smallskip
+{\tt DEVIANCE\_G2}, {\tt DEVIANCE\_G2\_BY\_DF}, and {\tt DEVIANCE\_G2\_PVAL}:
+Deviance $G^2$ is the log of the likelihood ratio between the ``saturated'' model and the
+linear model being tested for the given dataset, multiplied by two:
+\begin{equation}
+G^2 \,=\, 2 \,\log \frac{\Prob[Y \mid \textrm{saturated model}\hspace{0.5pt}]}%
+{\Prob[Y \mid \textrm{tested linear model}\hspace{0.5pt}]}
+\label{eqn:GLMpred:deviance}
+\end{equation}
+The ``saturated'' model sets the mean $\mu_i^{\mathrm{sat}}$ to equal~$y_i$ for every record
+(for categorical data, $p^{\mathrm{sat}}_{i,j} = y_{i,j} / N_i$), which represents the ``perfect fit.''
+For records with $y_{i,j} \in \{0, N_i\}$ or otherwise at a boundary, by continuity we set
+$0 \log 0 = 0$.  The GLM~likelihood functions defined in~(\ref{eqn:GLM}) become simplified
+in ratio~(\ref{eqn:GLMpred:deviance}) due to canceling out the term $c(y, a)$ since it is the same
+in both models.
+
+The log of a likelihood ratio between two nested models, times two, is known to approach
+a $\chi^2$ distribution as $n\to\infty$ if both models have fixed parameter spaces.  
+But this is not the case for the ``saturated'' model: it adds more parameters with each record.  
+In practice, however, $\chi^2$ distributions are used to compute the p-value of~$G^2$~\cite{McCullagh1989:GLM}.  
+The number of degrees of freedom~\#d.f.\ and the treatment of dispersion are the same as for
+Pearson's~$X^2$, see above.
+
+\Paragraph{Column-wise statistics.}  The rest of the statistics are computed separately
+for each column of~$Y$.  As explained above, $Y$~has two or more columns in bi- and multinomial case,
+either at input or after conversion.  Moreover, each $y_{i,j}$ in record~$i$ with $N_i \geq 2$ is
+counted as $N_i$ separate observations $y_{i,j,l}$ of 0 or~1 (where $l=1,\ldots,N_i$) with
+$y_{i,j}$~ones and $N_i-y_{i,j}$ zeros.
+For power distributions, including linear regression, $Y$~has only one column and all
+$N_i = 1$, so the statistics are computed for all~$Y$ with each record counted once.
+Below we denote $N = \sum_{i=1}^n N_i \,\geq n$.
+Here is the total average and the residual average (residual bias) of~$y_{i,j,l}$ for each $Y$-column:
+\begin{equation*}
+\texttt{AVG\_TOT\_Y}_j   \,=\, \frac{1}{N} \sum_{i=1}^n  y_{i,j}; \quad
+\texttt{AVG\_RES\_Y}_j   \,=\, \frac{1}{N} \sum_{i=1}^n \, (y_{i,j} - \mu_{i,j})
+\end{equation*}
+Dividing by~$N$ (rather than~$n$) gives the averages for~$y_{i,j,l}$ (rather than~$y_{i,j}$).
+The total variance, and the standard deviation, for individual observations~$y_{i,j,l}$ is
+estimated from the total variance for response values~$y_{i,j}$ using independence assumption:
+$\Var y_{i,j} = \Var \sum_{l=1}^{N_i} y_{i,j,l} = \sum_{l=1}^{N_i} \Var y_{i,j,l}$.
+This allows us to estimate the sum of squares for~$y_{i,j,l}$ via the sum of squares for~$y_{i,j}$:
+\begin{equation*}
+\texttt{STDEV\_TOT\_Y}_j \,=\, 
+\Bigg[\frac{1}{N-1} \sum_{i=1}^n  \Big( y_{i,j} -  \frac{N_i}{N} \sum_{i'=1}^n  y_{i'\!,j}\Big)^2\Bigg]^{1/2}
+\end{equation*}
+Analogously, we estimate the standard deviation of the residual $y_{i,j,l} - \mu_{i,j,l}$:
+\begin{equation*}
+\texttt{STDEV\_RES\_Y}_j \,=\, 
+\Bigg[\frac{1}{N-m'} \,\sum_{i=1}^n  \Big( y_{i,j} - \mu_{i,j} -  \frac{N_i}{N} \sum_{i'=1}^n  (y_{i'\!,j} - \mu_{i'\!,j})\Big)^2\Bigg]^{1/2}
+\end{equation*}
+Here $m'=m$ if $m$ includes the intercept as a feature and $m'=m+1$ if it does not.
+The estimated standard deviations can be compared to the model-predicted residual standard deviation
+computed from the predicted means by the GLM variance formula and scaled by the dispersion:
+\begin{equation*}
+\texttt{PRED\_STDEV\_RES}_j \,=\, \Big[\frac{\texttt{disp}}{N} \, \sum_{i=1}^n \, v(\mu_{i,j})\Big]^{1/2}
+\end{equation*}
+We also compute the $R^2$ statistics for each column of~$Y$, see Table~\ref{table:GLMpred:R2} for details.
+We compute two versions of~$R^2$: in one version the residual sum-of-squares (RSS) includes any bias in
+the residual that might be present (due to the lack of, or inaccuracy in, the intercept); in the other
+version of~RSS the bias is subtracted by ``centering'' the residual.  In both cases we subtract the bias in the total
+sum-of-squares (in the denominator), and $m'$ equals $m$~with the intercept or $m+1$ without the intercept.
+
+\begin{table}[t]\small\centerline{%
+\begin{tabular}{|c|c|}
+\multicolumn{2}{c}{$R^2$ where the residual sum-of-squares includes the bias contribution:} \\
+\hline
+\multicolumn{1}{|l|}{\tt PLAIN\_R2${}_j \,\,= {}$} & \multicolumn{1}{l|}{\tt ADJUSTED\_R2${}_j \,\,= {}$} \\
+$ \displaystyle 1 - 
+\frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}%
+{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n  \! y_{i'\!,j} \Big)^{\! 2}} $ & 
+$ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m}}  \, 
+\frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}%
+{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n  \! y_{i'\!,j} \Big)^{\! 2}} $ \\
+\hline
+\multicolumn{2}{c}{} \\
+\multicolumn{2}{c}{$R^2$ where the residual sum-of-squares is centered so that the bias is subtracted:} \\
+\hline
+\multicolumn{1}{|l|}{\tt PLAIN\_R2\_NOBIAS${}_j \,\,= {}$} & \multicolumn{1}{l|}{\tt ADJUSTED\_R2\_NOBIAS${}_j \,\,= {}$} \\
+$ \displaystyle 1 - 
+\frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\!
+    \sum\limits_{i'=1}^n  (y_{i'\!,j} \,{-}\, \mu_{i'\!,j}) \Big)^{\! 2}}%
+{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n  \! y_{i'\!,j} \Big)^{\! 2}} $ &
+$ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m'}} \, 
+\frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! 
+    \sum\limits_{i'=1}^n  (y_{i'\!,j} \,{-}\, \mu_{i'\!,j}) \Big)^{\! 2}}%
+{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n  \! y_{i'\!,j} \Big)^{\! 2}} $ \\
+\hline
+\end{tabular}}
+\caption{The $R^2$ statistics we compute in {\tt GLM-predict.dml}}
+\label{table:GLMpred:R2}
+\end{table}
+
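+As an illustration of the first formula in Table~\ref{table:GLMpred:R2}, the plain $R^2$
+(bias included) for a single $Y$-column can be sketched in Python/NumPy as follows, where
+{\tt N} holds the per-record observation counts~$N_i$:
+\begin{verbatim}
+import numpy as np
+
+def plain_r2(y, mu, N):
+    # y, mu: responses and predicted means for one Y-column; N: counts N_i
+    centered = y - (N / N.sum()) * y.sum()  # weighted centering of y
+    rss = np.sum((y - mu) ** 2)             # residual sum of squares, bias included
+    tss = np.sum(centered ** 2)             # total (centered) sum of squares
+    return 1.0 - rss / tss
+\end{verbatim}
+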
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+The matrix of predicted means (if the response is numerical) or probabilities (if the response
+is categorical), see ``Description'' subsection above for more information.  Given {\tt Y}, we
+return some statistics in CSV format as described in Table~\ref{table:GLMpred:stats} and in the
+above text.
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+Note that in the examples below the value for ``{\tt disp}'' input argument
+is set arbitrarily.  The correct dispersion value should be computed from the training
+data during model estimation, or omitted if unknown (which sets it to~{\tt 1.0}).
+
+\smallskip\noindent
+Linear regression example:
+\par\hangindent=\parindent\noindent{\tt
+\hml -f GLM-predict.dml -nvargs dfam=1 vpow=0.0 link=1 lpow=1.0 disp=5.67
+  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
+  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
+
+}\smallskip\noindent
+Linear regression example, prediction only (no {\tt Y} given):
+\par\hangindent=\parindent\noindent{\tt
+\hml -f GLM-predict.dml -nvargs dfam=1 vpow=0.0 link=1 lpow=1.0
+  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
+
+}\smallskip\noindent
+Binomial logistic regression example:
+\par\hangindent=\parindent\noindent{\tt
+\hml -f GLM-predict.dml -nvargs dfam=2 link=2 disp=3.0004464
+  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
+  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
+
+}\smallskip\noindent
+Binomial probit regression example:
+\par\hangindent=\parindent\noindent{\tt
+\hml -f GLM-predict.dml -nvargs dfam=2 link=3 disp=3.0004464
+  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
+  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
+
+}\smallskip\noindent
+Multinomial logistic regression example:
+\par\hangindent=\parindent\noindent{\tt
+\hml -f GLM-predict.dml -nvargs dfam=3 
+  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
+  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
+
+}\smallskip\noindent
+Poisson regression with the log link example:
+\par\hangindent=\parindent\noindent{\tt
+\hml -f GLM-predict.dml -nvargs dfam=1 vpow=1.0 link=1 lpow=0.0 disp=3.45
+  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
+  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
+
+}\smallskip\noindent
+Gamma regression with the inverse (reciprocal) link example:
+\par\hangindent=\parindent\noindent{\tt
+\hml -f GLM-predict.dml -nvargs dfam=1 vpow=2.0 link=1 lpow=-1.0 disp=1.99118
+  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
+  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
+
+}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/KaplanMeier.tex
----------------------------------------------------------------------
diff --git a/alg-ref/KaplanMeier.tex b/alg-ref/KaplanMeier.tex
new file mode 100644
index 0000000..6ea6fbc
--- /dev/null
+++ b/alg-ref/KaplanMeier.tex
@@ -0,0 +1,289 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{Kaplan-Meier Survival Analysis}
+\label{sec:kaplan-meier}
+
+\noindent{\bf Description}
+\smallskip
+
+
+Survival analysis examines the time needed for a particular event of interest to occur.
+In medical research, for example, the prototypical such event is the death of a patient, but the methodology can be applied to other application areas, e.g., the completion of a task by an individual in a psychological experiment or the failure of electrical components in engineering.
+The Kaplan-Meier (or product-limit) method is a simple non-parametric approach for estimating survival probabilities from both censored and uncensored survival times.\\
+
+ 
+
+\smallskip
+\noindent{\bf Usage}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}KM.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} TE=}path/file
+{\tt{} GI=}path/file
+{\tt{} SI=}path/file
+{\tt{} O=}path/file
+{\tt{} M=}path/file
+{\tt{} T=}path/file
+{\tt{} alpha=}double
+{\tt{} etype=}greenwood$\mid$peto
+{\tt{} ctype=}plain$\mid$log$\mid$log-log
+{\tt{} ttype=}none$\mid$log-rank$\mid$wilcoxon
+{\tt{} fmt=}format
+
+}
+
+\smallskip
+\noindent{\bf Arguments}
+\begin{Description}
+\item[{\tt X}:]
+Location (on HDFS) to read the input matrix of the survival data containing: 
+\begin{Itemize}
+	\item timestamps,
+	\item whether event occurred (1) or data is censored (0),
+	\item a number of factors (i.e., categorical features) for grouping and/or stratifying
+\end{Itemize}
+\item[{\tt TE}:]
+Location (on HDFS) to read the 1-column matrix $TE$ that contains the column indices of the input matrix $X$ corresponding to timestamps (first entry) and event information (second entry) 
+\item[{\tt GI}:]
+Location (on HDFS) to read the 1-column matrix $GI$ that contains the column indices of the input matrix $X$ corresponding to the factors (i.e., categorical features) to be used for grouping
+\item[{\tt SI}:]
+Location (on HDFS) to read the 1-column matrix $SI$ that contains the column indices of the input matrix $X$ corresponding to the factors (i.e., categorical features) to be used for stratifying
+\item[{\tt O}:]
+Location (on HDFS) to write the matrix containing the results of the Kaplan-Meier analysis $KM$
+\item[{\tt M}:]
+Location (on HDFS) to write matrix $M$ containing the following statistics: total number of events, median, and its confidence intervals; if survival data for multiple groups and strata are provided, each row of $M$ contains the above statistics per group and stratum.
+\item[{\tt T}:]
+If survival data from multiple groups is available and {\tt ttype=log-rank} or {\tt ttype=wilcoxon}, location (on HDFS) to write the two matrices that contain the result of the (stratified) test for comparing these groups; see below for details.
+\item[{\tt alpha}:](default:\mbox{ }{\tt 0.05})
+Parameter to compute $100(1-\alpha)\%$ confidence intervals for the survivor function and its median 
+\item[{\tt etype}:](default:\mbox{ }{\tt "greenwood"})
+Parameter to specify the error type, either "greenwood" or "peto"
+\item[{\tt ctype}:](default:\mbox{ }{\tt "log"})
+Parameter to modify the confidence interval; "plain" keeps the lower and upper bound of the confidence interval unmodified, "log" corresponds to the log transformation, and "log-log" corresponds to the complementary log-log transformation
+\item[{\tt ttype}:](default:\mbox{ }{\tt "none"})
+If survival data for multiple groups is available, specifies which test to perform for comparing 
+survival data across multiple groups: "none", "log-rank" or "wilcoxon" test
+\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+see read/write functions in SystemML Language Reference for details.
+\end{Description}
+
+
+\noindent{\bf Details}
+\smallskip
+
+The Kaplan-Meier estimate is a non-parametric maximum likelihood estimate (MLE) of the survival function $S(t)$, i.e., the probability of survival from the time origin to a given future time. 
+As an illustration, suppose that there are $n$ individuals with observed survival times $t_1,t_2,\ldots,t_n$, out of which there are $r\leq n$ distinct death times $t_{(1)}\leq t_{(2)}\leq \cdots \leq t_{(r)}$---since some of the observations may be censored, in the sense that the end-point of interest has not been observed for those individuals, and there may be more than one individual with the same survival time.
+Let $S(t_j)$ denote the probability of survival until time $t_j$, $d_j$ be the number of events at time $t_j$, and $n_j$ denote the number of individuals at risk (i.e., those who die at time $t_j$ or later).
+Assuming that the events occur independently, in the Kaplan-Meier method the probability of surviving from $t_j$ to $t_{j+1}$ is estimated from $S(t_j)$ and given by
+\begin{equation*}
+\hat{S}(t) = \prod_{j=1}^{k} \left( \frac{n_j-d_j}{n_j} \right),
+\end{equation*}   
+for $t_{(k)}\leq t<t_{(k+1)}$, $k=1,2,\ldots,r$, with $\hat{S}(t)=1$ for $t<t_{(1)}$, and $t_{(r+1)}=\infty$.
+Note that the value of $\hat{S}(t)$ is constant between times of event and therefore
+the estimate is a step function with jumps at observed event times.
+If there are no censored data, this estimator simply reduces to the empirical survivor function defined as $\frac{n_j}{n}$. Thus, the Kaplan-Meier estimate can be seen as the generalization of the empirical survivor function that handles censored observations.
+
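+For illustration, a minimal Python/NumPy sketch of this estimator (not the {\tt KM.dml}
+implementation) is given below; it returns the value of $\hat{S}$ at each distinct event time:
+\begin{verbatim}
+import numpy as np
+
+def kaplan_meier(time, event):
+    # time: observed survival/censoring times; event: 1 = event, 0 = censored
+    time, event = np.asarray(time, dtype=float), np.asarray(event, dtype=int)
+    surv, s = [], 1.0
+    for t in np.unique(time[event == 1]):   # distinct event times t_(1),...,t_(r)
+        n_j = np.sum(time >= t)             # number at risk just before t
+        d_j = np.sum((time == t) & (event == 1))  # number of events at t
+        s *= (n_j - d_j) / n_j
+        surv.append((t, s))
+    return surv                             # step function: (event time, S_hat)
+\end{verbatim}
+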
+The methodology used in our {\tt KM.dml} script closely follows~\cite[Sec.~2]{collett2003:kaplanmeier}.
+For completeness we briefly discuss the equations used in our implementation.
+
+% standard error of the survivor function
+\textbf{Standard error of the survivor function.}
+The standard error of the estimated survivor function (controlled by parameter {\tt etype}) can be calculated as  
+\begin{equation*}
+\text{se} \{\hat{S}(t)\} \approx \hat{S}(t) {\biggl\{ \sum_{j=1}^{k} \frac{d_j}{n_j(n_j - d_j)}\biggr\}}^{1/2},
+\end{equation*}
+for $t_{(k)}\leq t<t_{(k+1)}$.
+This equation is known as {\it Greenwood's formula}.
+An alternative approach is to apply {\it Peto's} expression %~\cite{PetoPABCHMMPS1979:kaplanmeier} 
+\begin{equation*}
+\text{se}\{\hat{S}(t)\}=\frac{\hat{S}(t)\sqrt{1-\hat{S}(t)}}{\sqrt{n_k}},
+\end{equation*}
+for $t_{(k)}\leq t<t_{(k+1)}$. 
+%Note that this estimate is known to be conservative producing larger standard errors than they ought to be. The Greenwood estimate is therefore recommended for general use. 
+Once the standard error of $\hat{S}$ has been found, we compute the following types of confidence intervals (controlled by parameter {\tt ctype}): 
+The ``plain'' $100(1-\alpha)\%$ confidence interval for $S(t)$ is computed using 
+\begin{equation*}
+\hat{S}(t)\pm z_{\alpha/2} \text{se}\{\hat{S}(t)\}, 
+\end{equation*} 
+where $z_{\alpha/2}$ is the upper $\alpha/2$-point of the standard normal distribution. 
+Alternatively, we can apply the ``log'' transformation using 
+\begin{equation*}
+\hat{S}(t)\exp[\pm z_{\alpha/2} \text{se}\{\hat{S}(t)\}/\hat{S}(t)]
+\end{equation*}
+or the ``log-log'' transformation using 
+\begin{equation*}
+\hat{S}(t)^{\exp [\pm z_{\alpha/2} \text{se} \{\log [-\log \hat{S}(t)]\}]}.
+\end{equation*}
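+
+Continuing the hypothetical example above, the Greenwood standard error at $t=3$ (where $\hat{S}=0.8$, $d_1=1$, $n_1=5$) is $0.8\sqrt{1/(5\cdot 4)}\approx 0.179$, so the ``plain'' $95\%$ confidence interval is $0.8\pm 1.96\cdot 0.179\approx (0.45,\,1.15)$; since the plain interval can fall outside $[0,1]$, the transformed intervals are often preferable in practice.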
+
+% standard error of the median of survival times
+\textbf{Median, its standard error and confidence interval.}
+Denote by $\hat{t}(50)$ the estimated median of $\hat{S}$, i.e.,
+$\hat{t}(50)=\min \{ t_i \mid \hat{S}(t_i) < 0.5\}$,
+where $t_i$ is the observed survival time for individual $i$.
+The standard error of $\hat{t}(50)$ is given by
+\begin{equation*}
+\text{se}\{ \hat{t}(50) \} = \frac{1}{\hat{f}\{\hat{t}(50)\}} \text{se}[\hat{S}\{ \hat{t}(50) \}],
+\end{equation*}
+where $\hat{f}\{ \hat{t}(50) \}$ can be found from
+\begin{equation*}
+\hat{f}\{ \hat{t}(50) \} = \frac{\hat{S}\{ \hat{u}(50) \} -\hat{S}\{ \hat{l}(50) \} }{\hat{l}(50) - \hat{u}(50)}. 
+\end{equation*}
+Above, $\hat{u}(50)$ is the largest survival time for which $\hat{S}$ exceeds $0.5+\epsilon$, i.e., $\hat{u}(50)=\max \bigl\{ t_{(j)} \mid \hat{S}(t_{(j)}) \geq 0.5+\epsilon \bigr\}$,
+and $\hat{l}(50)$ is the smallest survival time for which $\hat{S}$ is less than $0.5-\epsilon$,
+i.e., $\hat{l}(50)=\min \bigl\{ t_{(j)} \mid \hat{S}(t_{(j)}) \leq 0.5-\epsilon \bigr\}$,
+for small $\epsilon$.
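+
+In the hypothetical example above, $\hat{S}$ first drops below $0.5$ at the observed time $7$ (where $\hat{S}\approx 0.267$), so the estimated median is $\hat{t}(50)=7$.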
+
+
+% comparing two or more groups of data
+\textbf{Log-rank test and Wilcoxon test.}
+Our implementation supports comparison of survival data from several groups using two non-parametric procedures (controlled by parameter {\tt ttype}): the {\it log-rank test} and the {\it Wilcoxon test} (also known as the {\it Breslow test}). 
+Assume that the survival times in $g\geq 2$ groups of survival data are to be compared. 
+Consider the {\it null hypothesis} that there is no difference in the survival times of the individuals in different groups. One way to examine the null hypothesis is to consider the difference between the observed number of deaths and the number expected under the null hypothesis.  
+In both tests we define the $U$-statistics ($U_{L}$ for the log-rank test and $U_{W}$ for the Wilcoxon test) to compare the observed and the expected number of deaths in groups $1,2,\ldots,g-1$ as follows:
+\begin{align*}
+U_{Lk} &= \sum_{j=1}^{r}\left( d_{kj} - \frac{n_{kj}d_j}{n_j} \right), \\
+U_{Wk} &= \sum_{j=1}^{r}n_j\left( d_{kj} - \frac{n_{kj}d_j}{n_j} \right),
+\end{align*}
+where $d_{kj}$ is the number of deaths at time $t_{(j)}$ in group $k$, 
+$n_{kj}$ is the number of individuals at risk at time $t_{(j)}$ in group $k$, and 
+$k=1,2,\ldots,g-1$ to form the vectors $U_L$ and $U_W$ with $(g-1)$ components.
+The covariance between $U_{Lk}$ and $U_{Lk'}$ (the variance when $k=k'$) is computed as
+\begin{equation*}
+V_{Lkk'}=\sum_{j=1}^{r} \frac{n_{kj}d_j(n_j-d_j)}{n_j(n_j-1)} \left( \delta_{kk'}-\frac{n_{k'j}}{n_j} \right),
+\end{equation*}
+for $k,k'=1,2,\ldots,g-1$, with
+\begin{equation*}
+\delta_{kk'} = 
+\begin{cases}
+1 & \text{if } k=k'\\
+0 & \text{otherwise.}
+\end{cases}
+\end{equation*}
+These terms are combined in a {\it variance-covariance} matrix $V_L$ (referred to as the $V$-statistic).
+Similarly, the variance-covariance matrix for the Wilcoxon test $V_W$ is a matrix where the entry at position $(k,k')$ is given by
+\begin{equation*}
+V_{Wkk'}=\sum_{j=1}^{r} n_j^2 \frac{n_{kj}d_j(n_j-d_j)}{n_j(n_j-1)} \left( \delta_{kk'}-\frac{n_{k'j}}{n_j} \right).
+\end{equation*}
+
+Under the null hypothesis of no group differences, the test statistics $U_L^\top V_L^{-1} U_L$ for the log-rank test and  $U_W^\top V_W^{-1} U_W$ for the Wilcoxon test have a Chi-squared distribution on $(g-1)$ degrees of freedom.
+Our {\tt KM.dml} script also provides a stratified version of the log-rank or Wilcoxon test if requested.
+In this case, the values of the $U$- and $V$- statistics are computed for each stratum and then combined over all strata.
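+
+For illustration only, the following Python/NumPy sketch (not part of {\tt KM.dml}; the function name and arguments are hypothetical) computes the two-group ($g=2$) log-rank statistic and its $P$-value directly from the $U$- and $V$-definitions above:
+\begin{verbatim}
+import numpy as np
+from scipy.stats import chi2
+
+def log_rank_two_groups(time, event, group):
+    # time: observed times; event: 1 = death, 0 = censored; group: 0/1 labels
+    time, event, group = map(np.asarray, (time, event, group))
+    U = 0.0  # observed minus expected number of deaths in group 1
+    V = 0.0  # variance of U under the null hypothesis
+    for t in np.unique(time[event == 1]):
+        at_risk = time >= t
+        n_j = at_risk.sum()
+        n_1j = (at_risk & (group == 1)).sum()
+        d_j = ((time == t) & (event == 1)).sum()
+        d_1j = ((time == t) & (event == 1) & (group == 1)).sum()
+        U += d_1j - n_1j * d_j / n_j
+        if n_j > 1:
+            V += n_1j * d_j * (n_j - d_j) * (n_j - n_1j) / (n_j**2 * (n_j - 1))
+    stat = U * U / V  # chi-squared with g - 1 = 1 degree of freedom
+    return stat, chi2.sf(stat, df=1)
+\end{verbatim}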
+
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+  
+Below we list the results of the survival analysis computed by {\tt KM.dml}. 
+The calculated statistics are stored in matrix $KM$ with the following schema:
+\begin{itemize}
+	\item Column 1: timestamps 
+	\item Column 2: number of individuals at risk
+	\item Column 3: number of events
+	\item Column 4: Kaplan-Meier estimate of the survivor function $\hat{S}$ 
+	\item Column 5: standard error of $\hat{S}$
+	\item Column 6: lower bound of $100(1-\alpha)\%$ confidence interval for $\hat{S}$
+	\item Column 7: upper bound of $100(1-\alpha)\%$ confidence interval for $\hat{S}$
+\end{itemize}
+Note that if survival data for multiple groups and/or strata is available, each collection of 7 columns in $KM$ stores the results per group and/or per stratum. 
+In this case $KM$ has $7g+7s$ columns, where $g\geq 1$ and $s\geq 1$ denote the number of groups and strata, respectively. 
+
+
+Additionally, {\tt KM.dml} stores the following statistics in the 1-row matrix $M$ whose number of columns depends on the number of groups ($g$) and strata ($s$) in the data. Below $k$ denotes the number of factors used for grouping and $l$ denotes the number of factors used for stratifying. 
+\begin{itemize}
+	\item Columns 1 to $k$: unique combination of values in the $k$ factors used for grouping 
+	\item Columns $k+1$ to $k+l$: unique combination of values in the $l$ factors used for stratifying  
+	\item Column $k+l+1$: total number of records 
+	\item Column $k+l+2$: total number of events
+    \item Column $k+l+3$: median of $\hat{S}$
+    \item Column $k+l+4$: lower bound of $100(1-\alpha)\%$ confidence interval for the median of $\hat{S}$
+    \item Column $k+l+5$: upper bound of $100(1-\alpha)\%$ confidence interval for the median of $\hat{S}$. 
+\end{itemize}
+If there is only 1 group and 1 stratum available, $M$ will be a 1-row matrix with 5 columns where
+\begin{itemize}
+	\item Column 1: total number of records
+	\item Column 2: total number of events
+	\item Column 3: median of $\hat{S}$
+	\item Column 4: lower bound of $100(1-\alpha)\%$ confidence interval for the median of $\hat{S}$
+	\item Column 5: upper bound of $100(1-\alpha)\%$ confidence interval for the median of $\hat{S}$.
+\end{itemize} 
+
+If a comparison of the survival data across multiple groups needs to be performed, {\tt KM.dml} computes two matrices $T$ and $T\_GROUPS\_OE$ that contain a summary of the test. The 1-row matrix $T$ stores the following statistics: 
+\begin{itemize}
+	\item Column 1: number of groups in the survival data
+ 	\item Column 2: degree of freedom for Chi-squared distributed test statistic
+	\item Column 3: value of test statistic
+	\item Column 4: $P$-value.
+\end{itemize}
+Matrix $T\_GROUPS\_OE$ contains the following statistics for each of $g$ groups:
+\begin{itemize}
+	\item Column 1: number of events
+	\item Column 2: number of observed death times ($O$)
+	\item Column 3: number of expected death times ($E$)
+	\item Column 4: $(O-E)^2/E$
+	\item Column 5: $(O-E)^2/V$.
+\end{itemize}
+
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+	\hml -f KM.dml -nvargs X=/user/biadmin/X.mtx TE=/user/biadmin/TE
+	GI=/user/biadmin/GI SI=/user/biadmin/SI O=/user/biadmin/kaplan-meier.csv
+	M=/user/biadmin/model.csv alpha=0.01 etype=greenwood ctype=plain fmt=csv
+	
+}\smallskip
+
+{\hangindent=\parindent\noindent\tt
+	\hml -f KM.dml -nvargs X=/user/biadmin/X.mtx TE=/user/biadmin/TE
+	GI=/user/biadmin/GI SI=/user/biadmin/SI O=/user/biadmin/kaplan-meier.csv
+	M=/user/biadmin/model.csv T=/user/biadmin/test.csv alpha=0.01 etype=peto 
+	ctype=log ttype=log-rank fmt=csv
+	
+}
+
+%
+%\smallskip
+%\noindent{\bf References}
+%\begin{itemize}
+%	\item
+%	R.~Peto, M.C.~Pike, P.~Armitage, N.E.~Breslow, D.R.~Cox, S.V.~Howard, N.~Mantel, K.~McPherson, J.~Peto, and P.G.~Smith.
+%	\newblock Design and analysis of randomized clinical trials requiring prolonged observation of each patient.
+%	\newblock {\em British Journal of Cancer}, 35:1--39, 1979.
+%\end{itemize}
+
+%@book{collett2003:kaplanmeier,
+%	title={Modelling Survival Data in Medical Research, Second Edition},
+%	author={Collett, D.},
+%	isbn={9781584883258},
+%	lccn={2003040945},
+%	series={Chapman \& Hall/CRC Texts in Statistical Science},
+%	year={2003},
+%	publisher={Taylor \& Francis}
+%}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/alg-ref/Kmeans.tex
----------------------------------------------------------------------
diff --git a/alg-ref/Kmeans.tex b/alg-ref/Kmeans.tex
new file mode 100644
index 0000000..2b5492c
--- /dev/null
+++ b/alg-ref/Kmeans.tex
@@ -0,0 +1,371 @@
+\begin{comment}
+
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+\end{comment}
+
+\subsection{K-Means Clustering}
+
+\noindent{\bf Description}
+\smallskip
+
+Given a collection of $n$ records with a pairwise similarity measure,
+the goal of clustering is to assign a category label to each record so that
+similar records tend to get the same label.  In contrast to multinomial
+logistic regression, clustering is an \emph{unsupervised}\/ learning problem
+with neither category assignments nor label interpretations given in advance.
+In $k$-means clustering, the records $x_1, x_2, \ldots, x_n$ are numerical
+feature vectors of $\dim x_i = m$ with the squared Euclidean distance 
+$\|x_i - x_{i'}\|_2^2$ as the similarity measure.  We want to partition
+$\{x_1, \ldots, x_n\}$ into $k$ clusters $\{S_1, \ldots, S_k\}$ so that
+the aggregated squared distance from records to their cluster means is
+minimized:
+\begin{equation}
+\textrm{WCSS}\,\,=\,\, \sum_{i=1}^n \,\big\|x_i - \mean(S_j: x_i\in S_j)\big\|_2^2 \,\,\to\,\,\min
+\label{eqn:WCSS}
+\end{equation}
+The aggregated distance measure in~(\ref{eqn:WCSS}) is called the
+\emph{within-cluster sum of squares}~(WCSS).  It can be viewed as a measure
+of residual variance that remains in the data after the clustering assignment,
+conceptually similar to the residual sum of squares~(RSS) in linear regression.
+However, unlike for the RSS, the minimization of~(\ref{eqn:WCSS}) is an NP-hard 
+problem~\cite{AloiseDHP2009:kmeans}.
+
+Rather than searching for the global optimum in~(\ref{eqn:WCSS}), a heuristic algorithm
+called Lloyd's algorithm is typically used.  This iterative algorithm maintains
+and updates a set of $k$~\emph{centroids} $\{c_1, \ldots, c_k\}$, one centroid per cluster.
+It defines each cluster $S_j$ as the set of all records closer to~$c_j$ than
+to any other centroid.  Each iteration of the algorithm reduces the WCSS in two steps:
+\begin{Enumerate}
+\item Assign each record to the closest centroid, making $\mean(S_j)\neq c_j$;
+\label{step:kmeans:recluster}
+\item Reset each centroid to its cluster's mean: $c_j := \mean(S_j)$.
+\label{step:kmeans:recenter}
+\end{Enumerate}
+After Step~\ref{step:kmeans:recluster} the centroids are generally different from the cluster
+means, so we can compute another ``within-cluster sum of squares'' based on the centroids:
+\begin{equation}
+\textrm{WCSS\_C}\,\,=\,\, \sum_{i=1}^n \,\big\|x_i - \mathop{\textrm{centroid}}(S_j: x_i\in S_j)\big\|_2^2
+\label{eqn:WCSS:C}
+\end{equation}
+This WCSS\_C after Step~\ref{step:kmeans:recluster} is less than the means-based WCSS
+before Step~\ref{step:kmeans:recluster} (or equal if convergence achieved), and in
+Step~\ref{step:kmeans:recenter} the WCSS cannot exceed the WCSS\_C for \emph{the same}
+clustering; hence the WCSS reduction.
+
+Exact convergence is reached when each record becomes closer to its
+cluster's mean than to any other cluster's mean, so there are no more re-assignments
+and the centroids coincide with the means.  In practice, iterations may be stopped
+when the reduction in WCSS (or in WCSS\_C) falls below a minimum threshold, or upon
+reaching the maximum number of iterations.  The initialization of the centroids is also
+an important part of the algorithm.  The smallest WCSS obtained by the algorithm is not
+necessarily the global minimum and varies depending on the initial centroids.  We implement multiple
+parallel runs with different initial centroids and report the best result.
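+
+The following NumPy sketch (illustrative only; the actual implementation is the {\tt Kmeans.dml} script) shows one clustering run of Lloyd's algorithm with the relative-decrease stopping criterion described above:
+\begin{verbatim}
+import numpy as np
+
+def lloyd(X, C, max_iter=1000, eps=1e-6):
+    # X: n x m data matrix, C: k x m initial centroids, eps: relative tolerance
+    wcss_old = np.inf
+    labels, wcss = None, np.inf
+    for _ in range(max_iter):
+        # squared Euclidean distances from every record to every centroid
+        D = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
+        labels = D.argmin(axis=1)          # step 1: re-assign records
+        wcss = D.min(axis=1).sum()         # centroid-based WCSS_C
+        if wcss_old - wcss < eps * wcss:   # relative-decrease criterion
+            break
+        wcss_old = wcss
+        for j in range(C.shape[0]):        # step 2: re-center centroids
+            members = X[labels == j]
+            if len(members) > 0:           # guard against "runaway" centroids
+                C[j] = members.mean(axis=0)
+    return C, labels, wcss
+\end{verbatim}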
+
+\Paragraph{Scoring} 
+Our scoring script evaluates the clustering output by comparing it with a known category
+assignment.  Since cluster labels have no prior correspondence to the categories, we
+cannot count ``correct'' and ``wrong'' cluster assignments.  Instead, we quantify the agreement
+between the two assignments in two ways:
+\begin{Enumerate}
+\item Count how many same-category and different-category pairs of records end up in the
+same cluster or in different clusters;
+\item For each category, count the prevalence of its most common cluster; for each
+cluster, count the prevalence of its most common category.
+\end{Enumerate}
+The number of categories and the number of clusters ($k$) do not have to be equal.  
+A same-category pair of records clustered into the same cluster is viewed as a
+``true positive,'' a different-category pair clustered together is a ``false positive,''
+a same-category pair clustered apart is a ``false negative''~etc.
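+
+As a sketch of this pair counting (a hypothetical Python helper, not the {\tt Kmeans-predict.dml} script), where {\tt spY} holds the specified categories and {\tt prY} the predicted cluster labels:
+\begin{verbatim}
+from itertools import combinations
+
+def pair_counts(spY, prY):
+    # naive O(n^2) pass over all pairs of records
+    tp = fp = fn = tn = 0
+    for i, j in combinations(range(len(spY)), 2):
+        same_cat = spY[i] == spY[j]
+        same_clu = prY[i] == prY[j]
+        if same_cat and same_clu:
+            tp += 1  # same-category pair clustered together
+        elif not same_cat and same_clu:
+            fp += 1  # diff-category pair clustered together
+        elif same_cat and not same_clu:
+            fn += 1  # same-category pair clustered apart
+        else:
+            tn += 1  # diff-category pair clustered apart
+    return tp, fp, fn, tn
+\end{verbatim}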
+
+
+\smallskip
+\noindent{\bf Usage: K-means Script}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}Kmeans.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} C=}path/file
+{\tt{} k=}int
+{\tt{} runs=}int
+{\tt{} maxi=}int
+{\tt{} tol=}double
+{\tt{} samp=}int
+{\tt{} isY=}int
+{\tt{} Y=}path/file
+{\tt{} fmt=}format
+{\tt{} verb=}int
+
+}
+
+\smallskip
+\noindent{\bf Usage: K-means Scoring/Prediction}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}Kmeans-predict.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} C=}path/file
+{\tt{} spY=}path/file
+{\tt{} prY=}path/file
+{\tt{} fmt=}format
+{\tt{} O=}path/file
+
+}
+
+\smallskip
+\noindent{\bf Arguments}
+\begin{Description}
+\item[{\tt X}:]
+Location to read matrix $X$ with the input data records as rows
+\item[{\tt C}:] (default:\mbox{ }{\tt "C.mtx"})
+Location to store the output matrix with the best available cluster centroids as rows
+\item[{\tt k}:]
+Number of clusters (and centroids)
+\item[{\tt runs}:] (default:\mbox{ }{\tt 10})
+Number of parallel runs, each run with different initial centroids
+\item[{\tt maxi}:] (default:\mbox{ }{\tt 1000})
+Maximum number of iterations per run
+\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
+Tolerance (epsilon) for single-iteration WCSS\_C change ratio
+\item[{\tt samp}:] (default:\mbox{ }{\tt 50})
+Average number of records per centroid in data samples used in the centroid
+initialization procedure
+\item[{\tt Y}:] (default:\mbox{ }{\tt "Y.mtx"})
+Location to store the one-column matrix $Y$ with the best available mapping of
+records to clusters (defined by the output centroids)
+\item[{\tt isY}:] (default:\mbox{ }{\tt 0})
+{\tt 0} = do not write matrix~$Y$,  {\tt 1} = write~$Y$
+\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+see read/write functions in SystemML Language Reference for details.
+\item[{\tt verb}:] (default:\mbox{ }{\tt 0})
+{\tt 0} = do not print per-iteration statistics for each run, {\tt 1} = print them
+(the ``verbose'' option)
+\end{Description}
+\smallskip
+\noindent{\bf Arguments --- Scoring/Prediction}
+\begin{Description}
+\item[{\tt X}:] (default:\mbox{ }{\tt " "})
+Location to read matrix $X$ with the input data records as rows,
+optional when {\tt prY} input is provided
+\item[{\tt C}:] (default:\mbox{ }{\tt " "})
+Location to read matrix $C$ with cluster centroids as rows, optional
+when {\tt prY} input is provided; NOTE: if both {\tt X} and {\tt C} are
+provided, {\tt prY} is an output, not input
+\item[{\tt spY}:] (default:\mbox{ }{\tt " "})
+Location to read a one-column matrix with the externally specified ``true''
+assignment of records (rows) to categories, optional for prediction without
+scoring
+\item[{\tt prY}:] (default:\mbox{ }{\tt " "})
+Location to read (or write, if {\tt X} and {\tt C} are present) a
+column-vector with the predicted assignment of rows to clusters;
+NOTE: No prior correspondence is assumed between the predicted
+cluster labels and the externally specified categories
+\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+Matrix file output format for {\tt prY}, such as {\tt text}, {\tt mm},
+or {\tt csv}; see read/write functions in SystemML Language Reference
+for details
+\item[{\tt O}:] (default:\mbox{ }{\tt " "})
+Location to write the output statistics defined in 
+Table~\ref{table:kmeans:predict:stats}, by default print them to the
+standard output
+\end{Description}
+
+
+\begin{table}[t]\small\centerline{%
+\begin{tabular}{|lcl|}
+\hline
+Name & CID & Meaning \\
+\hline
+{\tt TSS}             &     & Total Sum of Squares (from the total mean) \\
+{\tt WCSS\_M}         &     & Within-Cluster  Sum of Squares (means as centers) \\
+{\tt WCSS\_M\_PC}     &     & Within-Cluster  Sum of Squares (means), in \% of TSS \\
+{\tt BCSS\_M}         &     & Between-Cluster Sum of Squares (means as centers) \\
+{\tt BCSS\_M\_PC}     &     & Between-Cluster Sum of Squares (means), in \% of TSS \\
+\hline
+{\tt WCSS\_C}         &     & Within-Cluster  Sum of Squares (centroids as centers) \\
+{\tt WCSS\_C\_PC}     &     & Within-Cluster  Sum of Squares (centroids), \% of TSS \\
+{\tt BCSS\_C}         &     & Between-Cluster Sum of Squares (centroids as centers) \\
+{\tt BCSS\_C\_PC}     &     & Between-Cluster Sum of Squares (centroids), \% of TSS \\
+\hline
+{\tt TRUE\_SAME\_CT}  &     & Same-category pairs predicted as Same-cluster, count \\
+{\tt TRUE\_SAME\_PC}  &     & Same-category pairs predicted as Same-cluster, \% \\
+{\tt TRUE\_DIFF\_CT}  &     & Diff-category pairs predicted as Diff-cluster, count \\
+{\tt TRUE\_DIFF\_PC}  &     & Diff-category pairs predicted as Diff-cluster, \% \\
+{\tt FALSE\_SAME\_CT} &     & Diff-category pairs predicted as Same-cluster, count \\
+{\tt FALSE\_SAME\_PC} &     & Diff-category pairs predicted as Same-cluster, \% \\
+{\tt FALSE\_DIFF\_CT} &     & Same-category pairs predicted as Diff-cluster, count \\
+{\tt FALSE\_DIFF\_PC} &     & Same-category pairs predicted as Diff-cluster, \% \\
+\hline
+{\tt SPEC\_TO\_PRED}  & $+$ & For specified category, the best predicted cluster id \\
+{\tt SPEC\_FULL\_CT}  & $+$ & For specified category, its full count \\
+{\tt SPEC\_MATCH\_CT} & $+$ & For specified category, best-cluster matching count \\
+{\tt SPEC\_MATCH\_PC} & $+$ & For specified category, \% of matching to full count \\
+{\tt PRED\_TO\_SPEC}  & $+$ & For predicted cluster, the best specified category id \\
+{\tt PRED\_FULL\_CT}  & $+$ & For predicted cluster, its full count \\
+{\tt PRED\_MATCH\_CT} & $+$ & For predicted cluster, best-category matching count \\
+{\tt PRED\_MATCH\_PC} & $+$ & For predicted cluster, \% of matching to full count \\
+\hline
+\end{tabular}}
+\caption{The {\tt O}-file for {\tt Kmeans-predict} provides the output statistics
+in CSV format, one per line, in the following format: (NAME, [CID], VALUE).  Note:
+the 1st group statistics are given if {\tt X} input is available;
+the 2nd group statistics are given if {\tt X} and {\tt C} inputs are available;
+the 3rd and 4th group statistics are given if {\tt spY} input is available;
+only the 4th group statistics contain a nonempty CID value;
+when present, CID contains either the specified category label or the
+predicted cluster label.}
+\label{table:kmeans:predict:stats}
+\end{table}
+
+
+\noindent{\bf Details}
+\smallskip
+
+Our clustering script proceeds in 3~stages: centroid initialization,
+parallel $k$-means iterations, and the best-available output generation.
+Centroids are initialized at random from the input records (the rows of~$X$),
+biased towards being chosen far apart from each other.  The initialization
+method is based on the {\tt k-means++} heuristic from~\cite{ArthurVassilvitskii2007:kmeans},
+with one important difference: to reduce the number of passes through~$X$,
+we take a small sample of $X$ and run the {\tt k-means++} heuristic over
+this sample.  Here is, conceptually, our centroid initialization algorithm
+for one clustering run:
+\begin{Enumerate}
+\item Sample the rows of~$X$ uniformly at random, picking each row with probability
+$p = ks / n$ where
+\begin{Itemize}
+\item $k$~is the number of centroids, 
+\item $n$~is the number of records, and
+\item $s$~is the {\tt samp} input parameter.
+\end{Itemize}
+If $ks \geq n$, the entire $X$ is used in place of its sample.
+\item Choose the first centroid uniformly at random from the sampled rows.
+\item Choose each subsequent centroid from the sampled rows, at random, with
+probability proportional to the squared Euclidean distance between the row and
+the nearest already-chosen centroid.
+\end{Enumerate}
+The sampling of $X$ and the selection of centroids are performed independently
+and in parallel for each run of the $k$-means algorithm.  When we sample the
+rows of~$X$, rather than tossing a random coin for each row, we compute the
+number of rows to skip until the next sampled row as $\lceil \log(u) / \log(1 - p) \rceil$
+where $u\in (0, 1)$ is uniformly random.  This time-saving trick works because
+\begin{equation*}
+\Prob [k-1 < \log_{1-p}(u) < k] \,\,=\,\, p(1-p)^{k-1} \,\,=\,\,
+\Prob [\textrm{skip $k-1$ rows}]
+\end{equation*}
+However, it requires us to estimate the maximum sample size, which we set
+near~$ks + 10\sqrt{ks}$ to make it generous enough.
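+
+A small Python sketch of this skipping trick (illustrative only; {\tt Kmeans.dml} expresses it in DML) is:
+\begin{verbatim}
+import numpy as np
+
+def sample_row_indices(n, p, rng=None):
+    # select each of the n rows independently with probability p by jumping
+    # ahead by a geometrically distributed gap instead of one coin toss per row
+    rng = rng or np.random.default_rng()
+    idx, i = [], -1
+    while True:
+        u = max(rng.uniform(), np.finfo(float).tiny)  # u in (0, 1)
+        i += int(np.ceil(np.log(u) / np.log(1.0 - p)))
+        if i >= n:
+            break
+        idx.append(i)
+    return idx
+\end{verbatim}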
+
+Once we have selected the initial centroid sets, we start the $k$-means iterations
+independently in parallel for all clustering runs.  The number of clustering runs
+is given as the {\tt runs} input parameter.  Each iteration of each clustering run
+performs the following steps:
+\begin{Itemize}
+\item Compute the centroid-dependent part of squared Euclidean distances from
+all records (rows of~$X$) to each of the $k$~centroids using matrix product;
+\item Take the minimum of the above for each record;
+\item Update the current within-cluster sum of squares (WCSS) value, with centroids
+substituted instead of the means for efficiency;
+\item Check the convergence criterion:\hfil
+$\textrm{WCSS}_{\mathrm{old}} - \textrm{WCSS}_{\mathrm{new}} < \eps\cdot\textrm{WCSS}_{\mathrm{new}}$\linebreak
+as well as the number of iterations limit;
+\item Find the closest centroid for each record, sharing equally any records with multiple
+closest centroids;
+\item Compute the number of records closest to each centroid, checking for ``runaway''
+centroids with no records left (in which case the run fails);
+\item Compute the new centroids by averaging the records in their clusters.
+\end{Itemize}
+When a termination condition is satisfied, we store the centroids and the WCSS value
+and exit this run.  A run has to satisfy the WCSS convergence criterion to be considered
+successful.  Upon the termination of all runs, we select the smallest WCSS value among
+the successful runs, and write out this run's centroids.  If requested, we also compute
+the cluster assignment of all records in~$X$, using integers from 1 to~$k$ as the cluster
+labels.  The scoring script can then be used to compare the cluster assignment with
+an externally specified category assignment.
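+
+The ``centroid-dependent part'' of the distances mentioned in the first step above can be illustrated with a short NumPy sketch (a hypothetical helper, not the DML code): since $\|x-c\|_2^2=\|x\|_2^2-2x^\top c+\|c\|_2^2$ and $\|x\|_2^2$ does not depend on the centroid, the closest centroid for each record can be found from $-2XC^\top$ plus the squared centroid norms.
+\begin{verbatim}
+import numpy as np
+
+def closest_centroids(X, C):
+    # drop the ||x||^2 term, which is constant across centroids
+    D_part = -2.0 * (X @ C.T) + (C ** 2).sum(axis=1)  # shape (n, k)
+    return D_part.argmin(axis=1)
+\end{verbatim}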
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+We output the $k$ centroids for the best available clustering, i.e., the one whose WCSS
+is the smallest of all successful runs.
+The centroids are written as the rows of the $k\,{\times}\,m$-matrix into the output
+file whose path/name was provided as the ``{\tt C}'' input argument.  If the input
+parameter ``{\tt isY}'' was set to~{\tt 1}, we also output the one-column matrix with
+the cluster assignment for all the records.  This assignment is written into the
+file whose path/name was provided as the ``{\tt Y}'' input argument.
+The best WCSS value, as well as some information about the performance of the other
+runs, is printed during the script execution.  The scoring script {\tt Kmeans-predict}
+prints all its results in a self-explanatory manner, as defined in
+Table~\ref{table:kmeans:predict:stats}.
+
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+\hml -f Kmeans.dml -nvargs X=/user/biadmin/X.mtx k=5 C=/user/biadmin/centroids.mtx fmt=csv
+
+}
+
+{\hangindent=\parindent\noindent\tt
+\hml -f Kmeans.dml -nvargs X=/user/biadmin/X.mtx k=5 runs=100 maxi=5000 
+tol=0.00000001 samp=20 C=/user/biadmin/centroids.mtx isY=1 Y=/user/biadmin/Yout.mtx verb=1
+
+}
+\noindent To predict {\tt Y} given {\tt X} and {\tt C}:
+
+{\hangindent=\parindent\noindent\tt
+\hml -f Kmeans-predict.dml -nvargs X=/user/biadmin/X.mtx
+         C=/user/biadmin/C.mtx prY=/user/biadmin/PredY.mtx O=/user/biadmin/stats.csv
+
+}
+\noindent To compare ``actual'' labels {\tt spY} with ``predicted'' labels given {\tt X} and {\tt C}:
+
+{\hangindent=\parindent\noindent\tt
+\hml -f Kmeans-predict.dml -nvargs X=/user/biadmin/X.mtx
+         C=/user/biadmin/C.mtx spY=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
+
+}
+\noindent To compare ``actual'' labels {\tt spY} with given ``predicted'' labels {\tt prY}:
+
+{\hangindent=\parindent\noindent\tt
+\hml -f Kmeans-predict.dml -nvargs spY=/user/biadmin/Y.mtx prY=/user/biadmin/PredY.mtx O=/user/biadmin/stats.csv
+
+}
+
+\smallskip
+\noindent{\bf References}
+\begin{itemize}
+\item
+D.~Aloise, A.~Deshpande, P.~Hansen, and P.~Popat.
+\newblock {NP}-hardness of {E}uclidean sum-of-squares clustering.
+\newblock {\em Machine Learning}, 75(2):245--248, May 2009.
+\item
+D.~Arthur and S.~Vassilvitskii.
+\newblock {\tt k-means++}: The advantages of careful seeding.
+\newblock In {\em Proceedings of the 18th Annual {ACM-SIAM} Symposium on
+  Discrete Algorithms ({SODA}~2007)}, pages 1027--1035, New Orleans~{LA},
+  {USA}, January 7--9 2007.
+\end{itemize}


[29/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1279] Decrease numCols to prevent spark codegen issue

Posted by de...@apache.org.
[SYSTEMML-1279] Decrease numCols to prevent spark codegen issue

Closes #395.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/bb97a4bc
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/bb97a4bc
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/bb97a4bc

Branch: refs/heads/gh-pages
Commit: bb97a4bc6213cf68eeea91097a71d1fd149c49ec
Parents: ba2819b
Author: Felix Schueler <fe...@ibm.com>
Authored: Thu Feb 16 16:13:14 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Thu Feb 16 16:13:14 2017 -0800

----------------------------------------------------------------------
 spark-mlcontext-programming-guide.md | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/bb97a4bc/spark-mlcontext-programming-guide.md
----------------------------------------------------------------------
diff --git a/spark-mlcontext-programming-guide.md b/spark-mlcontext-programming-guide.md
index e5df11f..c15c27f 100644
--- a/spark-mlcontext-programming-guide.md
+++ b/spark-mlcontext-programming-guide.md
@@ -124,7 +124,7 @@ None
 
 ## DataFrame Example
 
-For demonstration purposes, we'll use Spark to create a `DataFrame` called `df` of random `double`s from 0 to 1 consisting of 10,000 rows and 1,000 columns.
+For demonstration purposes, we'll use Spark to create a `DataFrame` called `df` of random `double`s from 0 to 1 consisting of 10,000 rows and 100 columns.
 
 <div class="codetabs">
 
@@ -134,7 +134,7 @@ import org.apache.spark.sql._
 import org.apache.spark.sql.types.{StructType,StructField,DoubleType}
 import scala.util.Random
 val numRows = 10000
-val numCols = 1000
+val numCols = 100
 val data = sc.parallelize(0 to numRows-1).map { _ => Row.fromSeq(Seq.fill(numCols)(Random.nextDouble)) }
 val schema = StructType((0 to numCols-1).map { i => StructField("C" + i, DoubleType, true) } )
 val df = spark.createDataFrame(data, schema)
@@ -155,8 +155,8 @@ import scala.util.Random
 scala> val numRows = 10000
 numRows: Int = 10000
 
-scala> val numCols = 1000
-numCols: Int = 1000
+scala> val numCols = 100
+numCols: Int = 100
 
 scala> val data = sc.parallelize(0 to numRows-1).map { _ => Row.fromSeq(Seq.fill(numCols)(Random.nextDouble)) }
 data: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[1] at map at <console>:42
@@ -175,7 +175,7 @@ df: org.apache.spark.sql.DataFrame = [C0: double, C1: double, C2: double, C3: do
 We'll create a DML script to find the minimum, maximum, and mean values in a matrix. This
 script has one input variable, matrix `Xin`, and three output variables, `minOut`, `maxOut`, and `meanOut`.
 
-For performance, we'll specify metadata indicating that the matrix has 10,000 rows and 1,000 columns.
+For performance, we'll specify metadata indicating that the matrix has 10,000 rows and 100 columns.
 
 We'll create a DML script using the ScriptFactory `dml` method with the `minMaxMean` script String. The
 input variable is specified to be our `DataFrame` `df` with `MatrixMetadata` `mm`. The output
@@ -218,7 +218,7 @@ meanOut = mean(Xin)
 "
 
 scala> val mm = new MatrixMetadata(numRows, numCols)
-mm: org.apache.sysml.api.mlcontext.MatrixMetadata = rows: 10000, columns: 1000, non-zeros: None, rows per block: None, columns per block: None
+mm: org.apache.sysml.api.mlcontext.MatrixMetadata = rows: 10000, columns: 100, non-zeros: None, rows per block: None, columns per block: None
 
 scala> val minMaxMeanScript = dml(minMaxMean).in("Xin", df, mm).out("minOut", "maxOut", "meanOut")
 minMaxMeanScript: org.apache.sysml.api.mlcontext.Script =
@@ -929,7 +929,7 @@ Symbol Table:
   [1] (Double) meanOut: 0.5000954668004209
   [2] (Double) maxOut: 0.9999999956646207
   [3] (Double) minOut: 1.4149740823476975E-7
-  [4] (Matrix) Xin: Matrix: scratch_space/temp_1166464711339222, [10000 x 1000, nnz=10000000, blocks (1000 x 1000)], binaryblock, not-dirty
+  [4] (Matrix) Xin: Matrix: scratch_space/temp_1166464711339222, [10000 x 100, nnz=1000000, blocks (1000 x 1000)], binaryblock, not-dirty
 
 Script String:
 
@@ -980,7 +980,7 @@ Symbol Table:
   [1] (Double) meanOut: 0.5000954668004209
   [2] (Double) maxOut: 0.9999999956646207
   [3] (Double) minOut: 1.4149740823476975E-7
-  [4] (Matrix) Xin: Matrix: scratch_space/temp_1166464711339222, [10000 x 1000, nnz=10000000, blocks (1000 x 1000)], binaryblock, not-dirty
+  [4] (Matrix) Xin: Matrix: scratch_space/temp_1166464711339222, [10000 x 100, nnz=1000000, blocks (1000 x 1000)], binaryblock, not-dirty
 
 scala> minMaxMeanScript.clearAll
 
@@ -1129,7 +1129,7 @@ meanOut = mean(Xin)
 "
 
 scala> val mm = new MatrixMetadata(numRows, numCols)
-mm: org.apache.sysml.api.mlcontext.MatrixMetadata = rows: 10000, columns: 1000, non-zeros: None, rows per block: None, columns per block: None
+mm: org.apache.sysml.api.mlcontext.MatrixMetadata = rows: 10000, columns: 100, non-zeros: None, rows per block: None, columns per block: None
 
 scala> val minMaxMeanScript = dml(minMaxMean).in("Xin", df, mm).out("minOut", "maxOut", "meanOut")
 minMaxMeanScript: org.apache.sysml.api.mlcontext.Script =
@@ -1147,7 +1147,7 @@ scala> val (min, max, mean) = ml.execute(minMaxMeanScript).getTuple[Double, Doub
 PROGRAM
 --MAIN PROGRAM
 ----GENERIC (lines 1-8) [recompile=false]
-------(12) TRead Xin [10000,1000,1000,1000,10000000] [0,0,76 -> 76MB] [chkpt], CP
+------(12) TRead Xin [10000,100,1000,1000,1000000] [0,0,76 -> 76MB] [chkpt], CP
 ------(13) ua(minRC) (12) [0,0,-1,-1,-1] [76,0,0 -> 76MB], CP
 ------(21) TWrite minOut (13) [0,0,-1,-1,-1] [0,0,0 -> 0MB], CP
 ------(14) ua(maxRC) (12) [0,0,-1,-1,-1] [76,0,0 -> 76MB], CP
@@ -1523,7 +1523,7 @@ There are currently two mechanisms for this in SystemML: **(1) BinaryBlockMatrix
 If you have an input DataFrame, it can be converted to a BinaryBlockMatrix, and this BinaryBlockMatrix
 can be passed as an input rather than passing in the DataFrame as an input.
 
-For example, suppose we had a 10000x1000 matrix represented as a DataFrame, as we saw in an earlier example.
+For example, suppose we had a 10000x100 matrix represented as a DataFrame, as we saw in an earlier example.
 Now suppose we create two Script objects with the DataFrame as an input, as shown below. In the Spark Shell,
 when executing this code, you can see that each of the two Script object creations requires the
 time-consuming data conversion step.
@@ -1533,7 +1533,7 @@ import org.apache.spark.sql._
 import org.apache.spark.sql.types.{StructType,StructField,DoubleType}
 import scala.util.Random
 val numRows = 10000
-val numCols = 1000
+val numCols = 100
 val data = sc.parallelize(0 to numRows-1).map { _ => Row.fromSeq(Seq.fill(numCols)(Random.nextDouble)) }
 val schema = StructType((0 to numCols-1).map { i => StructField("C" + i, DoubleType, true) } )
 val df = spark.createDataFrame(data, schema)
@@ -1554,7 +1554,7 @@ import org.apache.spark.sql._
 import org.apache.spark.sql.types.{StructType,StructField,DoubleType}
 import scala.util.Random
 val numRows = 10000
-val numCols = 1000
+val numCols = 100
 val data = sc.parallelize(0 to numRows-1).map { _ => Row.fromSeq(Seq.fill(numCols)(Random.nextDouble)) }
 val schema = StructType((0 to numCols-1).map { i => StructField("C" + i, DoubleType, true) } )
 val df = spark.createDataFrame(data, schema)


[14/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1208] Mimic main website styles in project documentation

Posted by de...@apache.org.
[SYSTEMML-1208] Mimic main website styles in project documentation

Update project documentation styles to be similar to main website styles.
Minor case corrections.

Closes #366.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/4b899f26
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/4b899f26
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/4b899f26

Branch: refs/heads/gh-pages
Commit: 4b899f26c72f329e064b4c433d75c0e8628e368f
Parents: 20e46a8
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Sat Jan 28 12:24:58 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Sat Jan 28 12:24:59 2017 -0800

----------------------------------------------------------------------
 _layouts/global.html        |  4 +-
 beginners-guide-python.md   |  4 +-
 css/main.css                | 79 +++++++++++++++++++++++++++++++++++-----
 developer-tools-systemml.md | 12 ++++--
 index.md                    | 22 +++++------
 python-reference.md         |  4 +-
 6 files changed, 94 insertions(+), 31 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/4b899f26/_layouts/global.html
----------------------------------------------------------------------
diff --git a/_layouts/global.html b/_layouts/global.html
index 1aa5296..9e668a0 100644
--- a/_layouts/global.html
+++ b/_layouts/global.html
@@ -56,8 +56,8 @@
                                 <li><b>Language Guides:</b></li>
                                 <li><a href="dml-language-reference.html">DML Language Reference</a></li>
                                 <li><a href="beginners-guide-to-dml-and-pydml.html">Beginner's Guide to DML and PyDML</a></li>
-                                <li><a href="beginners-guide-python.html">Beginner's Guide for Python users</a></li>
-                                <li><a href="python-reference.html">Reference Guide for Python users</a></li>
+                                <li><a href="beginners-guide-python.html">Beginner's Guide for Python Users</a></li>
+                                <li><a href="python-reference.html">Reference Guide for Python Users</a></li>
                                 <li class="divider"></li>
                                 <li><b>ML Algorithms:</b></li>
                                 <li><a href="algorithms-reference.html">Algorithms Reference</a></li>

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/4b899f26/beginners-guide-python.md
----------------------------------------------------------------------
diff --git a/beginners-guide-python.md b/beginners-guide-python.md
index 8bd957a..8a05ca6 100644
--- a/beginners-guide-python.md
+++ b/beginners-guide-python.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Beginner's Guide for Python users
-description: Beginner's Guide for Python users
+title: Beginner's Guide for Python Users
+description: Beginner's Guide for Python Users
 ---
 <!--
 {% comment %}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/4b899f26/css/main.css
----------------------------------------------------------------------
diff --git a/css/main.css b/css/main.css
index 661d9e6..8a7426b 100644
--- a/css/main.css
+++ b/css/main.css
@@ -6,6 +6,9 @@
 body {
   padding-top: 51px !important;
   -moz-osx-font-smoothing: grayscale;
+  -webkit-font-smoothing: antialiased;
+  font-size: 16px;
+  background-color: #f9f9fb;
 }
 
 body #content {
@@ -18,25 +21,46 @@ pre {
 
 code {
   font-family: "Menlo", "Lucida Console", monospace;
-  background: white;
   border: none;
   padding: 0;
   color: #444444;
+  background-color: #f9f9fb;
+  font-size: 85%;
 }
 
 a code {
-  color: #0088cc;
+  color: #FF7032;
 }
 
 a:hover code {
-  color: #005580;
-  text-decoration: underline;
+  color: #d74108;
+  text-decoration: none;
+}
+
+.navbar .projecttitle > a:hover, .navbar .projecttitle > a:focus {
+  color: #FFF
 }
 
 .container {
   max-width: 914px;
 }
 
+#content {
+  margin-top: 30px;
+  margin-bottom: 50px;
+  color: #363f3f;
+}
+
+h1, h2, h3, h4, h5, h6 {
+  font-size: 2em;
+  line-height: 1.3em;
+  font-weight: 700;
+  margin-bottom: 0.5em;
+}
+
+pre {
+  background-color: #FFF
+}
 /* Branding */
 .brand {
   font-weight: normal !important;
@@ -57,6 +81,27 @@ img.logo {
 /* Navigation Bar */
 .navbar {
   background-color: rgba(0, 0, 0, 0.9);
+  height: 68px;
+}
+
+.navbar-brand {
+  font-size: 20px;
+}
+
+.navbar-brand.brand.projecttitle {
+  padding-top: 7px;
+}
+
+.navbar-right {
+  height: 100%;
+}
+
+.navbar-collapse.collapse {
+  height: 67px !important;
+}
+
+.navbar-header {
+  padding-top: 10px;
 }
 
 .navbar .container {
@@ -126,9 +171,6 @@ img.logo {
  */
 a.anchorjs-link:hover { text-decoration: none; }
 
-/**
- *SystemML additions
- */
 table td, table th {
   border: 1px solid #333;
   padding: 0 .5em;
@@ -170,6 +212,8 @@ table {
   font-style: normal;
   text-shadow: none;
   color: #FFF;
+  height: 100% !important;
+  padding-top: 25px;
 }
 
 .navbar a:hover {
@@ -177,10 +221,14 @@ table {
 }
 
 .navbar .nav > li > a:focus, .navbar .nav > li > a:hover {
-  background-color: #0c8672;
+  background-color: #ff5003;
   color: #FFF;
 }
 
+.navbar .nav > li {
+  height: 100%;
+}
+
 .dropdown-menu a {
   color: #FFF !important;
 }
@@ -191,12 +239,18 @@ table {
 }
 
 .dropdown-menu li > a:focus, .dropdown-menu li > a:hover {
-  background-color: #0c8672;
+  background-color: #ff5003;
   background-image: none;
+  font-color: red;
 }
 
 a {
-  color: #0c8672;
+  color: #FF7032;
+}
+
+a:focus, a:hover {
+  color: #d74108;
+  text-decoration: none;
 }
 
 #trademark {
@@ -227,6 +281,11 @@ a {
     color: #333 !important;
   }
 
+  .dropdown-menu a:hover {
+    color: #FFF !important;
+    transition: all 0.2s ease-in-out 0s;
+  }
+
   .dropdown-menu b {
     color: #333 !important;
   }

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/4b899f26/developer-tools-systemml.md
----------------------------------------------------------------------
diff --git a/developer-tools-systemml.md b/developer-tools-systemml.md
index f50d7f1..f37c5b5 100644
--- a/developer-tools-systemml.md
+++ b/developer-tools-systemml.md
@@ -82,14 +82,18 @@ The lifecycle mappings are stored in a workspace metadata file as specified in E
 
 Please see below tips for resolving some compilation issues that might occur after importing the SystemML project.
 
-##### `Invalid cross-compiled libraries` error
+### Invalid cross-compiled libraries error
+
 Since Scala IDE bundles the latest versions (2.10.5 and 2.11.6 at this point), you need to add one in Eclipse Preferences -> Scala -> Installations by pointing to the <code>lib</code> directory of your Scala 2.10.4 distribution. Once this is done, select SystemML project, right-click, choose Scala -> Set Scala Installation and point to the 2.10.4 installation. This should clear all errors about invalid cross-compiled libraries. A clean build should succeed now.
 
-##### `Incompatible scala version ` error
+### Incompatible Scala version error
+
 Change IDE Scala version `Project->Properties->Scala Compiler -> Scala Installation`  to   `Fixed Scala Installation: 2.10.5`
 
-##### `Not found type * ` error
+### Not found type error
+
 Run command `mvn package`, and do `Project -> Refresh`
 
-##### `Marketplace not found ` error for Eclipse Luna
+### Marketplace not found error for Eclipse Luna
+
 Except for Scala IDE plugin install, please make sure to get update from "http://alchim31.free.fr/m2e-scala/update-site" to update maven connector for Scala.

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/4b899f26/index.md
----------------------------------------------------------------------
diff --git a/index.md b/index.md
index fe8361a..add9a26 100644
--- a/index.md
+++ b/index.md
@@ -23,7 +23,7 @@ limitations under the License.
 {% endcomment %}
 -->
 
-SystemML is now an **Apache Incubator** project! Please see the [**Apache SystemML (incubating)**](http://systemml.apache.org/)
+SystemML is now an **Apache Incubator** project! Please see the [**Apache SystemML**](http://systemml.apache.org/)
 website for more information.
 
 SystemML is a flexible, scalable machine learning system.
@@ -33,8 +33,8 @@ SystemML's distinguishing characteristics are:
   2. **Multiple execution modes**, including Spark MLContext, Spark Batch, Hadoop Batch, Standalone, and JMLC.
   3. **Automatic optimization** based on data and cluster characteristics to ensure both efficiency and scalability.
 
-The [**SystemML GitHub README**](https://github.com/apache/incubator-systemml) describes
-building, testing, and running SystemML. Please read [**Contributing to SystemML**](contributing-to-systemml)
+The [SystemML GitHub README](https://github.com/apache/incubator-systemml) describes
+building, testing, and running SystemML. Please read [Contributing to SystemML](contributing-to-systemml)
 to find out how to help make SystemML even better!
 
 To download SystemML, visit the [downloads](http://systemml.apache.org/download) page.
@@ -42,20 +42,20 @@ To download SystemML, visit the [downloads](http://systemml.apache.org/download)
 
 ## Running SystemML
 
-* **[Beginner's Guide For Python Users](beginners-guide-python)** - Beginner's Guide for Python users.
-* **[Spark MLContext](spark-mlcontext-programming-guide)** - Spark MLContext is a programmatic API
+* [Beginner's Guide For Python Users](beginners-guide-python) - Beginner's Guide for Python users.
+* [Spark MLContext](spark-mlcontext-programming-guide) - Spark MLContext is a programmatic API
 for running SystemML from Spark via Scala, Python, or Java.
-  * [**Spark Shell Example (Scala)**](spark-mlcontext-programming-guide#spark-shell-example)
-  * [**Jupyter Notebook Example (PySpark)**](spark-mlcontext-programming-guide#jupyter-pyspark-notebook-example---poisson-nonnegative-matrix-factorization)
-* **[Spark Batch](spark-batch-mode)** - Algorithms are automatically optimized to run across Spark clusters.
+  * [Spark Shell Example (Scala)](spark-mlcontext-programming-guide#spark-shell-example)
+  * [Jupyter Notebook Example (PySpark)](spark-mlcontext-programming-guide#jupyter-pyspark-notebook-example---poisson-nonnegative-matrix-factorization)
+* [Spark Batch](spark-batch-mode) - Algorithms are automatically optimized to run across Spark clusters.
   * See [Invoking SystemML in Spark Batch Mode](spark-batch-mode) for detailed information.
-* **[Hadoop Batch](hadoop-batch-mode)** - Algorithms are automatically optimized when distributed across Hadoop clusters.
+* [Hadoop Batch](hadoop-batch-mode) - Algorithms are automatically optimized when distributed across Hadoop clusters.
   * See [Invoking SystemML in Hadoop Batch Mode](hadoop-batch-mode) for detailed information.
-* **[Standalone](standalone-guide)** - Standalone mode allows data scientists to rapidly prototype algorithms on a single
+* [Standalone](standalone-guide) - Standalone mode allows data scientists to rapidly prototype algorithms on a single
 machine in R-like and Python-like declarative languages.
   * The [Standalone Guide](standalone-guide) provides examples of algorithm execution
   in Standalone Mode.
-* **[JMLC](jmlc)** - Java Machine Learning Connector.
+* [JMLC](jmlc) - Java Machine Learning Connector.
   * See [Java Machine Learning Connector (JMLC)](jmlc) for more information.
 
 ## Language Guides

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/4b899f26/python-reference.md
----------------------------------------------------------------------
diff --git a/python-reference.md b/python-reference.md
index 3c2bbc3..65dcb5c 100644
--- a/python-reference.md
+++ b/python-reference.md
@@ -1,7 +1,7 @@
 ---
 layout: global
-title: Reference Guide for Python users
-description: Reference Guide for Python users
+title: Reference Guide for Python Users
+description: Reference Guide for Python Users
 ---
 <!--
 {% comment %}


[49/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1445] Add support for matrix-vector GPU axpy operation

Posted by de...@apache.org.
[SYSTEMML-1445] Add support for matrix-vector GPU axpy operation

Closes #445.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/d1fa154e
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/d1fa154e
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/d1fa154e

Branch: refs/heads/gh-pages
Commit: d1fa154e28bfe0f75d5a03db4b661045a9eea92a
Parents: a1d73f8
Author: Niketan Pansare <np...@us.ibm.com>
Authored: Fri Mar 31 17:14:11 2017 -0700
Committer: Niketan Pansare <np...@us.ibm.com>
Committed: Fri Mar 31 17:14:11 2017 -0700

----------------------------------------------------------------------
 beginners-guide-python.md | 5 -----
 1 file changed, 5 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/d1fa154e/beginners-guide-python.md
----------------------------------------------------------------------
diff --git a/beginners-guide-python.md b/beginners-guide-python.md
index 24f7151..9beba19 100644
--- a/beginners-guide-python.md
+++ b/beginners-guide-python.md
@@ -250,8 +250,6 @@ algorithm on digits datasets.
 # Scikit-learn way
 from sklearn import datasets
 from systemml.mllearn import LogisticRegression
-from pyspark.sql import SQLContext
-sqlCtx = SQLContext(sc)
 digits = datasets.load_digits()
 X_digits = digits.data
 y_digits = digits.target 
@@ -281,7 +279,6 @@ from pyspark.sql import SQLContext
 import pandas as pd
 from sklearn.metrics import accuracy_score
 import systemml as sml
-sqlCtx = SQLContext(sc)
 digits = datasets.load_digits()
 X_digits = digits.data
 y_digits = digits.target
@@ -314,7 +311,6 @@ from pyspark.ml import Pipeline
 from systemml.mllearn import LogisticRegression
 from pyspark.ml.feature import HashingTF, Tokenizer
 from pyspark.sql import SQLContext
-sqlCtx = SQLContext(sc)
 training = sqlCtx.createDataFrame([
     (0, "a b c d e spark", 1.0),
     (1, "b d", 2.0),
@@ -368,7 +364,6 @@ from sklearn import datasets
 from pyspark.sql import SQLContext
 import systemml as sml
 import pandas as pd
-sqlCtx = SQLContext(sc)
 digits = datasets.load_digits()
 X_digits = digits.data
 y_digits = digits.target + 1


[13/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1206][SYSTEMML-1207][SYSTEMML-1210] Logo, favicon, trademark for docs

Posted by de...@apache.org.
[SYSTEMML-1206][SYSTEMML-1207][SYSTEMML-1210] Logo, favicon, trademark for docs

Update project documentation logo to new logo from main website.
Update favicon.png to new favicon.png from main website.
Add trademark symbol to SystemML in header.
Modify styles, mostly for header.

Closes #365.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/20e46a8e
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/20e46a8e
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/20e46a8e

Branch: refs/heads/gh-pages
Commit: 20e46a8ec69b450e495517db74d7d9868fe71ab6
Parents: f802be0
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Thu Jan 26 17:02:42 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Thu Jan 26 17:02:42 2017 -0800

----------------------------------------------------------------------
 _layouts/global.html  |   2 +-
 css/main.css          |  21 +++++++++++++++------
 img/favicon.png       | Bin 2774 -> 461 bytes
 img/systemml-logo.png | Bin 40071 -> 982 bytes
 4 files changed, 16 insertions(+), 7 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/20e46a8e/_layouts/global.html
----------------------------------------------------------------------
diff --git a/_layouts/global.html b/_layouts/global.html
index f7cb969..1aa5296 100644
--- a/_layouts/global.html
+++ b/_layouts/global.html
@@ -28,7 +28,7 @@
                         <img class="logo" src="img/systemml-logo.png" alt="Apache SystemML (incubating)" title="Apache SystemML (incubating)"/>
                     </div>
                     <div class="navbar-brand brand projecttitle">
-                        <a href="index.html">Apache SystemML (incubating)</a><br/>
+                        <a href="index.html">Apache SystemML<sup id="trademark">\u2122</sup> (incubating)</a><br/>
                         <span class="version">{{site.SYSTEMML_VERSION}}</span>
                     </div>
                     <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target=".navbar-collapse">

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/20e46a8e/css/main.css
----------------------------------------------------------------------
diff --git a/css/main.css b/css/main.css
index 27bfe0d..661d9e6 100644
--- a/css/main.css
+++ b/css/main.css
@@ -5,6 +5,7 @@
 /* Overall */
 body {
   padding-top: 51px !important;
+  -moz-osx-font-smoothing: grayscale;
 }
 
 body #content {
@@ -38,25 +39,27 @@ a:hover code {
 
 /* Branding */
 .brand {
-  font-weight: bold !important;
+  font-weight: normal !important;
   padding-top: 0px;
   padding-bottom: 0px;
   max-width: 75%;
 }
 
 img.logo {
-  height: 100%;
-  margin-right: 0.2em;
+  height: 31px;
+  width: 32px;
+  margin-right: 10px;
   display: none;
+  margin-top: 10px;
+  margin-left: 10px;
 }
 
 /* Navigation Bar */
 .navbar {
-  background-color: #152935;
+  background-color: rgba(0, 0, 0, 0.9);
 }
 
 .navbar .container {
-  background-color: #152935;
   background-image: none;
 }
 
@@ -79,7 +82,7 @@ img.logo {
 }
 
 .navbar .projecttitle {
-  margin-top: 10px;
+  margin-top: 7px;
   height: 40px;
   white-space: nowrap;
 }
@@ -196,6 +199,12 @@ a {
   color: #0c8672;
 }
 
+#trademark {
+    font-size: 0.5em;
+    font-weight: 300;
+    vertical-align: middle;
+}
+
 /* Media queries */
 @media only screen and (max-device-width: 768px) and (orientation : landscape) {
   /* landscape mobile */

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/20e46a8e/img/favicon.png
----------------------------------------------------------------------
diff --git a/img/favicon.png b/img/favicon.png
index 2388972..c5311b9 100644
Binary files a/img/favicon.png and b/img/favicon.png differ

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/20e46a8e/img/systemml-logo.png
----------------------------------------------------------------------
diff --git a/img/systemml-logo.png b/img/systemml-logo.png
index 87ae161..85e07bf 100644
Binary files a/img/systemml-logo.png and b/img/systemml-logo.png differ


[08/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1185] SystemML Breast Cancer Project

Posted by de...@apache.org.
[SYSTEMML-1185] SystemML Breast Cancer Project

This is the initial commit of the SystemML breast cancer project!

Please reference the attached `README.md` for an overview, background
information, goals, our approach, etc.

At a high level, this PR introduces the following new files/folders:
* `README.md`: Project information, etc.
* `Preprocessing.ipynb`: PySpark notebook for preprocessing our
histopathology slides into an appropriate `DataFrame` for consumption by
SystemML.
* `MachineLearning.ipynb`: PySpark/SystemML notebook for our machine
learning approach thus far.  We started simple, and are currently in
need of engine improvements in order to proceed forward.
* `softmax_clf.dml`: Basic softmax model (multiclass logistic regression
with normalized probabilities) as a sanity check.
* `convnet.dml`: Our current deep convnet model.  We are starting simple
with a slightly extended "LeNet"-like network architecture.  The goal
will be to improve engine performance so that this model can be
efficiently trained, and then move on to larger, more recent types of
model architectures.
* `hyperparam_tuning.dml`: A separate script for performing a
hyperparameter search for our current convnet model.  This has been
extracted from the notebook as the current `parfor` engine
implementation is not yet sufficient for this type of necessary job.
* `data`: A placeholder folder into which the data could be downloaded.
* `nn`: A softlink that will point to the SystemML-NN library.
* `approach.svg`: Image of our overall pipeline used in `README.md`.

Overall, this project aim to serve as a large-scale, end-to-end
machine learning project that can drive necessary core improvements for
SystemML.

Closes #347


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/cc6f3c7e
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/cc6f3c7e
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/cc6f3c7e

Branch: refs/heads/gh-pages
Commit: cc6f3c7ea934e19bca13f0359cfe3fa63398dbe0
Parents: 94cf7c1
Author: Mike Dusenberry <mw...@us.ibm.com>
Authored: Fri Jan 20 12:02:55 2017 -0800
Committer: Mike Dusenberry <mw...@us.ibm.com>
Committed: Fri Jan 20 12:02:55 2017 -0800

----------------------------------------------------------------------
 img/projects/breast_cancer/approach.svg | 4 ++++
 1 file changed, 4 insertions(+)
----------------------------------------------------------------------



[11/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1190] Cleanup Scala UDF invocation

Posted by de...@apache.org.
[SYSTEMML-1190] Cleanup Scala UDF invocation

Closes #357.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/41cb5d7b
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/41cb5d7b
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/41cb5d7b

Branch: refs/heads/gh-pages
Commit: 41cb5d7b99baa11577747e62e35c5cb131155f90
Parents: 45fab15
Author: Niketan Pansare <np...@us.ibm.com>
Authored: Tue Jan 24 17:02:38 2017 -0800
Committer: Niketan Pansare <np...@us.ibm.com>
Committed: Tue Jan 24 17:02:38 2017 -0800

----------------------------------------------------------------------
 spark-mlcontext-programming-guide.md | 39 -------------------------------
 1 file changed, 39 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/41cb5d7b/spark-mlcontext-programming-guide.md
----------------------------------------------------------------------
diff --git a/spark-mlcontext-programming-guide.md b/spark-mlcontext-programming-guide.md
index 759d392..dcaa125 100644
--- a/spark-mlcontext-programming-guide.md
+++ b/spark-mlcontext-programming-guide.md
@@ -1636,45 +1636,6 @@ scala> for (i <- 1 to 5) {
 
 </div>
 
-## Passing Scala UDF to SystemML
-
-SystemML allows users to pass a Scala UDF (with input/output types supported by SystemML)
-to the DML script via MLContext. The restrictions for the supported Scala UDFs are as follows:
-
-1. Only types specified by the DML language are supported for parameters and return types (i.e. Int, Double, Boolean, String, double[][]).
-2. At minimum, the function should have 1 argument and 1 return value.
-3. At max, the function can have 10 arguments and 10 return values. 
-
-{% highlight scala %}
-import org.apache.sysml.api.mlcontext._
-import org.apache.sysml.api.mlcontext.ScriptFactory._
-val ml = new MLContext(sc)
-
-// Demonstrates how to pass a simple scala UDF to SystemML
-def addOne(x:Double):Double = x + 1
-ml.udf.register("addOne", addOne _)
-val script1 = dml("v = addOne(2.0); print(v)")
-ml.execute(script1)
-
-// Demonstrates operation on local matrices (double[][])
-def addOneToDiagonal(x:Array[Array[Double]]):Array[Array[Double]] = {  for(i <- 0 to x.length-1) x(i)(i) = x(i)(i) + 1; x }
-ml.udf.register("addOneToDiagonal", addOneToDiagonal _)
-val script2 = dml("m1 = matrix(0, rows=3, cols=3); m2 = addOneToDiagonal(m1); print(toString(m2));")
-ml.execute(script2)
-
-// Demonstrates multi-return function
-def multiReturnFn(x:Double):(Double, Int) = (x + 1, (x * 2).toInt)
-ml.udf.register("multiReturnFn", multiReturnFn _)
-val script3 = dml("[v1, v2] = multiReturnFn(2.0); print(v1)")
-ml.execute(script3)
-
-// Demonstrates multi-argument multi-return function
-def multiArgReturnFn(x:Double, y:Int):(Double, Int) = (x + 1, (x * y).toInt)
-ml.udf.register("multiArgReturnFn", multiArgReturnFn _)
-val script4 = dml("[v1, v2] = multiArgReturnFn(2.0, 1); print(v2)")
-ml.execute(script4)
-{% endhighlight %}
-
 ---
 
 # Jupyter (PySpark) Notebook Example - Poisson Nonnegative Matrix Factorization


[47/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1397] Describe process for deploying versioned docs

Posted by de...@apache.org.
[SYSTEMML-1397] Describe process for deploying versioned docs


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/7407b700
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/7407b700
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/7407b700

Branch: refs/heads/gh-pages
Commit: 7407b7001ef1835cdb6a5f70ebcce5fa901fa12a
Parents: bd23224
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Mon Mar 13 15:18:22 2017 -0700
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Mon Mar 13 15:18:22 2017 -0700

----------------------------------------------------------------------
 release-process.md | 99 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 80 insertions(+), 19 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/7407b700/release-process.md
----------------------------------------------------------------------
diff --git a/release-process.md b/release-process.md
index a75a281..d64d417 100644
--- a/release-process.md
+++ b/release-process.md
@@ -33,23 +33,6 @@ To be written. (Describe how the release candidate is built, including checksums
 the release candidate is deployed to servers for review.)
 
 
-## Release Documentation
-
-The `SYSTEMML_VERSION` value in docs/_config.yml should be updated to the correct release version. The documentation
-site should be built.
-The SystemML documentation site should be deployed to a docs version folder within the main website project (using
-svn). As an example, the documentation site for SystemML version 0.11.0 should be available
-at http://systemml.apache.org/docs/0.11.0.
-
-The Javadocs should be generated for the project and should be deployed to a docs version folder, such as
-http://systemml.apache.org/docs/0.11.0/api/java. Any other docs, such as Scaladocs if they are available, should
-be deployed to corresponding locations. Note that the version number specified in the Javadocs is determined by the project
-version number in the project pom.xml file.
-
-Additionally, the Javadocs should be deployed to http://systemml.apache.org/docs/latest/api/java
-if the Javadocs have not already been deployed to this location.
-
-
 # Release Candidate Checklist
 
 ## All Artifacts and Checksums Present
@@ -291,5 +274,83 @@ has been approved.
 
 ## Release Deployment
 
-To be written. (What steps need to be done? How is the release deployed to the central maven repo? What updates need to
-happen to the main website, such as updating the Downloads page? Where do the release notes for the release go?)
+To be written. (What steps need to be done? How is the release deployed to Apache dist and the central maven repo?
+Where do the release notes for the release go?)
+
+
+## Documentation Deployment
+
+This section describes how to deploy versioned project documentation to the main website.
+Note that versioned project documentation is committed directly to the `svn` project's `docs` folder.
+The versioned project documentation is not committed to the website's `git` project.
+
+Checkout branch in main project (`incubator-systemml`).
+
+	$ git checkout branch-0.13.0
+
+In `incubator-systemml/docs/_config.yml`, set:
+
+* `SYSTEMML_VERSION` to project version (0.13.0)
+* `FEEDBACK_LINKS` to `false` (only have feedback links on `LATEST` docs)
+* `API_DOCS_MENU` to `true` (adds `API Docs` menu to get to project javadocs)
+
+Generate `docs/_site` by running `bundle exec jekyll serve` in `incubator-systemml/docs`.
+
+	$ bundle exec jekyll serve
+
+Verify documentation site looks correct.
+
+In website `svn` project, create `incubator-systemml-website-site/docs/0.13.0` folder.
+
+Copy contents of `incubator-systemml/docs/_site` to `incubator-systemml-website-site/docs/0.13.0`.
+
+Delete any unnecessary files (`Gemfile`, `Gemfile.lock`).
+
+Create `incubator-systemml-website-site/docs/0.13.0/api/java` folder for javadocs.
+
+Update `incubator-systemml/pom.xml` project version to what should be displayed in javadocs (such as `0.13.0`).
+
+Build project (which generates javadocs).
+
+	$ mvn clean package -P distribution
+
+Copy contents of `incubator-systemml/target/apidocs` to `incubator-systemml-website-site/docs/0.13.0/api/java`.
+
+Open up `file:///.../incubator-systemml-website-site/docs/0.13.0/index.html` and verify `API Docs` &rarr; `Javadoc` link works and that the correct Javadoc version is displayed. Verify feedback links under `Issues` menu are not present.
+
+Clean up any unnecessary files (such as deleting `.DS_Store` files on OS X).
+
+	$ find . -name '.DS_Store' -type f -delete
+
+Commit the versioned project documentation to `svn`:
+
+	$ svn status
+	$ svn add docs/0.13.0
+	$ svn commit -m "Add 0.13.0 docs to website"
+
+Update `incubator-systemml-website/_src/documentation.html` to include 0.13.0 link.
+
+Start main website site by running `gulp` in `incubator-systemml-website`:
+
+	$ gulp
+
+Commit and push the update to `git` project.
+
+	$ git add -u
+	$ git commit -m "Add 0.13.0 link to documentation page"
+	$ git push
+	$ git push apache master
+
+Copy contents of `incubator-systemml-website/_site` (generated by `gulp`) to `incubator-systemml-website-site`.
+After doing so, we should see that `incubator-systemml-website-site/documentation.html` has been updated.
+
+	$ svn status
+	$ svn diff
+
+Commit the update to `documentation.html` to publish the website update.
+
+	$ svn commit -m "Add 0.13.0 link to documentation page"
+
+The versioned project documentation is now deployed to the main website, and the
+[Documentation Page](http://systemml.apache.org/documentation) contains a link to the versioned documentation.
+


[05/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1168] Document printf functionality

Posted by de...@apache.org.
[SYSTEMML-1168] Document printf functionality

Add print function's multi-argument printf formatting capabilities
to the DML Language Reference.

Closes #333.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/82682553
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/82682553
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/82682553

Branch: refs/heads/gh-pages
Commit: 82682553dab1264c86987b4aaff00735f983f673
Parents: a9695eb
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Fri Jan 6 21:30:10 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Fri Jan 6 21:30:10 2017 -0800

----------------------------------------------------------------------
 dml-language-reference.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/82682553/dml-language-reference.md
----------------------------------------------------------------------
diff --git a/dml-language-reference.md b/dml-language-reference.md
index eefdc44..80fc8ca 100644
--- a/dml-language-reference.md
+++ b/dml-language-reference.md
@@ -1454,7 +1454,7 @@ Function | Description | Parameters | Example
 -------- | ----------- | ---------- | -------
 append() | Append a string to another string separated by "\n" <br/> Limitation: The string may grow up to 1 MByte. | Input: (&lt;string&gt;, &lt;string&gt;) <br/> Output: &lt;string&gt; | s = "iter=" + i <br/> i = i + 1 <br/> s = append(s, "iter=" + i) <br/> write(s, "s.out")
 toString() | Formats a Matrix or Frame object into a string. <br/> "rows" & "cols" : number of rows and columns to print<br/> "decimal" : number of digits after the decimal<br/>"sparse" : set to true to print Matrix object in sparse format, i.e. _RowIndex_ _ColIndex_ _Value_<br/>"sep" and "linesep" : inter-element separator and the line separator strings| Input : (&lt;matrix&gt; or &lt;frame&gt;,<br/> &nbsp;&nbsp;rows=100,<br/> &nbsp;&nbsp;cols=100,<br/> &nbsp;&nbsp;decimal=3,<br/> &nbsp;&nbsp;sparse=FALSE,<br/> &nbsp;&nbsp;sep=" ",<br/> &nbsp;&nbsp;linesep="\n") <br/> Output: &lt;string&gt; | X = matrix(seq(1, 9), rows=3, cols=3)<br/>str = toString(X, sep=" \| ") <br/><br/>F = as.frame(X)<br/>print(toString(F, rows=2, cols=2))
-print() | Prints the value of a scalar variable x. This built-in takes an optional string parameter. | Input: (&lt;scalar&gt;) | print("hello") <br/> print("hello" + "world") <br/> print("value of x is " + x )
+print() | Prints a scalar variable. The print() function supports printf-style formatting: it optionally accepts multiple arguments, where the first argument is the format string and the remaining arguments are the values to format. | Input: &lt;scalar&gt;<br/>or<br/>&lt;string, args...&gt; | print("hello") <br/> print("hello" + "world") <br/> print("value of x is " + x ) <br/><br/>a='hello';<br/>b=3;<br/>c=4.5;<br/>d=TRUE;<br/>print('%s %d %f %b', a, b, c, d); <br/><br/>a='hello';<br/>b='goodbye';<br/>c=4;<br/>d=3;<br/>e=3.0;<br/>f=5.0;<br/>g=FALSE;<br/>print('%s %d %f %b', (a+b), (c-d), (e*f), !g);
 stop() | Halts the execution of DML program by printing the message that is passed in as the argument. <br/> Note that the use of stop() is not allowed inside a parfor loop. |  Input: (&lt;scalar&gt;) | stop("Inputs to DML program are invalid") <br/> stop("Class labels must be either -1 or +1")
 order() | Sort a column of the matrix X in decreasing/increasing order and return either index (index.return=TRUE) or data (index.return=FALSE). | Input: (target=X, by=column, decreasing, index.return) | order(X, by=1, decreasing=FALSE, index.return=FALSE)
 


[42/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1393] Exclude alg ref and lang ref dirs from doc site

Posted by de...@apache.org.
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/LinReg.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/LinReg.tex b/Algorithms Reference/LinReg.tex
deleted file mode 100644
index 67273c2..0000000
--- a/Algorithms Reference/LinReg.tex	
+++ /dev/null
@@ -1,328 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Linear Regression}
-\label{sec:LinReg}
-
-\noindent{\bf Description}
-\smallskip
-
-Linear Regression scripts are used to model the relationship between one numerical
-response variable and one or more explanatory (feature) variables.
-The scripts are given a dataset $(X, Y) = (x_i, y_i)_{i=1}^n$ where $x_i$ is a
-numerical vector of feature variables and $y_i$ is a numerical response value for
-each training data record.  The feature vectors are provided as a matrix $X$ of size
-$n\,{\times}\,m$, where $n$ is the number of records and $m$ is the number of features.
-The observed response values are provided as a 1-column matrix~$Y$, with a numerical
-value $y_i$ for each~$x_i$ in the corresponding row of matrix~$X$.
-
-In linear regression, we predict the distribution of the response~$y_i$ based on
-a fixed linear combination of the features in~$x_i$.  We assume that
-there exist constant regression coefficients $\beta_0, \beta_1, \ldots, \beta_m$
-and a constant residual variance~$\sigma^2$ such that
-\begin{equation}
-y_i \sim \Normal(\mu_i, \sigma^2) \,\,\,\,\textrm{where}\,\,\,\,
-\mu_i \,=\, \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m}
-\label{eqn:linregdef}
-\end{equation}
-Distribution $y_i \sim \Normal(\mu_i, \sigma^2)$ models the ``unexplained'' residual
-noise and is assumed independent across different records.
-
-The goal is to estimate the regression coefficients and the residual variance.
-Once they are accurately estimated, we can make predictions about $y_i$ given~$x_i$
-in new records.  We can also use the $\beta_j$'s to analyze the influence of individual
-features on the response value, and assess the quality of this model by comparing
-residual variance in the response, left after prediction, with its total variance.
-
-There are two scripts in our library, both doing the same estimation, but using different
-computational methods.  Depending on the size and the sparsity of the feature matrix~$X$,
-one or the other script may be more efficient.  The ``direct solve'' script
-{\tt LinearRegDS} is more efficient when the number of features $m$ is relatively small
-($m \sim 1000$ or less) and matrix~$X$ is either tall or fairly dense
-(has~${\gg}\:m^2$ nonzeros); otherwise, the ``conjugate gradient'' script {\tt LinearRegCG}
-is more efficient.  If $m > 50000$, use only {\tt LinearRegCG}.
-
-\smallskip
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}LinearRegDS.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} Y=}path/file
-{\tt{} B=}path/file
-{\tt{} O=}path/file
-{\tt{} icpt=}int
-{\tt{} reg=}double
-{\tt{} fmt=}format
-
-}\smallskip
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}LinearRegCG.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} Y=}path/file
-{\tt{} B=}path/file
-{\tt{} O=}path/file
-{\tt{} Log=}path/file
-{\tt{} icpt=}int
-{\tt{} reg=}double
-{\tt{} tol=}double
-{\tt{} maxi=}int
-{\tt{} fmt=}format
-
-}
-
-\smallskip
-\noindent{\bf Arguments}
-\begin{Description}
-\item[{\tt X}:]
-Location (on HDFS) to read the matrix of feature vectors, each row constitutes
-one feature vector
-\item[{\tt Y}:]
-Location to read the 1-column matrix of response values
-\item[{\tt B}:]
-Location to store the estimated regression parameters (the $\beta_j$'s), with the
-intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available
-\item[{\tt O}:] (default:\mbox{ }{\tt " "})
-Location to store the CSV-file of summary statistics defined in
-Table~\ref{table:linreg:stats}, the default is to print it to the standard output
-\item[{\tt Log}:] (default:\mbox{ }{\tt " "}, {\tt LinearRegCG} only)
-Location to store iteration-specific variables for monitoring and debugging purposes,
-see Table~\ref{table:linreg:log} for details.
-\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
-Intercept presence and shifting/rescaling the features in~$X$:\\
-{\tt 0} = no intercept (hence no~$\beta_0$), no shifting or rescaling of the features;\\
-{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
-{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
-\item[{\tt reg}:] (default:\mbox{ }{\tt 0.000001})
-L2-regularization parameter~\mbox{$\lambda\geq 0$}; set to nonzero for highly dependent,
-sparse, or numerous ($m \gtrsim n/10$) features
-\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001}, {\tt LinearRegCG} only)
-Tolerance \mbox{$\eps\geq 0$} used in the convergence criterion: we terminate conjugate
-gradient iterations when the $\beta$-residual reduces in L2-norm by this factor
-\item[{\tt maxi}:] (default:\mbox{ }{\tt 0}, {\tt LinearRegCG} only)
-Maximum number of conjugate gradient iterations, or~0 if no maximum
-limit provided
-\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-see read/write functions in SystemML Language Reference for details.
-\end{Description}
-
-
-\begin{table}[t]\small\centerline{%
-\begin{tabular}{|ll|}
-\hline
-Name & Meaning \\
-\hline
-{\tt AVG\_TOT\_Y}          & Average of the response value $Y$ \\
-{\tt STDEV\_TOT\_Y}        & Standard Deviation of the response value $Y$ \\
-{\tt AVG\_RES\_Y}          & Average of the residual $Y - \mathop{\mathrm{pred}}(Y|X)$, i.e.\ residual bias \\
-{\tt STDEV\_RES\_Y}        & Standard Deviation of the residual $Y - \mathop{\mathrm{pred}}(Y|X)$ \\
-{\tt DISPERSION}           & GLM-style dispersion, i.e.\ residual sum of squares / \#deg.\ fr. \\
-{\tt PLAIN\_R2}            & Plain $R^2$ of residual with bias included vs.\ total average \\
-{\tt ADJUSTED\_R2}         & Adjusted $R^2$ of residual with bias included vs.\ total average \\
-{\tt PLAIN\_R2\_NOBIAS}    & Plain $R^2$ of residual with bias subtracted vs.\ total average \\
-{\tt ADJUSTED\_R2\_NOBIAS} & Adjusted $R^2$ of residual with bias subtracted vs.\ total average \\
-{\tt PLAIN\_R2\_VS\_0}     & ${}^*$Plain $R^2$ of residual with bias included vs.\ zero constant \\
-{\tt ADJUSTED\_R2\_VS\_0}  & ${}^*$Adjusted $R^2$ of residual with bias included vs.\ zero constant \\
-\hline
-\multicolumn{2}{r}{${}^{*\mathstrut}$ The last two statistics are only printed if there is no intercept ({\tt icpt=0})} \\
-\end{tabular}}
-\caption{Besides~$\beta$, linear regression scripts compute a few summary statistics
-listed above.  The statistics are provided in CSV format, one comma-separated name-value
-pair per each line.}
-\label{table:linreg:stats}
-\end{table}
-
-\begin{table}[t]\small\centerline{%
-\begin{tabular}{|ll|}
-\hline
-Name & Meaning \\
-\hline
-{\tt CG\_RESIDUAL\_NORM}  & L2-norm of conjug.\ grad.\ residual, which is $A \pxp \beta - t(X) \pxp y$ \\
-                          & where $A = t(X) \pxp X + \diag (\lambda)$, or a similar quantity \\
-{\tt CG\_RESIDUAL\_RATIO} & Ratio of current L2-norm of conjug.\ grad.\ residual over the initial \\
-\hline
-\end{tabular}}
-\caption{
-The {\tt Log} file for {\tt{}LinearRegCG} script contains the above \mbox{per-}iteration
-variables in CSV format, each line containing triple (Name, Iteration\#, Value) with
-Iteration\# being~0 for initial values.}
-\label{table:linreg:log}
-\end{table}
-
-
-\noindent{\bf Details}
-\smallskip
-
-To solve a linear regression problem over feature matrix~$X$ and response vector~$Y$,
-we can find coefficients $\beta_0, \beta_1, \ldots, \beta_m$ and $\sigma^2$ that maximize
-the joint likelihood of all $y_i$ for $i=1\ldots n$, defined by the assumed statistical
-model~(\ref{eqn:linregdef}).  Since the joint likelihood of the independent
-$y_i \sim \Normal(\mu_i, \sigma^2)$ is proportional to the product of
-$\exp\big({-}\,(y_i - \mu_i)^2 / (2\sigma^2)\big)$, we can take the logarithm of this
-product, then multiply by $-2\sigma^2 < 0$ to obtain a least squares problem:
-\begin{equation}
-\sum_{i=1}^n \, (y_i - \mu_i)^2 \,\,=\,\, 
-\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2
-\,\,\to\,\,\min
-\label{eqn:linregls}
-\end{equation}
-This may not be enough, however.  The minimum may sometimes be attained over infinitely many
-$\beta$-vectors, for example if $X$ has an all-0 column, or has linearly dependent columns,
-or has fewer rows than columns~\mbox{($n < m$)}.  Even if~(\ref{eqn:linregls}) has a unique
-solution, other $\beta$-vectors may be just a little suboptimal\footnote{Smaller likelihood
-difference between two models suggests less statistical evidence to pick one model over the
-other.}, yet give significantly different predictions for new feature vectors.  This results
-in \emph{overfitting}: prediction error for the training data ($X$ and~$Y$) is much smaller
-than for the test data (new records).
-
-Overfitting and degeneracy in the data is commonly mitigated by adding a regularization penalty
-term to the least squares function:
-\begin{equation}
-\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2
-\,+\,\, \lambda \sum_{j=1}^m \beta_j^2
-\,\,\to\,\,\min
-\label{eqn:linreglsreg}
-\end{equation}
-The choice of $\lambda>0$, the regularization constant, typically involves cross-validation
-where the dataset is repeatedly split into a training part (to estimate the~$\beta_j$'s) and
-a test part (to evaluate prediction accuracy), with the goal of maximizing the test accuracy.
-In our scripts, $\lambda$~is provided as input parameter~{\tt reg}.
-
-The solution to least squares problem~(\ref{eqn:linreglsreg}), through taking the derivative
-and setting it to~0, has the matrix linear equation form
-\begin{equation}
-A\left[\textstyle\beta_{1:m}\atop\textstyle\beta_0\right] \,=\, \big[X,\,1\big]^T Y,\,\,\,
-\textrm{where}\,\,\,
-A \,=\, \big[X,\,1\big]^T \big[X,\,1\big]\,+\,\hspace{0.5pt} \diag(\hspace{0.5pt}
-\underbrace{\raisebox{0pt}[0pt][0.5pt]{$\lambda,\ldots, \lambda$}}_{\raisebox{2pt}{$\scriptstyle m$}}
-\hspace{0.5pt}, 0)
-\label{eqn:linregeq}
-\end{equation}
-where $[X,\,1]$ is $X$~with an extra column of~1s appended on the right, and the
-diagonal matrix of $\lambda$'s has a zero to keep the intercept~$\beta_0$ unregularized.
-If the intercept is disabled by setting {\tt icpt=0}, the equation is simply
-\mbox{$X^T X \beta = X^T Y$}.
-
-We implemented two scripts for solving equation~(\ref{eqn:linregeq}): one is a ``direct solver''
-that computes $A$ and then solves $A\beta = [X,\,1]^T Y$ by calling an external package,
-the other performs linear conjugate gradient~(CG) iterations without ever materializing~$A$.
-The CG~algorithm closely follows Algorithm~5.2 in Chapter~5 of~\cite{Nocedal2006:Optimization}.
-Each step in the CG~algorithm computes a matrix-vector multiplication $q = Ap$ by first computing
-$[X,\,1]\, p$ and then $[X,\,1]^T [X,\,1]\, p$.  Usually the number of such multiplications,
-one per CG iteration, is much smaller than~$m$.  The user can put a hard bound on it with input 
-parameter~{\tt maxi}, or use the default maximum of~$m+1$ (or~$m$ if no intercept) by
-having {\tt maxi=0}.  The CG~iterations terminate when the L2-norm of vector
-$r = A\beta - [X,\,1]^T Y$ decreases from its initial value (for~$\beta=0$) by the tolerance
-factor specified in input parameter~{\tt tol}.
-
-The CG algorithm is more efficient if computing
-$[X,\,1]^T \big([X,\,1]\, p\big)$ is much faster than materializing $A$,
-an $(m\,{+}\,1)\times(m\,{+}\,1)$ matrix.  The Direct Solver~(DS) is more efficient if
-$X$ takes up a lot more memory than $A$ (i.e.\ $X$~has a lot more nonzeros than~$m^2$)
-and if $m^2$ is small enough for the external solver ($m \lesssim 50000$).  A more precise
-determination between CG and~DS is subject to further research.
-
-In addition to the $\beta$-vector, the scripts estimate the residual standard
-deviation~$\sigma$ and the~$R^2$, the ratio of ``explained'' variance to the total
-variance of the response variable.  These statistics only make sense if the number
-of degrees of freedom $n\,{-}\,m\,{-}\,1$ is positive and the regularization constant
-$\lambda$ is negligible or zero.  The formulas for $\sigma$ and $R^2$~are:
-\begin{equation*}
-R^2_{\textrm{plain}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},\quad
-\sigma \,=\, \sqrt{\frac{\mathrm{RSS}}{n - m - 1}},\quad
-R^2_{\textrm{adj.}} = 1 - \frac{\sigma^2 (n-1)}{\mathrm{TSS}}
-\end{equation*}
-where
-\begin{equation*}
-\mathrm{RSS} \,=\, \sum_{i=1}^n \Big(y_i - \hat{\mu}_i - 
-\frac{1}{n} \sum_{i'=1}^n \,(y_{i'} - \hat{\mu}_{i'})\Big)^2; \quad
-\mathrm{TSS} \,=\, \sum_{i=1}^n \Big(y_i - \frac{1}{n} \sum_{i'=1}^n y_{i'}\Big)^2
-\end{equation*}
-Here $\hat{\mu}_i$ are the predicted means for $y_i$ based on the estimated
-regression coefficients and the feature vectors.  They may be biased when no
-intercept is present, hence the RSS formula subtracts the bias.
-
-Lastly, note that by choosing the input option {\tt icpt=2} the user can shift
-and rescale the columns of~$X$ to have zero average and variance of~1.
-This is particularly important when using regularization over highly imbalanced
-features, because regularization tends to penalize small-variance columns (which
-need large~$\beta_j$'s) more than large-variance columns (with small~$\beta_j$'s).
-At the end, the estimated regression coefficients are shifted and rescaled to
-apply to the original features.
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-The estimated regression coefficients (the $\hat{\beta}_j$'s) are populated into
-a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}''
-input argument.  What this matrix contains, and its size, depends on the input
-argument {\tt icpt}, which specifies the user's intercept and rescaling choice:
-\begin{Description}
-\item[{\tt icpt=0}:] No intercept, matrix~$B$ has size $m\,{\times}\,1$, with
-$B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to~$m = {}$ncol$(X)$.
-\item[{\tt icpt=1}:] There is intercept, but no shifting/rescaling of~$X$; matrix~$B$
-has size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from 1 to~$m$,
-and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept coefficient.
-\item[{\tt icpt=2}:] There is intercept, and the features in~$X$ are shifted to
-mean${} = 0$ and rescaled to variance${} = 1$; then there are two versions of
-the~$\hat{\beta}_j$'s, one for the original features and another for the
-shifted/rescaled features.  Now matrix~$B$ has size $(m\,{+}\,1) \times 2$, with
-$B[\cdot, 1]$ for the original features and $B[\cdot, 2]$ for the shifted/rescaled
-features, in the above format.  Note that $B[\cdot, 2]$ are iteratively estimated
-and $B[\cdot, 1]$ are obtained from $B[\cdot, 2]$ by complementary shifting and
-rescaling.
-\end{Description}
-The estimated summary statistics, including residual standard deviation~$\sigma$ and
-the~$R^2$, are printed out or sent into a file (if specified) in CSV format as
-defined in Table~\ref{table:linreg:stats}.  For conjugate gradient iterations,
-a log file with monitoring variables can also be made available, see
-Table~\ref{table:linreg:log}.
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-\hml -f LinearRegCG.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
-  B=/user/biadmin/B.mtx fmt=csv O=/user/biadmin/stats.csv
-  icpt=2 reg=1.0 tol=0.00000001 maxi=100 Log=/user/biadmin/log.csv
-
-}
-{\hangindent=\parindent\noindent\tt
-\hml -f LinearRegDS.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
-  B=/user/biadmin/B.mtx fmt=csv O=/user/biadmin/stats.csv
-  icpt=2 reg=1.0
-
-}
-
-% \smallskip
-% \noindent{\bf See Also}
-% \smallskip
-% 
-% In case of binary classification problems, please consider using L2-SVM or
-% binary logistic regression; for multiclass classification, use multiclass~SVM
-% or multinomial logistic regression.  For more complex distributions of the
-% response variable use the Generalized Linear Models script.
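
The direct-solve formulation described above, A beta = [X,1]^T Y with A = [X,1]^T [X,1] + diag(lambda, ..., lambda, 0), maps to a few lines of NumPy. The sketch below is an illustration only, assuming small, dense, in-memory data; it is not the distributed `LinearRegDS.dml` script, and the function name `linreg_ds` is made up for this example.

    import numpy as np

    def linreg_ds(X, y, reg=1e-6, icpt=1):
        # Append an intercept column when icpt >= 1, i.e. form [X, 1].
        n, m = X.shape
        if icpt >= 1:
            X = np.hstack([X, np.ones((n, 1))])
        # A = [X,1]^T [X,1] + diag(reg, ..., reg, 0); the intercept stays unregularized.
        lam = np.full(X.shape[1], float(reg))
        if icpt >= 1:
            lam[-1] = 0.0
        A = X.T @ X + np.diag(lam)
        beta = np.linalg.solve(A, X.T @ y)

        # Summary statistics from the write-up above; only meaningful when
        # n - m - 1 > 0 and reg is negligible.
        residuals = y - X @ beta
        rss = np.sum((residuals - residuals.mean()) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        sigma = np.sqrt(rss / (n - m - 1))
        r2_plain = 1.0 - rss / tss
        r2_adj = 1.0 - sigma ** 2 * (n - 1) / tss
        return beta, sigma, r2_plain, r2_adj

The conjugate-gradient variant avoids materializing A altogether: each iteration evaluates A @ p by first computing [X,1] @ p and then [X,1].T @ ([X,1] @ p), plus the regularization term.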

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/LogReg.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/LogReg.tex b/Algorithms Reference/LogReg.tex
deleted file mode 100644
index 43d4e15..0000000
--- a/Algorithms Reference/LogReg.tex	
+++ /dev/null
@@ -1,287 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Multinomial Logistic Regression}
-
-\noindent{\bf Description}
-\smallskip
-
-Our logistic regression script performs both binomial and multinomial logistic regression.
-The script is given a dataset $(X, Y)$ where matrix $X$ has $m$~columns and matrix $Y$ has
-one column; both $X$ and~$Y$ have $n$~rows.  The rows of $X$ and~$Y$ are viewed as a collection
-of records: $(X, Y) = (x_i, y_i)_{i=1}^n$ where $x_i$ is a numerical vector of explanatory
-(feature) variables and $y_i$ is a categorical response variable.
-Each row~$x_i$ in~$X$ has size~\mbox{$\dim x_i = m$}, while its corresponding $y_i$ is an
-integer that represents the observed response value for record~$i$.
-
-The goal of logistic regression is to learn a linear model over the feature vector
-$x_i$ that can be used to predict how likely each categorical label is expected to
-be observed as the actual~$y_i$.
-Note that logistic regression predicts more than a label: it predicts the probability
-for every possible label.  The binomial case allows only two possible labels, the
-multinomial case has no such restriction.
-
-Just as linear regression estimates the mean value $\mu_i$ of a numerical response
-variable, logistic regression does the same for category label probabilities.
-In linear regression, the mean of $y_i$ is estimated as a linear combination of the features:
-$\mu_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} = \beta_0 + x_i\beta_{1:m}$.
-In logistic regression, the
-label probability has to lie between 0 and~1, so a link function is applied to connect
-it to $\beta_0 + x_i\beta_{1:m}$.  If there are just two possible category labels, for example
-0~and~1, the logistic link looks as follows:
-\begin{equation*}
-\Prob[y_i\,{=}\,1\mid x_i; \beta] \,=\, 
-\frac{e^{\,\beta_0 + x_i\beta_{1:m}}}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}};
-\quad
-\Prob[y_i\,{=}\,0\mid x_i; \beta] \,=\, 
-\frac{1}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}}
-\end{equation*}
-Here category label~0 serves as the \emph{baseline}, and function
-$\exp(\beta_0 + x_i\beta_{1:m})$
-shows how likely we expect to see ``$y_i = 1$'' in comparison to the baseline.
-Like in a loaded coin, the predicted odds of seeing 1~versus~0 are
-$\exp(\beta_0 + x_i\beta_{1:m})$ to~1,
-with each feature $x_{i,j}$ multiplying its own factor $\exp(\beta_j x_{i,j})$ to the odds.
-Given a large collection of pairs $(x_i, y_i)$, $i=1\ldots n$, logistic regression seeks
-to find the $\beta_j$'s that maximize the product of probabilities
-\hbox{$\Prob[y_i\mid x_i; \beta]$}
-for actually observed $y_i$-labels (assuming no regularization).
-
-Multinomial logistic regression~\cite{Agresti2002:CDA} extends this link to $k \geq 3$ possible
-categories.  Again we identify one category as the baseline, for example the $k$-th category.
-Instead of a coin, here we have a loaded multisided die, one side per category.  Each non-baseline
-category $l = 1\ldots k\,{-}\,1$ has its own vector $(\beta_{0,l}, \beta_{1,l}, \ldots, \beta_{m,l})$
-of regression parameters with the intercept, making up a matrix $B$ of size
-$(m\,{+}\,1)\times(k\,{-}\,1)$.  The predicted odds of seeing non-baseline category~$l$ versus
-the baseline~$k$ are $\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)$
-to~1, and the predicted probabilities are:
-\begin{align}
-l < k:\quad\Prob[y_i\,{=}\,\makebox[0.5em][c]{$l$}\mid x_i; B] \,\,\,{=}\,\,\,&
-\frac{\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)}%
-{1 \,+\, \sum_{l'=1}^{k-1}\exp\big(\beta_{0,l'} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l'}\big)};
-\label{eqn:mlogreg:nonbaseprob}\\
-\Prob[y_i\,{=}\,\makebox[0.5em][c]{$k$}\mid x_i; B] \,\,\,{=}\,\,\,& \frac{1}%
-{1 \,+\, \sum_{l'=1}^{k-1}\exp\big(\beta_{0,l'} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l'}\big)}.
-\label{eqn:mlogreg:baseprob}
-\end{align}
-The goal of the regression is to estimate the parameter matrix~$B$ from the provided dataset
-$(X, Y) = (x_i, y_i)_{i=1}^n$ by maximizing the product of \hbox{$\Prob[y_i\mid x_i; B]$}
-over the observed labels~$y_i$.  Taking its logarithm, negating, and adding a regularization term
-gives us a minimization objective:
-\begin{equation}
-f(B; X, Y) \,\,=\,\,
--\sum_{i=1}^n \,\log \Prob[y_i\mid x_i; B] \,+\,
-\frac{\lambda}{2} \sum_{j=1}^m \sum_{l=1}^{k-1} |\beta_{j,l}|^2
-\,\,\to\,\,\min
-\label{eqn:mlogreg:loss}
-\end{equation}
-The optional regularization term is added to mitigate overfitting and degeneracy in the data;
-to reduce bias, the intercepts $\beta_{0,l}$ are not regularized.  Once the~$\beta_{j,l}$'s
-are accurately estimated, we can make predictions about the category label~$y$ for a new
-feature vector~$x$ using Eqs.~(\ref{eqn:mlogreg:nonbaseprob}) and~(\ref{eqn:mlogreg:baseprob}).
-
-\smallskip
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}MultiLogReg.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} Y=}path/file
-{\tt{} B=}path/file
-{\tt{} Log=}path/file
-{\tt{} icpt=}int
-{\tt{} reg=}double
-{\tt{} tol=}double
-{\tt{} moi=}int
-{\tt{} mii=}int
-{\tt{} fmt=}format
-
-}
-
-
-\smallskip
-\noindent{\bf Arguments}
-\begin{Description}
-\item[{\tt X}:]
-Location (on HDFS) to read the input matrix of feature vectors; each row constitutes
-one feature vector.
-\item[{\tt Y}:]
-Location to read the input one-column matrix of category labels that correspond to
-feature vectors in~{\tt X}.  Note the following:\\
--- Each non-baseline category label must be a positive integer.\\
--- If all labels are positive, the largest represents the baseline category.\\
--- If non-positive labels such as $-1$ or~$0$ are present, then they represent the (same)
-baseline category and are converted to label $\max(\texttt{Y})\,{+}\,1$.
-\item[{\tt B}:]
-Location to store the matrix of estimated regression parameters (the $\beta_{j, l}$'s),
-with the intercept parameters~$\beta_{0, l}$ at position {\tt B[}$m\,{+}\,1$, $l${\tt ]}
-if available.  The size of {\tt B} is $(m\,{+}\,1)\times (k\,{-}\,1)$ with the intercepts
-or $m \times (k\,{-}\,1)$ without the intercepts, one column per non-baseline category
-and one row per feature.
-\item[{\tt Log}:] (default:\mbox{ }{\tt " "})
-Location to store iteration-specific variables for monitoring and debugging purposes,
-see Table~\ref{table:mlogreg:log} for details.
-\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
-Intercept and shifting/rescaling of the features in~$X$:\\
-{\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\
-{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
-{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
-\item[{\tt reg}:] (default:\mbox{ }{\tt 0.0})
-L2-regularization parameter (lambda)
-\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
-Tolerance (epsilon) used in the convergence criterion
-\item[{\tt moi}:] (default:\mbox{ }{\tt 100})
-Maximum number of outer (Fisher scoring) iterations
-\item[{\tt mii}:] (default:\mbox{ }{\tt 0})
-Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum
-limit provided
-\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-see read/write functions in SystemML Language Reference for details.
-\end{Description}
-
-
-\begin{table}[t]\small\centerline{%
-\begin{tabular}{|ll|}
-\hline
-Name & Meaning \\
-\hline
-{\tt LINEAR\_TERM\_MIN}  & The minimum value of $X \pxp B$, used to check for overflows \\
-{\tt LINEAR\_TERM\_MAX}  & The maximum value of $X \pxp B$, used to check for overflows \\
-{\tt NUM\_CG\_ITERS}     & Number of inner (Conj.\ Gradient) iterations in this outer iteration \\
-{\tt IS\_TRUST\_REACHED} & $1 = {}$trust region boundary was reached, $0 = {}$otherwise \\
-{\tt POINT\_STEP\_NORM}  & L2-norm of iteration step from old point (matrix $B$) to new point \\
-{\tt OBJECTIVE}          & The loss function we minimize (negative regularized log-likelihood) \\
-{\tt OBJ\_DROP\_REAL}    & Reduction in the objective during this iteration, actual value \\
-{\tt OBJ\_DROP\_PRED}    & Reduction in the objective predicted by a quadratic approximation \\
-{\tt OBJ\_DROP\_RATIO}   & Actual-to-predicted reduction ratio, used to update the trust region \\
-{\tt IS\_POINT\_UPDATED} & $1 = {}$new point accepted; $0 = {}$new point rejected, old point restored \\
-{\tt GRADIENT\_NORM}     & L2-norm of the loss function gradient (omitted if point is rejected) \\
-{\tt TRUST\_DELTA}       & Updated trust region size, the ``delta'' \\
-\hline
-\end{tabular}}
-\caption{
-The {\tt Log} file for multinomial logistic regression contains the above \mbox{per-}iteration
-variables in CSV format, each line containing triple (Name, Iteration\#, Value) with Iteration\#
-being~0 for initial values.}
-\label{table:mlogreg:log}
-\end{table}
-
-
-\noindent{\bf Details}
-\smallskip
-
-We estimate the logistic regression parameters via L2-regularized negative
-log-likelihood minimization~(\ref{eqn:mlogreg:loss}).
-The optimization method used in the script closely follows the trust region
-Newton method for logistic regression described in~\cite{Lin2008:logistic}.
-For convenience, let us make some changes in notation:
-\begin{Itemize}
-\item Convert the input vector of observed category labels into an indicator matrix $Y$
-of size $n \times k$ such that $Y_{i, l} = 1$ if the $i$-th category label is~$l$ and
-$Y_{i, l} = 0$ otherwise;
-\item Append an extra column of all ones, i.e.\ $(1, 1, \ldots, 1)^T$, as the
-$m\,{+}\,1$-st column to the feature matrix $X$ to represent the intercept;
-\item Append an all-zero column as the $k$-th column to $B$, the matrix of regression
-parameters, to represent the baseline category;
-\item Convert the regularization constant $\lambda$ into matrix $\Lambda$ of the same
-size as $B$, placing 0's into the $m\,{+}\,1$-st row to disable intercept regularization,
-and placing $\lambda$'s everywhere else.
-\end{Itemize}
-Now the ($n\,{\times}\,k$)-matrix of predicted probabilities given
-by (\ref{eqn:mlogreg:nonbaseprob}) and~(\ref{eqn:mlogreg:baseprob})
-and the objective function $f$ in~(\ref{eqn:mlogreg:loss}) have the matrix form
-\begin{align*}
-P \,\,&=\,\, \exp(XB) \,\,/\,\, \big(\exp(XB)\,1_{k\times k}\big)\\
-f \,\,&=\,\, - \,\,{\textstyle\sum} \,\,Y \cdot (X B)\, + \,
-{\textstyle\sum}\,\log\big(\exp(XB)\,1_{k\times 1}\big) \,+ \,
-(1/2)\,\, {\textstyle\sum} \,\,\Lambda \cdot B \cdot B
-\end{align*}
-where operations $\cdot\,$, $/$, $\exp$, and $\log$ are applied cellwise,
-and $\textstyle\sum$ denotes the sum of all cells in a matrix.
-The gradient of~$f$ with respect to~$B$ can be represented as a matrix too:
-\begin{equation*}
-\nabla f \,\,=\,\, X^T (P - Y) \,+\, \Lambda \cdot B
-\end{equation*}
-The Hessian $\mathcal{H}$ of~$f$ is a tensor, but, fortunately, the conjugate
-gradient inner loop of the trust region algorithm in~\cite{Lin2008:logistic}
-does not need to instantiate it.  We only need to multiply $\mathcal{H}$ by
-ordinary matrices of the same size as $B$ and $\nabla f$, and this can be done
-in matrix form:
-\begin{equation*}
-\mathcal{H}V \,\,=\,\, X^T \big( Q \,-\, P \cdot (Q\,1_{k\times k}) \big) \,+\,
-\Lambda \cdot V, \,\,\,\,\textrm{where}\,\,\,\,Q \,=\, P \cdot (XV)
-\end{equation*}
-At each Newton iteration (the \emph{outer} iteration) the minimization algorithm
-approximates the difference $\varDelta f(S; B) = f(B + S; X, Y) \,-\, f(B; X, Y)$
-attained in the objective function after a step $B \mapsto B\,{+}\,S$ by a
-second-degree formula
-\begin{equation*}
-\varDelta f(S; B) \,\,\,\approx\,\,\, (1/2)\,\,{\textstyle\sum}\,\,S \cdot \mathcal{H}S
- \,+\, {\textstyle\sum}\,\,S\cdot \nabla f
-\end{equation*}
-This approximation is then minimized by trust-region conjugate gradient iterations
-(the \emph{inner} iterations) subject to the constraint $\|S\|_2 \leq \delta$.
-The trust region size $\delta$ is initialized as $0.5\sqrt{m}\,/ \max\nolimits_i \|x_i\|_2$
-and updated as described in~\cite{Lin2008:logistic}.
-Users can specify the maximum number of the outer and the inner iterations with
-input parameters {\tt moi} and {\tt mii}, respectively.  The iterative minimizer
-terminates successfully if $\|\nabla f\|_2 < \eps\,\|\nabla f_{B=0}\|_2$,
-where $\eps > 0$ is a tolerance supplied by the user via input parameter~{\tt tol}.
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-The estimated regression parameters (the $\hat{\beta}_{j, l}$) are populated into
-a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}''
-input argument.  Only the non-baseline categories ($1\leq l \leq k\,{-}\,1$) have
-their $\hat{\beta}_{j, l}$ in the output; to add the baseline category, just append
-a column of zeros.  If {\tt icpt=0} in the input command line, no intercepts are used
-and {\tt B} has size $m\times (k\,{-}\,1)$; otherwise {\tt B} has size 
-$(m\,{+}\,1)\times (k\,{-}\,1)$
-and the intercepts are in the $m\,{+}\,1$-st row.  If {\tt icpt=2}, then initially
-the feature columns in~$X$ are shifted to mean${} = 0$ and rescaled to variance${} = 1$.
-After the iterations converge, the $\hat{\beta}_{j, l}$'s are rescaled and shifted
-to work with the original features.
-
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-\hml -f MultiLogReg.dml -nvargs X=/user/biadmin/X.mtx 
-  Y=/user/biadmin/Y.mtx B=/user/biadmin/B.mtx fmt=csv
-  icpt=2 reg=1.0 tol=0.0001 moi=100 mii=10 Log=/user/biadmin/log.csv
-
-}
-
-
-\smallskip
-\noindent{\bf References}
-\begin{itemize}
-\item A.~Agresti.
-\newblock {\em Categorical Data Analysis}.
-\newblock Wiley Series in Probability and Statistics. Wiley-Interscience,  second edition, 2002.
-\end{itemize}
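
The matrix-form quantities used by the trust-region Newton method above (the probability matrix P, the objective f, its gradient, and the Hessian-vector product) translate almost directly into NumPy. The sketch below assumes the indicator matrix Y, the appended intercept column of X, and the all-zero baseline column of B have already been prepared as described; it is an illustration only, not the `MultiLogReg.dml` script, the helper name `mlogreg_parts` is invented here, and no overflow guarding (e.g. log-sum-exp shifting) is included.

    import numpy as np

    def mlogreg_parts(X, Y, B, reg):
        # X: n x (m+1) features with an appended intercept column.
        # Y: n x k 0/1 indicator matrix of observed labels.
        # B: (m+1) x k parameters, last column all zeros for the baseline category.
        # Lambda: reg everywhere except the intercept row, which stays unregularized.
        Lam = np.full(B.shape, float(reg))
        Lam[-1, :] = 0.0
        XB = X @ B
        expXB = np.exp(XB)                               # no overflow guarding in this sketch
        P = expXB / expXB.sum(axis=1, keepdims=True)     # predicted probabilities
        f = (-np.sum(Y * XB)
             + np.sum(np.log(expXB.sum(axis=1)))
             + 0.5 * np.sum(Lam * B * B))                # regularized negative log-likelihood
        grad = X.T @ (P - Y) + Lam * B                   # gradient with respect to B

        def hessian_times(V):
            # Hessian-vector product without instantiating the Hessian tensor.
            Q = P * (X @ V)
            return X.T @ (Q - P * Q.sum(axis=1, keepdims=True)) + Lam * V

        return P, f, grad, hessian_times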

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/MultiSVM.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/MultiSVM.tex b/Algorithms Reference/MultiSVM.tex
deleted file mode 100644
index 87880a9..0000000
--- a/Algorithms Reference/MultiSVM.tex	
+++ /dev/null
@@ -1,174 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsubsection{Multi-class Support Vector Machines}
-\label{msvm}
-
-\noindent{\bf Description}
-
-Support Vector Machines are used to model the relationship between a categorical 
-dependent variable y and one or more explanatory variables denoted X. This 
-implementation supports dependent variables that have domain size greater than or
-equal to 2 and hence is not restricted to binary class labels.
-\\
-
-\noindent{\bf Usage}
-
-\begin{tabbing}
-\texttt{-f} \textit{path}/\texttt{m-svm.dml -nvargs}
-\=\texttt{X=}\textit{path}/\textit{file} 
-  \texttt{Y=}\textit{path}/\textit{file}
-  \texttt{icpt=}\textit{int}\\
-\>\texttt{tol=}\textit{double} 
-  \texttt{reg=}\textit{double}
-  \texttt{maxiter=}\textit{int} 
-  \texttt{model=}\textit{path}/\textit{file}\\
-\>\texttt{Log=}\textit{path}/\textit{file}
-  \texttt{fmt=}\textit{csv}$\vert$\textit{text}
-\end{tabbing}
-
-\begin{tabbing}
-\texttt{-f} \textit{path}/\texttt{m-svm-predict.dml -nvargs}
-\=\texttt{X=}\textit{path}/\textit{file} 
-  \texttt{Y=}\textit{path}/\textit{file}
-  \texttt{icpt=}\textit{int}
-  \texttt{model=}\textit{path}/\textit{file}\\
-\>\texttt{scores=}\textit{path}/\textit{file}
-  \texttt{accuracy=}\textit{path}/\textit{file}\\
-\>\texttt{confusion=}\textit{path}/\textit{file}
-  \texttt{fmt=}\textit{csv}$\vert$\textit{text}
-\end{tabbing}
-
-\noindent{\bf Arguments}
-
-\begin{itemize}
-\item X: Location (on HDFS) containing the explanatory variables 
-in a matrix. Each row constitutes an example.
-\item Y: Location (on HDFS) containing a 1-column matrix specifying 
-the categorical dependent variable (label). Labels are assumed to be 
-contiguously numbered from 1 $\ldots$ \#classes.  Note that this
-argument is optional for prediction.
-\item icpt (default: {\tt 0}): If set to 1 then a constant bias column
-is added to X.
-\item tol (default: {\tt 0.001}): Procedure terminates early if the reduction
-in objective function value is less than tolerance times the initial objective
-function value.
-\item reg (default: {\tt 1}): Regularization constant. See details to find 
-out where lambda appears in the objective function. If one were interested 
-in drawing an analogy with C-SVM, then C = 2/lambda. Usually, cross validation 
-is employed to determine the optimum value of lambda.
-\item maxiter (default: {\tt 100}): The maximum number of iterations.
-\item model: Location (on HDFS) that contains the learnt weights.
-\item Log: Location (on HDFS) to collect various metrics (e.g., objective 
-function value etc.) that depict progress across iterations while training.
-\item fmt (default: {\tt text}): Specifies the output format. Choice of 
-comma-separated values (csv) or as a sparse-matrix (text).
-\item scores: Location (on HDFS) to store scores for a held-out test set.
-Note that this is an optional argument.
-\item accuracy: Location (on HDFS) to store the accuracy computed on a
-held-out test set. Note that this is an optional argument.
-\item confusion: Location (on HDFS) to store the confusion matrix
-computed using a held-out test set. Note that this is an optional 
-argument.
-\end{itemize}
-
-\noindent{\bf Details}
-
-Support vector machines learn a classification function by solving the
-following optimization problem ($L_2$-SVM):
-\begin{eqnarray*}
-&\textrm{argmin}_w& \frac{\lambda}{2} ||w||_2^2 + \sum_i \xi_i^2\\
-&\textrm{subject to:}& y_i w^{\top} x_i \geq 1 - \xi_i ~ \forall i
-\end{eqnarray*}
-where $x_i$ is an example from the training set with its label given by $y_i$, 
-$w$ is the vector of parameters and $\lambda$ is the regularization constant 
-specified by the user.
-
-To extend the above formulation (binary class SVM) to the multiclass setting,
-one standard approach is to learn one binary class SVM per class that 
-separates data belonging to that class from the rest of the training data 
-(one-against-the-rest SVM, see B. Scholkopf, 1995).
-
-To account for the missing bias term, one may augment the data with a column
-of constants which is achieved by setting intercept argument to 1 (C-J Hsieh 
-et al, 2008).
-
-This implementation optimizes the primal directly (Chapelle, 2007). It uses 
-nonlinear conjugate gradient descent to minimize the objective function 
-coupled with choosing step-sizes by performing one-dimensional Newton 
-minimization in the direction of the gradient.
-\\
-
-\noindent{\bf Returns}
-
-The learnt weights produced by m-svm.dml are populated into a matrix that 
-has as many columns as there are classes in the training data, and written 
-to the file provided on HDFS (see model in section Arguments). The number of rows
-in this matrix is ncol(X) if intercept was set to 0 during invocation and ncol(X) + 1
-otherwise. The bias terms, if used, are placed in the last row. Depending on what
-arguments are provided during invocation, m-svm-predict.dml may compute one or more
-of scores, accuracy and confusion matrix in the output format specified.
-\\
-
-%%\noindent{\bf See Also}
-%%
-%%In case of binary classification problems, please consider using a binary class classifier
-%%learning algorithm, e.g., binary class $L_2$-SVM (see Section \ref{l2svm}) or logistic regression
-%%(see Section \ref{logreg}). To model the relationship between a scalar dependent variable 
-%%y and one or more explanatory variables X, consider Linear Regression instead (see Section 
-%%\ref{linreg-solver} or Section \ref{linreg-iterative}).
-%%\\
-%%
-\noindent{\bf Examples}
-\begin{verbatim}
-hadoop jar SystemML.jar -f m-svm.dml -nvargs X=/user/biadmin/X.mtx 
-                                             Y=/user/biadmin/y.mtx 
-                                             icpt=0 tol=0.001
-                                             reg=1.0 maxiter=100 fmt=csv 
-                                             model=/user/biadmin/weights.csv
-                                             Log=/user/biadmin/Log.csv
-\end{verbatim}
-
-\begin{verbatim}
-hadoop jar SystemML.jar -f m-svm-predict.dml -nvargs X=/user/biadmin/X.mtx 
-                                                     Y=/user/biadmin/y.mtx 
-                                                     icpt=0 fmt=csv
-                                                     model=/user/biadmin/weights.csv
-                                                     scores=/user/biadmin/scores.csv
-                                                     accuracy=/user/biadmin/accuracy.csv
-                                                     confusion=/user/biadmin/confusion.csv
-\end{verbatim}
-
-\noindent{\bf References}
-
-\begin{itemize}
-\item W. T. Vetterling and B. P. Flannery. \newblock{\em Conjugate Gradient Methods in Multidimensions in 
-Numerical Recipes in C - The Art of Scientific Computing.} \newblock W. H. Press and S. A. Teukolsky
-(eds.), Cambridge University Press, 1992.
-\item J. Nocedal and  S. J. Wright. \newblock{\em Numerical Optimization.} \newblock Springer-Verlag, 1999.
-\item C-J Hsieh, K-W Chang, C-J Lin, S. S. Keerthi and S. Sundararajan. \newblock {\em A Dual Coordinate 
-Descent Method for Large-scale Linear SVM.} \newblock International Conference on Machine Learning
-(ICML), 2008.
-\item Olivier Chapelle. \newblock{\em Training a Support Vector Machine in the Primal.} \newblock Neural 
-Computation, 2007.
-\item B. Scholkopf, C. Burges and V. Vapnik. \newblock{\em Extracting Support Data for a Given Task.} \newblock International Conference on Knowledge Discovery and Data Mining (KDD), 1995.
-\end{itemize}
-
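
For a single one-against-the-rest classifier, the L2-SVM primal objective above and its gradient can be written directly in NumPy. This is only a sketch of the objective/gradient pair, assuming in-memory data and labels already recoded to -1/+1 for the current class; the actual `m-svm.dml` script minimizes this objective with nonlinear conjugate gradient and Newton step sizes, which is not reproduced here, and `l2svm_objective` is an illustrative name.

    import numpy as np

    def l2svm_objective(w, X, y, lam):
        # Squared-hinge (L2) SVM primal: (lam/2) * ||w||^2 + sum_i max(0, 1 - y_i * w.x_i)^2.
        # X: n x m features (append a column of ones when icpt=1),
        # y: n-vector of -1/+1 labels, lam: regularization constant (C = 2/lam).
        margins = 1.0 - y * (X @ w)
        hinge = np.maximum(margins, 0.0)
        obj = 0.5 * lam * (w @ w) + np.sum(hinge ** 2)
        grad = lam * w - 2.0 * X.T @ (y * hinge)
        return obj, grad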

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/NaiveBayes.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/NaiveBayes.tex b/Algorithms Reference/NaiveBayes.tex
deleted file mode 100644
index b5f721d..0000000
--- a/Algorithms Reference/NaiveBayes.tex	
+++ /dev/null
@@ -1,155 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Naive Bayes}
-\label{naive_bayes}
-
-\noindent{\bf Description}
-
-Naive Bayes is a very simple generative model used for classifying data. 
-This implementation learns a multinomial naive Bayes classifier which
-is applicable when all features are counts of categorical values.
-\\
-
-\noindent{\bf Usage}
-
-\begin{tabbing}
-\texttt{-f} \textit{path}/\texttt{naive-bayes.dml -nvargs} 
-\=\texttt{X=}\textit{path}/\textit{file} 
-  \texttt{Y=}\textit{path}/\textit{file} 
-  \texttt{laplace=}\textit{double}\\
-\>\texttt{prior=}\textit{path}/\textit{file}
-  \texttt{conditionals=}\textit{path}/\textit{file}\\
-\>\texttt{accuracy=}\textit{path}/\textit{file}
-  \texttt{fmt=}\textit{csv}$\vert$\textit{text}
-\end{tabbing}
-
-\begin{tabbing}
-\texttt{-f} \textit{path}/\texttt{naive-bayes-predict.dml -nvargs} 
-\=\texttt{X=}\textit{path}/\textit{file} 
-  \texttt{Y=}\textit{path}/\textit{file} 
-  \texttt{prior=}\textit{path}/\textit{file}\\
-\>\texttt{conditionals=}\textit{path}/\textit{file}
-  \texttt{fmt=}\textit{csv}$\vert$\textit{text}\\
-\>\texttt{accuracy=}\textit{path}/\textit{file}
-  \texttt{confusion=}\textit{path}/\textit{file}\\
-\>\texttt{probabilities=}\textit{path}/\textit{file}
-\end{tabbing}
-
-\noindent{\bf Arguments}
-
-\begin{itemize}
-\item X: Location (on HDFS) to read the matrix of feature vectors; 
-each row constitutes one feature vector.
-\item Y: Location (on HDFS) to read the one-column matrix of (categorical) 
-labels that correspond to feature vectors in X. Classes are assumed to be
-contiguously labeled beginning from 1. Note that this argument is optional
-for prediction.
-\item laplace (default: {\tt 1}): Laplace smoothing specified by the 
-user to avoid creation of 0 probabilities.
-\item prior: Location (on HDFS) that contains the class prior probabilites.
-\item conditionals: Location (on HDFS) that contains the class conditional
-feature distributions.
-\item fmt (default: {\tt text}): Specifies the output format. Choice of 
-comma-separated values (csv) or as a sparse-matrix (text).
-\item probabilities: Location (on HDFS) to store class membership probabilities
-for a held-out test set. Note that, this is an optional argument.
-\item accuracy: Location (on HDFS) to store the training accuracy during
-learning and testing accuracy from a held-out test set during prediction. 
-Note that, this is an optional argument for prediction.
-\item confusion: Location (on HDFS) to store the confusion matrix
-computed using a held-out test set. Note that, this is an optional 
-argument.
-\end{itemize}
-
-\noindent{\bf Details}
-
-Naive Bayes is a very simple generative classification model. It posits that 
-given the class label, features can be generated independently of each other.
-More precisely, the (multinomial) naive Bayes model uses the following 
-equation to estimate the joint probability of a feature vector $x$ belonging 
-to class $y$:
-\begin{equation*}
-\text{Prob}(y, x) = \pi_y \prod_{i \in x} \theta_{iy}^{n(i,x)}
-\end{equation*}
-where $\pi_y$ denotes the prior probability of class $y$, $i$ denotes a feature
-present in $x$ with $n(i,x)$ denoting its count and $\theta_{iy}$ denotes the 
-class conditional probability of feature $i$ in class $y$. The usual 
-constraints hold on $\pi$ and $\theta$:
-\begin{eqnarray*}
-&& \pi_y \geq 0, ~ \sum_{y \in \mathcal{C}} \pi_y = 1\\
-\forall y \in \mathcal{C}: && \theta_{iy} \geq 0, ~ \sum_i \theta_{iy} = 1
-\end{eqnarray*}
-where $\mathcal{C}$ is the set of classes.
-
-Given a fully labeled training dataset, it is possible to learn a naive Bayes 
-model using simple counting (group-by aggregates). To compute the class conditional
-probabilities, it is usually advisable to avoid setting $\theta_{iy}$ to 0. One way to 
-achieve this is using additive smoothing or Laplace smoothing. Some authors have argued
-that this should in fact be add-one smoothing. This implementation uses add-one smoothing
-by default but lets the user specify her/his own constant, if required.
-
-This implementation is sometimes referred to as \emph{multinomial} naive Bayes. Other
-flavours of naive Bayes are also popular.
-\\
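As a concrete illustration of the counting-based estimation described above, the following is a minimal, editor-added Scala sketch; it is not part of naive-bayes.dml, and the toy data, sizes, and variable names are purely illustrative. It computes the class priors and the add-one-smoothed class-conditional probabilities from a small dense count matrix.

\begin{verbatim}
// Editor-added sketch (not naive-bayes.dml): multinomial naive Bayes estimation
object NaiveBayesSketch extends App {
  // toy data: 4 examples, 3 count-valued features, labels in {0, 1}
  val X = Array(Array(2.0, 0.0, 1.0), Array(3.0, 1.0, 0.0),
                Array(0.0, 2.0, 4.0), Array(1.0, 3.0, 2.0))
  val y = Array(0, 0, 1, 1)
  val numClasses = 2
  val laplace = 1.0                            // add-one smoothing constant

  // class priors: pi_y = (#examples labeled y) / N
  val prior = (0 until numClasses).map(c => y.count(_ == c).toDouble / y.length)

  // smoothed class-conditional probabilities: theta_iy
  val theta = (0 until numClasses).map { c =>
    val rows   = X.zip(y).collect { case (row, label) if label == c => row }
    val counts = X(0).indices.map(i => rows.map(_(i)).sum + laplace)
    counts.map(_ / counts.sum)                 // each class row sums to 1
  }

  println(s"priors: $prior")
  println(s"conditionals: $theta")
}
\end{verbatim}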
-
-\noindent{\bf Returns}
-
-The learnt model produced by naive-bayes.dml is stored in two separate files. 
-The first file stores the class prior (a single-column matrix). The second file 
-stores the class conditional probabilities organized into a matrix with as many 
-rows as there are class labels and as many columns as there are features. 
-Depending on what arguments are provided during invocation, naive-bayes-predict.dml 
-may compute one or more of probabilities, accuracy and confusion matrix in the 
-output format specified. 
-\\
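For intuition, classifying a new count vector $x$ then amounts to an arg max over classes of $\log \pi_y + \sum_{i} n(i,x) \log \theta_{iy}$. The editor-added Scala sketch below illustrates this; the names and types are assumptions made for the example and are unrelated to the internals of naive-bayes-predict.dml.

\begin{verbatim}
// Editor-added sketch: predict the class of a count vector x
def predict(x: Array[Double], prior: Seq[Double], theta: Seq[Seq[Double]]): Int =
  prior.indices.maxBy { y =>
    math.log(prior(y)) + x.indices.map(i => x(i) * math.log(theta(y)(i))).sum
  }
\end{verbatim}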
-
-\noindent{\bf Examples}
-
-\begin{verbatim}
-hadoop jar SystemML.jar -f naive-bayes.dml -nvargs 
-                           X=/user/biadmin/X.mtx 
-                           Y=/user/biadmin/y.mtx 
-                           laplace=1 fmt=csv
-                           prior=/user/biadmin/prior.csv
-                           conditionals=/user/biadmin/conditionals.csv
-                           accuracy=/user/biadmin/accuracy.csv
-\end{verbatim}
-
-\begin{verbatim}
-hadoop jar SystemML.jar -f naive-bayes-predict.dml -nvargs 
-                           X=/user/biadmin/X.mtx 
-                           Y=/user/biadmin/y.mtx 
-                           prior=/user/biadmin/prior.csv
-                           conditionals=/user/biadmin/conditionals.csv
-                           fmt=csv
-                           accuracy=/user/biadmin/accuracy.csv
-                           probabilities=/user/biadmin/probabilities.csv
-                           confusion=/user/biadmin/confusion.csv
-\end{verbatim}
-
-\noindent{\bf References}
-
-\begin{itemize}
-\item S. Russell and P. Norvig. \newblock{\em Artificial Intelligence: A Modern Approach.} Prentice Hall, 2009.
-\item A. McCallum and K. Nigam. \newblock{\em A comparison of event models for naive bayes text classification.} 
-\newblock AAAI-98 workshop on learning for text categorization, 1998.
-\end{itemize}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/PCA.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/PCA.tex b/Algorithms Reference/PCA.tex
deleted file mode 100644
index cef750e..0000000
--- a/Algorithms Reference/PCA.tex	
+++ /dev/null
@@ -1,142 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Principal Component Analysis}
-\label{pca}
-
-\noindent{\bf Description}
-
-Principal Component Analysis (PCA) is a simple, non-parametric method to transform the given data set with possibly correlated columns into a set of linearly uncorrelated or orthogonal columns, called {\em principal components}. The principal components are ordered in such a way that the first component accounts for the largest possible variance, followed by remaining principal components in the decreasing order of the amount of variance captured from the data. PCA is often used as a dimensionality reduction technique, where the original data is projected or rotated onto a low-dimensional space with basis vectors defined by top-$K$ (for a given value of $K$) principal components.
-\\
-
-\noindent{\bf Usage}
-
-\begin{tabbing}
-\texttt{-f} \textit{path}/\texttt{PCA.dml -nvargs} 
-\=\texttt{INPUT=}\textit{path}/\textit{file} 
-  \texttt{K=}\textit{int} \\
-\>\texttt{CENTER=}\textit{0/1}
-  \texttt{SCALE=}\textit{0/1}\\
-\>\texttt{PROJDATA=}\textit{0/1}
-  \texttt{OFMT=}\textit{csv}/\textit{text}\\
-\>\texttt{MODEL=}\textit{path}$\vert$\textit{file}
-  \texttt{OUTPUT=}\textit{path}/\textit{file}
-\end{tabbing}
-
-\noindent{\bf Arguments}
-
-\begin{itemize}
-\item INPUT: Location (on HDFS) to read the input matrix.
-\item K: Indicates dimension of the new vector space constructed from $K$ principal components. It must be a value between $1$ and the number of columns in the input data.
-\item CENTER (default: {\tt 0}): Indicates whether or not to {\em center} input data prior to the computation of principal components.
-\item SCALE (default: {\tt 0}): Indicates whether or not to {\em scale} input data prior to the computation of principal components.
-\item PROJDATA: Indicates whether or not the input data must be projected onto the new vector space defined over the principal components.
-\item OFMT (default: {\tt csv}): Specifies the output format. Choice of comma-separated values (csv) or as a sparse-matrix (text).
-\item MODEL: Either the location (on HDFS) where the computed model is stored; or the location of an existing model.
-\item OUTPUT: Location (on HDFS) to store the data rotated on to the new vector space.
-\end{itemize}
-
-\noindent{\bf Details}
-
-Principal Component Analysis (PCA) is a non-parametric procedure for orthogonal linear transformation of the input data to a new coordinate system, such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. In other words, PCA first selects a normalized direction in $m$-dimensional space ($m$ is the number of columns in the input data) along which the variance in input data is maximized -- this is referred to as the first principal component. It then repeatedly finds other directions (principal components) in which the variance is maximized. At every step, PCA restricts the search to only those directions that are perpendicular to all previously selected directions. By doing so, PCA aims to reduce the redundancy among input variables. To understand the notion of redundancy, consider an extreme scenario with a data set comprising two variables, where the first one denotes some quantity expressed in meters, and the other variable represents the same quantity but in inches. Both these variables evidently capture redundant information, and hence one of them can be removed. In a general scenario, keeping solely the linear combination of input variables would both express the data more concisely and reduce the number of variables. This is why PCA is often used as a dimensionality reduction technique.
-
-The specific method to compute such a new coordinate system is as follows -- compute a covariance matrix $C$ that measures the strength of correlation among all pairs of variables in the input data; factorize $C$ via eigendecomposition to calculate its eigenvalues and eigenvectors; and finally, order the eigenvectors in decreasing order of their corresponding eigenvalues. The computed eigenvectors (also known as {\em loadings}) define the new coordinate system, and the eigenvalues provide the amount of variance in the input data explained by each coordinate or eigenvector (their square roots being the corresponding standard deviations). 
-\\
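For reference, the procedure just described can be summarized compactly. The sketch below is editor-added and assumes the $n \times m$ input matrix $X$ has already been centered (and scaled, if requested):
\begin{equation*}
C = \tfrac{1}{n-1}\, X^{T} X, \qquad C = V \Lambda V^{T} \ \ (\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m), \qquad \tilde{X} = X\, V_{[\,:\,,\,1..K]}
\end{equation*}
where the columns of $V$ (the eigenvectors, or loadings) define the new coordinate system, the eigenvalues $\lambda_j$ give the variance explained by each component, and $\tilde{X}$ is the input projected onto the top-$K$ principal components.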
-
-%As an example, consider the data in Table~\ref{tab:pca_data}. 
-\begin{comment}
-\begin{table}
-\parbox{.35\linewidth}{
-\centering
-\begin{tabular}{cc}
-  \hline
-  x & y \\
-  \hline
-  2.5 & 2.4  \\
-  0.5 & 0.7  \\
-  2.2 & 2.9  \\
-  1.9 & 2.2  \\
-  3.1 & 3.0  \\
-  2.3 & 2.7  \\
-  2 & 1.6  \\
-  1 & 1.1  \\
-  1.5 & 1.6  \\
-  1.1 & 0.9  \\
-	\hline
-\end{tabular}
-\caption{Input Data}
-\label{tab:pca_data}
-}
-\hfill
-\parbox{.55\linewidth}{
-\centering
-\begin{tabular}{cc}
-  \hline
-  x & y \\
-  \hline
-  .69  & .49  \\
-  -1.31  & -1.21  \\
-  .39  & .99  \\
-  .09  & .29  \\
-  1.29  & 1.09  \\
-  .49  & .79  \\
-  .19  & -.31  \\
-  -.81  & -.81  \\
-  -.31  & -.31  \\
-  -.71  & -1.01  \\
-  \hline
-\end{tabular}
-\caption{Data after centering and scaling}
-\label{tab:pca_scaled_data}
-}
-\end{table}
-\end{comment}
-
-\noindent{\bf Returns}
-When MODEL is not provided, the PCA procedure is applied to the INPUT data to generate the MODEL as well as the rotated data OUTPUT (if PROJDATA is set to $1$) in the new coordinate system. 
-The produced model consists of basis vectors MODEL$/dominant.eigen.vectors$ for the new coordinate system; eigenvalues MODEL$/dominant.eigen.values$; and the standard deviations MODEL$/dominant.eigen.standard.deviations$ of the principal components.
-When MODEL is provided, the INPUT data is rotated according to the coordinate system defined by MODEL$/dominant.eigen.vectors$. The resulting data is stored at location OUTPUT.
-\\
-
-\noindent{\bf Examples}
-
-\begin{verbatim}
-hadoop jar SystemML.jar -f PCA.dml -nvargs 
-            INPUT=/user/biuser/input.mtx  K=10
-            CENTER=1  SCALE=1
-            OFMT=csv PROJDATA=1
-				    # location to store model and rotated data
-            OUTPUT=/user/biuser/pca_output/   
-\end{verbatim}
-
-\begin{verbatim}
-hadoop jar SystemML.jar -f PCA.dml -nvargs 
-            INPUT=/user/biuser/test_input.mtx  K=10
-            CENTER=1  SCALE=1
-            OFMT=csv PROJDATA=1
-				    # location of an existing model
-            MODEL=/user/biuser/pca_output/       
-				    # location of rotated data
-            OUTPUT=/user/biuser/test_output.mtx  
-\end{verbatim}
-
-
-

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/RandomForest.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/RandomForest.tex b/Algorithms Reference/RandomForest.tex
deleted file mode 100644
index f9b47f3..0000000
--- a/Algorithms Reference/RandomForest.tex	
+++ /dev/null
@@ -1,215 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Random Forests}
-\label{random_forests}
-
-\noindent{\bf Description}
-\smallskip
-
-
-Random forest is one of the most successful machine learning methods for classification and regression. 
-It is an ensemble learning method that creates a model composed of a set of tree models.
-This implementation is well-suited to handle large-scale data and builds a random forest model for classification in parallel.\\
-
-
-\smallskip
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-	{\tt{}-f }path/\/{\tt{}random-forest.dml}
-	{\tt{} -nvargs}
-	{\tt{} X=}path/file
-	{\tt{} Y=}path/file
-	{\tt{} R=}path/file
-	{\tt{} bins=}integer
-	{\tt{} depth=}integer
-	{\tt{} num\_leaf=}integer
-	{\tt{} num\_samples=}integer
-	{\tt{} num\_trees=}integer
-	{\tt{} subsamp\_rate=}double
-	{\tt{} feature\_subset=}double
-	{\tt{} impurity=}Gini$\mid$entropy
-	{\tt{} M=}path/file
-	{\tt{} C=}path/file
-	{\tt{} S\_map=}path/file
-	{\tt{} C\_map=}path/file
-	{\tt{} fmt=}format
-	
-}
-
- \smallskip
- \noindent{\bf Usage: Prediction}
- \smallskip
- 
- {\hangindent=\parindent\noindent\it%
- 	{\tt{}-f }path/\/{\tt{}random-forest-predict.dml}
- 	{\tt{} -nvargs}
- 	{\tt{} X=}path/file
- 	{\tt{} Y=}path/file
- 	{\tt{} R=}path/file
- 	{\tt{} M=}path/file
- 	{\tt{} C=}path/file
- 	{\tt{} P=}path/file
- 	{\tt{} A=}path/file
- 	{\tt{} OOB=}path/file
- 	{\tt{} CM=}path/file
- 	{\tt{} fmt=}format
- 	
- }\smallskip
- 
- 
-\noindent{\bf Arguments}
-\begin{Description}
-	\item[{\tt X}:]
-	Location (on HDFS) to read the matrix of feature vectors; 
-	each row constitutes one feature vector. Note that categorical features in $X$ need to be both recoded and dummy coded.
-	\item[{\tt Y}:]
-	Location (on HDFS) to read the matrix of (categorical) 
-	labels that correspond to feature vectors in $X$. Note that classes are assumed to be both recoded and dummy coded. 
-	This argument is optional for prediction. 
-	\item[{\tt R}:] (default:\mbox{ }{\tt " "})
-	Location (on HDFS) to read matrix $R$ which for each feature in $X$ contains column-ids (first column), start indices (second column), and end indices (third column).
-	If $R$ is not provided, all features are by default assumed to be continuous-valued.
-	\item[{\tt bins}:] (default:\mbox{ }{\tt 20})
-	Number of thresholds to choose for each continuous-valued feature (determined by equi-height binning). 
-	\item[{\tt depth}:] (default:\mbox{ }{\tt 25})
-	Maximum depth of the learned trees in the random forest model
-	\item[{\tt num\_leaf}:] (default:\mbox{ }{\tt 10})
-	Parameter that controls pruning. The tree
-	is not expanded if a node receives less than {\tt num\_leaf} training examples.
-	\item[{\tt num\_samples}:] (default:\mbox{ }{\tt 3000})
-	Parameter that decides when to switch to in-memory building of the subtrees in each tree of the random forest model. 
-	If a node $v$ receives less than {\tt num\_samples}
-	training examples then this implementation switches to an in-memory subtree
-	building procedure to build the subtree under $v$ in its entirety.
-	\item[{\tt num\_trees}:] (default:\mbox{ }{\tt 10})
-	Number of trees to be learned in the random forest model
-	\item[{\tt subsamp\_rate}:] (default:\mbox{ }{\tt 1.0})
-	Parameter controlling the size of each tree in the random forest model; samples are selected from a Poisson distribution with parameter {\tt subsamp\_rate}.
-	\item[{\tt feature\_subset}:] (default:\mbox{ }{\tt 0.5})
-	Parameter that controls the number of features used as candidates for splitting at each tree node, expressed as a power of the number of features in the data, i.e., assuming the training set has $D$ features, $D^{\tt feature\_subset}$ features are used at each tree node.
-	\item[{\tt impurity}:] (default:\mbox{ }{\tt "Gini"})
-	Impurity measure used at internal nodes of the trees in the random forest model for selecting which features to split on. Possible values are entropy and Gini.
-	\item[{\tt M}:] 
-	Location (on HDFS) to write matrix $M$ containing the learned random forest (see Section~\ref{sec:decision_trees} and below for the schema) 
-	\item[{\tt C}:] (default:\mbox{ }{\tt " "})
-	Location (on HDFS) to store the sample counts (generated according to a Poisson distribution with parameter {\tt subsamp\_rate}) for each feature vector. Note that this argument is optional. If the Out-Of-Bag (OOB) error estimate needs to be computed, this parameter is passed as input to {\tt random-forest-predict.dml}. 
-	\item[{\tt A}:] (default:\mbox{ }{\tt " "})
-	Location (on HDFS) to store the testing accuracy (\%) from a 
-	held-out test set during prediction. Note that this argument is optional.
-	\item[{\tt OOB}:] (default:\mbox{ }{\tt " "})
-	Location (on HDFS) to store the Out-Of-Bag (OOB) error estimate of the training set. Note that the matrix of sample counts (stored at {\tt C}) needs to be provided for computing OOB error estimate. Note that this argument is optional.
-	\item[{\tt P}:] 
-	Location (on HDFS) to store predictions for a held-out test set
-	\item[{\tt CM}:] (default:\mbox{ }{\tt " "})
-	Location (on HDFS) to store the confusion matrix computed using a held-out test set. Note that this argument is optional.
-	\item[{\tt S\_map}:] (default:\mbox{ }{\tt " "})
-	Location (on HDFS) to write the mappings from the continuous-valued feature-ids to the global feature-ids in $X$ (see below for details). Note that this argument is optional.
-	\item[{\tt C\_map}:] (default:\mbox{ }{\tt " "})
-	Location (on HDFS) to write the mappings from the categorical feature-ids to the global feature-ids in $X$ (see below for details). Note that this argument is optional.
-	\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-	Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-	see read/write functions in SystemML Language Reference for details.
-\end{Description}
-
-
- \noindent{\bf Details}
- \smallskip
-
-Random forests~\cite{Breiman01:rforest} are learning algorithms for ensembles of decision trees. 
-The main idea is to build a number of decision trees on bootstrapped training samples, i.e., by repeatedly taking samples from a (single) training set. 
-Moreover, instead of considering all the features when building the trees, only a random subset of the features---typically $\approx \sqrt{D}$, where $D$ is the number of features---is chosen each time a split test at a tree node is performed. 
-This procedure {\it decorrelates} the trees and makes the ensemble less prone to overfitting. 
-To build decision trees we utilize the techniques discussed in Section~\ref{sec:decision_trees} proposed in~\cite{PandaHBB09:dtree}; 
-the implementation details are similar to those of the decision trees script.
-Below we review some features of our implementation which differ from {\tt decision-tree.dml}.
-
-
-\textbf{Bootstrapped sampling.} 
-Each decision tree is fitted to a bootstrapped training set sampled with replacement (WR).  
-To improve efficiency, we generate $N$ sample counts according to a Poisson distribution with parameter {\tt subsamp\_rate},
-where $N$ denotes the total number of training points.
-These sample counts approximate WR sampling when $N$ is large enough and are generated upfront for each decision tree.
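To make the sampling step concrete, the editor-added Scala sketch below draws Poisson-distributed sample counts using Knuth's method; the function name, seed, and sizes are illustrative and not part of random-forest.dml.

\begin{verbatim}
import scala.util.Random

// Editor-added sketch: Knuth's method for Poisson(lambda), adequate for small lambda
def poisson(lambda: Double, rng: Random): Int = {
  val limit = math.exp(-lambda)
  var k = 0
  var p = rng.nextDouble()
  while (p > limit) { k += 1; p *= rng.nextDouble() }
  k
}

val rng = new Random(42)
val subsampRate = 1.0                       // plays the role of subsamp_rate
// one count per training example; a count of 0 means the example is out of bag for this tree
val sampleCounts = Array.fill(10)(poisson(subsampRate, rng))
println(sampleCounts.mkString(" "))
\end{verbatim}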
-
-
-\textbf{Bagging.}
-Decision trees suffer from {\it high variance}, resulting in different models whenever trained on random subsets of the data points.  
-{\it Bagging} is a general-purpose method to reduce the variance of a statistical learning method like decision trees.
-In the context of decision trees (for classification), for a given test feature vector 
-the prediction is computed by taking a {\it majority vote}: the overall prediction is the most commonly occurring class among all the tree predictions.
-
- 
-\textbf{Out-Of-Bag error estimation.} 
-Note that each bagged tree in a random forest model is trained on a subset (around $\frac{2}{3}$) of the observations (i.e., feature vectors).
-The remaining ($\frac{1}{3}$ of the) observations not used for training are called the {\it Out-Of-Bag} (OOB) observations. 
-This gives us a straightforward way to estimate the test error: to predict the class label of each test observation $i$ we use the trees in which $i$ was OOB.
-Our {\tt random-forest-predict.dml} script provides the OOB error estimate for a given training set if requested.  
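The editor-added Scala sketch below illustrates the idea; the array layout and names are assumptions made for the example and do not reflect the script's internal representation.

\begin{verbatim}
// Editor-added sketch: OOB error = misclassification rate where each example is
// predicted only by the trees for which it was out of bag
// treePreds(t)(i): class predicted by tree t for example i
// oob(t)(i):       true when example i was not used to train tree t
def oobError(treePreds: Array[Array[Int]], oob: Array[Array[Boolean]], labels: Array[Int]): Double = {
  val perExample = labels.indices.flatMap { i =>
    val votes = treePreds.indices.collect { case t if oob(t)(i) => treePreds(t)(i) }
    if (votes.isEmpty) None                      // in-bag for every tree: skip this example
    else {
      val majority = votes.groupBy(identity).maxBy(_._2.size)._1
      Some(if (majority != labels(i)) 1.0 else 0.0)
    }
  }
  perExample.sum / perExample.size
}
\end{verbatim}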
-
-
-\textbf{Description of the model.} 
-Similar to decision trees, the learned random forest model is presented in a matrix $M$  with at least 7 rows.
-The information stored in the model is similar to that of decision trees with the difference that the tree-ids are stored
-in the second row and rows $2,3,\ldots$ from the decision tree model are shifted by one. See Section~\ref{sec:decision_trees} for a description of the model.
-
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-
-The matrix corresponding to the learned model is written to a file in the format specified. See Section~\ref{sec:decision_trees} where the details about the structure of the model matrix are described.
-Similar to {\tt decision-tree.dml}, $X$ is split into $X_\text{cont}$ and $X_\text{cat}$. 
-If requested, the mappings of the continuous feature-ids in $X_\text{cont}$ (stored at {\tt S\_map}) as well as the categorical feature-ids in $X_\text{cat}$ (stored at {\tt C\_map}) to the global feature-ids in $X$ will be provided. 
-The {\tt random-forest-predict.dml} script may compute one or more of
-predictions, accuracy, confusion matrix, and OOB error estimate in the requested output format depending on the input arguments used. 
- 
-
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-	\hml -f random-forest.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
-	R=/user/biadmin/R.csv M=/user/biadmin/model.csv
-	bins=20 depth=25 num\_leaf=10 num\_samples=3000 num\_trees=10 impurity=Gini fmt=csv
-	
-}\smallskip
-
-
-\noindent To compute predictions:
-
-{\hangindent=\parindent\noindent\tt
-	\hml -f random-forest-predict.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx R=/user/biadmin/R.csv
-	M=/user/biadmin/model.csv P=/user/biadmin/predictions.csv
-	A=/user/biadmin/accuracy.csv CM=/user/biadmin/confusion.csv fmt=csv
-	
-}\smallskip
-
-
-%\noindent{\bf References}
-%
-%\begin{itemize}
-%\item B. Panda, J. Herbach, S. Basu, and R. Bayardo. \newblock{PLANET: massively parallel learning of tree ensembles with MapReduce}. In Proceedings of the VLDB Endowment, 2009.
-%\item L. Breiman. \newblock{Random Forests}. Machine Learning, 45(1), 5--32, 2001.
-%\end{itemize}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/StepGLM.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/StepGLM.tex b/Algorithms Reference/StepGLM.tex
deleted file mode 100644
index 3869990..0000000
--- a/Algorithms Reference/StepGLM.tex	
+++ /dev/null
@@ -1,132 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Stepwise Generalized Linear Regression}
-
-\noindent{\bf Description}
-\smallskip
-
-Our stepwise generalized linear regression script selects a model based on the Akaike information criterion (AIC): the model that gives rise to the lowest AIC is provided. Note that currently only the Bernoulli distribution family is supported (see below for details). \\
-
-\smallskip
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}StepGLM.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} Y=}path/file
-{\tt{} B=}path/file
-{\tt{} S=}path/file
-{\tt{} O=}path/file
-{\tt{} link=}int
-{\tt{} yneg=}double
-{\tt{} icpt=}int
-{\tt{} tol=}double
-{\tt{} disp=}double
-{\tt{} moi=}int
-{\tt{} mii=}int
-{\tt{} thr=}double
-{\tt{} fmt=}format
-
-}
-
-
-\smallskip
-\noindent{\bf Arguments}
-\begin{Description}
-	\item[{\tt X}:]
-	Location (on HDFS) to read the matrix of feature vectors; each row is
-	an example.
-	\item[{\tt Y}:]
-	Location (on HDFS) to read the response matrix, which may have 1 or 2 columns
-	\item[{\tt B}:]
-	Location (on HDFS) to store the estimated regression parameters (the $\beta_j$'s), with the
-	intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available
-	\item[{\tt S}:] (default:\mbox{ }{\tt " "})
-	Location (on HDFS) to store the selected feature-ids in the order computed by the algorithm;
-	by default they are forwarded to standard output.
-	\item[{\tt O}:] (default:\mbox{ }{\tt " "})
-	Location (on HDFS) to write certain summary statistics described in Table~\ref{table:GLM:stats};
-	by default they are forwarded to standard output. 
-	\item[{\tt link}:] (default:\mbox{ }{\tt 2})
-	Link function code to determine the link function~$\eta = g(\mu)$, see Table~\ref{table:commonGLMs}; currently the following link functions are supported: \\
-	{\tt 1} = log,
-	{\tt 2} = logit,
-	{\tt 3} = probit,
-	{\tt 4} = cloglog.
-	\item[{\tt yneg}:] (default:\mbox{ }{\tt 0.0})
-	Response value for Bernoulli ``No'' label, usually 0.0 or -1.0
-	\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
-	Intercept and shifting/rescaling of the features in~$X$:\\
-	{\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\
-	{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
-	{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
-	\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
-	Tolerance (epsilon) used in the convergence criterion: we terminate the outer iterations
-	when the deviance changes by less than this factor; see below for details.
-	\item[{\tt disp}:] (default:\mbox{ }{\tt 0.0})
-	Dispersion parameter, or {\tt 0.0} to estimate it from data
-	\item[{\tt moi}:] (default:\mbox{ }{\tt 200})
-	Maximum number of outer (Fisher scoring) iterations
-	\item[{\tt mii}:] (default:\mbox{ }{\tt 0})
-	Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum
-	limit provided
-	\item[{\tt thr}:] (default:\mbox{ }{\tt 0.01})
-	Threshold to stop the algorithm: if the decrease in the value of the AIC falls below {\tt thr},
-	no further features are checked and the algorithm stops.
-	\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-	Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-	see read/write functions in SystemML Language Reference for details.
-\end{Description}
-
-
-\noindent{\bf Details}
-\smallskip
-
-Similar to {\tt StepLinearRegDS.dml}, our stepwise GLM script builds a model by iteratively selecting predictive variables 
-using a forward selection strategy based on the AIC (\ref{eq:AIC}).
-Note that currently only the Bernoulli distribution family ({\tt fam=2} in Table~\ref{table:commonGLMs}) is supported, together with the following link functions: log, logit, probit, and cloglog ({\tt link $\in\{1,2,3,4\}$} in Table~\ref{table:commonGLMs}).  
-
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-Similar to the outputs from {\tt GLM.dml}, the stepwise GLM script computes the estimated regression coefficients and stores them in matrix $B$ on HDFS; matrix $B$ follows the same format as the one produced by {\tt GLM.dml} (see Section~\ref{sec:GLM}).   
-Additionally, {\tt StepGLM.dml} outputs the variable indices (stored in the 1-column matrix $S$) in the order they have been selected by the algorithm, i.e., the $i$th entry in matrix $S$ stores the variable which improves the AIC the most in the $i$th iteration.  
-If the model with the lowest AIC includes no variables, matrix $S$ will be empty. 
-Moreover, the estimated summary statistics as defined in Table~\ref{table:GLM:stats}
-are printed out or stored in a file on HDFS (if requested);
-these statistics will be provided only if the selected model is nonempty, i.e., contains at least one variable.
-
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-	\hml -f StepGLM.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx	B=/user/biadmin/B.mtx S=/user/biadmin/selected.csv O=/user/biadmin/stats.csv link=2 yneg=-1.0 icpt=2 tol=0.000001  moi=100 mii=10 thr=0.05 fmt=csv
-	
-}
-
-

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/StepLinRegDS.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/StepLinRegDS.tex b/Algorithms Reference/StepLinRegDS.tex
deleted file mode 100644
index 8c29fb1..0000000
--- a/Algorithms Reference/StepLinRegDS.tex	
+++ /dev/null
@@ -1,122 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Stepwise Linear Regression}
-
-\noindent{\bf Description}
-\smallskip
-
-Our stepwise linear regression script selects a linear model based on the Akaike information criterion (AIC): 
-the model that gives rise to the lowest AIC is computed. \\
-
-\smallskip
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}StepLinearRegDS.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} Y=}path/file
-{\tt{} B=}path/file
-{\tt{} S=}path/file
-{\tt{} O=}path/file
-{\tt{} icpt=}int
-{\tt{} thr=}double
-{\tt{} fmt=}format
-
-}
-
-\smallskip
-\noindent{\bf Arguments}
-\begin{Description}
-\item[{\tt X}:]
-Location (on HDFS) to read the matrix of feature vectors, each row contains
-one feature vector.
-\item[{\tt Y}:]
-Location (on HDFS) to read the 1-column matrix of response values
-\item[{\tt B}:]
-Location (on HDFS) to store the estimated regression parameters (the $\beta_j$'s), with the
-intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available
-\item[{\tt S}:] (default:\mbox{ }{\tt " "})
-Location (on HDFS) to store the selected feature-ids in the order as computed by the algorithm;
-by default the selected feature-ids are forwarded to the standard output.
-\item[{\tt O}:] (default:\mbox{ }{\tt " "})
-Location (on HDFS) to store the CSV-file of summary statistics defined in
-Table~\ref{table:linreg:stats}; by default the summary statistics are forwarded to the standard output.
-\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
-Intercept presence and shifting/rescaling the features in~$X$:\\
-{\tt 0} = no intercept (hence no~$\beta_0$), no shifting or rescaling of the features;\\
-{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
-{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
-\item[{\tt thr}:] (default:\mbox{ }{\tt 0.01})
-Threshold to stop the algorithm: if the decrease in the value of the AIC falls below {\tt thr},
-no further features are checked and the algorithm stops.
-\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-see read/write functions in SystemML Language Reference for details.
-\end{Description}
-
-
-\noindent{\bf Details}
-\smallskip
-
-Stepwise linear regression iteratively selects predictive variables in an automated procedure.
-Currently, our implementation supports forward selection: starting from an empty model (without any variable) 
-the algorithm examines the addition of each variable based on the AIC as a model comparison criterion. The AIC is defined as  
-\begin{equation}
-AIC = -2 \log{L} + 2 edf,\label{eq:AIC}
-\end{equation}    
-where $L$ denotes the likelihood of the fitted model and $edf$ is the equivalent degrees of freedom, i.e., the number of estimated parameters. 
-This procedure is repeated until no additional variable improves the model by more than the threshold 
-specified in the input parameter {\tt thr}. 
-
-For fitting a model in each iteration we use the ``direct solve'' method as in the script {\tt LinearRegDS.dml} discussed in Section~\ref{sec:LinReg}.  
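For intuition, the criterion and the stopping rule based on {\tt thr} can be written down in a couple of lines; the Scala sketch below is editor-added and purely illustrative.

\begin{verbatim}
// Editor-added sketch: AIC as in the equation above, AIC = -2 * log L + 2 * edf
def aic(logLikelihood: Double, edf: Double): Double = -2.0 * logLikelihood + 2.0 * edf

// forward selection continues only while the best candidate lowers the AIC by more than thr
def keepGoing(currentAic: Double, bestCandidateAic: Double, thr: Double): Boolean =
  currentAic - bestCandidateAic > thr
\end{verbatim}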
-
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-Similar to the outputs from {\tt LinearRegDS.dml}, the stepwise linear regression script computes 
-the estimated regression coefficients and stores them in matrix $B$ on HDFS. 
-The format of matrix $B$ is identical to the one produced by the scripts for linear regression (see Section~\ref{sec:LinReg}).   
-Additionally, {\tt StepLinearRegDS.dml} outputs the variable indices (stored in the 1-column matrix $S$) 
-in the order they have been selected by the algorithm, i.e., the $i$th entry in matrix $S$ corresponds to 
-the variable which improves the AIC the most in the $i$th iteration.  
-If the model with the lowest AIC includes no variables, matrix $S$ will be empty (contains one 0). 
-Moreover, the estimated summary statistics as defined in Table~\ref{table:linreg:stats}
-are printed out or stored in a file (if requested). 
-In the case where an empty model achieves the best AIC these statistics will not be produced. 
-
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-	\hml -f StepLinearRegDS.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
-	B=/user/biadmin/B.mtx S=/user/biadmin/selected.csv O=/user/biadmin/stats.csv
-	icpt=2 thr=0.05 fmt=csv
-	
-}
-
-

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/SystemML_Algorithms_Reference.bib
----------------------------------------------------------------------
diff --git a/Algorithms Reference/SystemML_Algorithms_Reference.bib b/Algorithms Reference/SystemML_Algorithms_Reference.bib
deleted file mode 100644
index 878e1dc..0000000
--- a/Algorithms Reference/SystemML_Algorithms_Reference.bib	
+++ /dev/null
@@ -1,215 +0,0 @@
-
-@article {Lin2008:logistic,
-   author       = {Chih-Jen Lin and Ruby C.\ Weng and S.\ Sathiya Keerthi},
-   title        = {Trust Region {N}ewton Method for Large-Scale Logistic Regression},
-   journal      = {Journal of Machine Learning Research},
-   month        = {April},
-   year         = {2008},
-   volume       = {9},
-   pages        = {627--650}
-}
-
-@book {Agresti2002:CDA,
-   author       = {Alan Agresti},
-   title        = {Categorical Data Analysis},
-   edition      = {Second},
-   series       = {Wiley Series in Probability and Statistics},
-   publisher    = {Wiley-Interscience},
-   year         = {2002},
-   pages        = {710}
-}
-
-@article {Nelder1972:GLM,
-   author       = {John Ashworth Nelder and Robert William Maclagan Wedderburn},
-   title        = {Generalized Linear Models},
-   journal      = {Journal of the Royal Statistical Society, Series~A (General)},
-   year         = {1972},
-   volume       = {135},
-   number       = {3},
-   pages        = {370--384}
-}
-
-@book {McCullagh1989:GLM,
-   author       = {Peter McCullagh and John Ashworth Nelder},
-   title        = {Generalized Linear Models},
-   edition      = {Second},
-   series       = {Monographs on Statistics and Applied Probability},
-   number       = {37},
-   year         = {1989},
-   publisher    = {Chapman~\&~Hall/CRC}, 
-   pages        = {532}
-}
-
-@book {Gill2000:GLM,
-   author       = {Jeff Gill},
-   title        = {Generalized Linear Models: A Unified Approach},
-   series       = {Sage University Papers Series on Quantitative Applications in the Social Sciences},
-   number       = {07-134},
-   year         = {2000},
-   publisher    = {Sage Publications},
-   pages        = {101}
-}
-
-@inproceedings {AgrawalKSX2002:hippocratic,
-   author       = {Rakesh Agrawal and Jerry Kiernan and Ramakrishnan Srikant and Yirong Xu},
-   title        = {Hippocratic Databases},
-   booktitle    = {Proceedings of the 28-th International Conference on Very Large Data Bases ({VLDB} 2002)},
-   address      = {Hong Kong, China},
-   month        = {August 20--23},
-   year         = {2002},
-   pages        = {143--154}
-}
-
-@book {Nocedal2006:Optimization,
-   title        = {Numerical Optimization},
-   author       = {Jorge Nocedal and Stephen Wright},
-   series       = {Springer Series in Operations Research and Financial Engineering},
-   pages        = {664},
-   edition      = {Second},
-   publisher    = {Springer},
-   year         = {2006}
-}
-
-@book {Hartigan1975:clustering,
-   author       = {John A.\ Hartigan},
-   title        = {Clustering Algorithms},
-   publisher    = {John Wiley~\&~Sons Inc.},
-   series       = {Probability and Mathematical Statistics},
-   month        = {April},
-   year         = {1975},
-   pages        = {365}
-}
-
-@inproceedings {ArthurVassilvitskii2007:kmeans,
-   title        = {{\tt k-means++}: The Advantages of Careful Seeding},
-   author       = {David Arthur and Sergei Vassilvitskii},
-   booktitle    = {Proceedings of the 18th Annual {ACM-SIAM} Symposium on Discrete Algorithms ({SODA}~2007)},
-   month        = {January 7--9}, 
-   year         = {2007},
-   address      = {New Orleans~{LA}, {USA}},
-   pages        = {1027--1035}
-}
-
-@article {AloiseDHP2009:kmeans,
-   author       = {Daniel Aloise and Amit Deshpande and Pierre Hansen and Preyas Popat},
-   title        = {{NP}-hardness of {E}uclidean Sum-of-squares Clustering},
-   journal      = {Machine Learning},
-   publisher    = {Kluwer Academic Publishers},
-   volume       = {75},
-   number       = {2}, 
-   month        = {May}, 
-   year         = {2009},
-   pages        = {245--248}
-}
-
-@article {Cochran1954:chisq,
-   author       = {William G.\ Cochran},
-   title        = {Some Methods for Strengthening the Common $\chi^2$ Tests},
-   journal      = {Biometrics},
-   volume       = {10},
-   number       = {4},
-   month        = {December},
-   year         = {1954},
-   pages        = {417--451}
-}
-
-@article {AcockStavig1979:CramersV,
-   author       = {Alan C.\ Acock and Gordon R.\ Stavig},
-   title        = {A Measure of Association for Nonparametric Statistics},
-   journal      = {Social Forces},
-   publisher    = {Oxford University Press},
-   volume       = {57},
-   number       = {4},
-   month        = {June},
-   year         = {1979},
-   pages        = {1381--1386}
-}
-
-@article {Stevens1946:scales,
-   author       = {Stanley Smith Stevens},
-   title        = {On the Theory of Scales of Measurement},
-   journal      = {Science},
-   month        = {June 7},
-   year         = {1946},
-   volume       = {103},
-   number       = {2684},
-   pages        = {677--680}
-}
-
-@book{collett2003:kaplanmeier,
-  title={Modelling Survival Data in Medical Research, Second Edition},
-  author={Collett, D.},
-  isbn={9781584883258},
-  lccn={2003040945},
-  series={Chapman \& Hall/CRC Texts in Statistical Science},
-  year={2003},
-  publisher={Taylor \& Francis}
-}
-
-@article{PetoPABCHMMPS1979:kaplanmeier,
-    title = {{Design and analysis of randomized clinical trials requiring prolonged observation of each patient. II. analysis and examples.}},
-    author = {Peto, R. and Pike, M. C. and Armitage, P. and Breslow, N. E. and Cox, D. R. and Howard, S. V. and Mantel, N. and McPherson, K. and Peto, J. and Smith, P. G.},
-    journal = {British journal of cancer},
-    number = {1},
-    pages = {1--39},
-    volume = {35},
-    year = {1977}
-}
-
-@inproceedings{ZhouWSP08:als,
-  author    = {Yunhong Zhou and
-               Dennis M. Wilkinson and
-               Robert Schreiber and
-               Rong Pan},
-  title     = {Large-Scale Parallel Collaborative Filtering for the Netflix Prize},
-  booktitle = {Algorithmic Aspects in Information and Management, 4th International
-               Conference, {AAIM} 2008, Shanghai, China, June 23-25, 2008. Proceedings},
-  pages     = {337--348},
-  year      = {2008}
-}
-
-@book{BreimanFOS84:dtree,
-  author    = {Leo Breiman and
-               J. H. Friedman and
-               R. A. Olshen and
-               C. J. Stone},
-  title     = {Classification and Regression Trees},
-  publisher = {Wadsworth},
-  year      = {1984},
-  isbn      = {0-534-98053-8},
-  timestamp = {Thu, 03 Jan 2002 11:51:52 +0100},
-  biburl    = {http://dblp.uni-trier.de/rec/bib/books/wa/BreimanFOS84},
-  bibsource = {dblp computer science bibliography, http://dblp.org}
-}
-
-@article{PandaHBB09:dtree,
-  author    = {Biswanath Panda and
-               Joshua Herbach and
-               Sugato Basu and
-               Roberto J. Bayardo},
-  title     = {{PLANET:} Massively Parallel Learning of Tree Ensembles with MapReduce},
-  journal   = {{PVLDB}},
-  volume    = {2},
-  number    = {2},
-  pages     = {1426--1437},
-  year      = {2009},
-  url       = {http://www.vldb.org/pvldb/2/vldb09-537.pdf},
-  timestamp = {Wed, 02 Sep 2009 09:21:18 +0200},
-  biburl    = {http://dblp.uni-trier.de/rec/bib/journals/pvldb/PandaHBB09},
-  bibsource = {dblp computer science bibliography, http://dblp.org}
-}
-
-@article{Breiman01:rforest,
-  author    = {Leo Breiman},
-  title     = {Random Forests},
-  journal   = {Machine Learning},
-  volume    = {45},
-  number    = {1},
-  pages     = {5--32},
-  year      = {2001},
-  url       = {http://dx.doi.org/10.1023/A:1010933404324},
-  doi       = {10.1023/A:1010933404324},
-  timestamp = {Thu, 26 May 2011 15:25:18 +0200},
-  biburl    = {http://dblp.uni-trier.de/rec/bib/journals/ml/Breiman01},
-  bibsource = {dblp computer science bibliography, http://dblp.org}
-}
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/SystemML_Algorithms_Reference.pdf
----------------------------------------------------------------------
diff --git a/Algorithms Reference/SystemML_Algorithms_Reference.pdf b/Algorithms Reference/SystemML_Algorithms_Reference.pdf
deleted file mode 100644
index 4087ba5..0000000
Binary files a/Algorithms Reference/SystemML_Algorithms_Reference.pdf and /dev/null differ


[36/50] [abbrv] incubator-systemml git commit: [SYSTEMML-942] added gpu option to MLContext API

Posted by de...@apache.org.
[SYSTEMML-942] added gpu option to MLContext API

Additionally,
- Changed initialization of CUDA libraries from static to per instance
- Added documentation to mlcontext programming guide

Closes #420


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/42e86e76
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/42e86e76
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/42e86e76

Branch: refs/heads/gh-pages
Commit: 42e86e76c1e324f53351fe5866ce5675482df15a
Parents: 4ec1b9f
Author: Nakul Jindal <na...@gmail.com>
Authored: Tue Mar 7 13:41:03 2017 -0800
Committer: Nakul Jindal <na...@gmail.com>
Committed: Tue Mar 7 13:41:03 2017 -0800

----------------------------------------------------------------------
 spark-mlcontext-programming-guide.md | 90 +++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/42e86e76/spark-mlcontext-programming-guide.md
----------------------------------------------------------------------
diff --git a/spark-mlcontext-programming-guide.md b/spark-mlcontext-programming-guide.md
index c15c27f..c28eaf5 100644
--- a/spark-mlcontext-programming-guide.md
+++ b/spark-mlcontext-programming-guide.md
@@ -1086,6 +1086,96 @@ mean: Double = 0.5002109404821844
 
 </div>
 
+## GPU
+
+If the driver node has a GPU, SystemML may be able to utilize it, subject to memory constraints and which instructions are used in the DML script.
+
+<div class="codetabs">
+
+<div data-lang="Scala" markdown="1">
+{% highlight scala %}
+ml.setGPU(true)
+ml.setStatistics(true)
+val matMultScript = dml("""
+A = rand(rows=10, cols=1000)
+B = rand(rows=1000, cols=10)
+C = A %*% B
+print(toString(C))
+""")
+ml.execute(matMultScript)
+{% endhighlight %}
+</div>
+
+<div data-lang="Spark Shell" markdown="1">
+{% highlight scala %}
+scala> ml.setGPU(true)
+
+scala> ml.setStatistics(true)
+
+scala> val matMultScript = dml("""
+     | A = rand(rows=10, cols=1000)
+     | B = rand(rows=1000, cols=10)
+     | C = A %*% B
+     | print(toString(C))
+     | """)
+matMultScript: org.apache.sysml.api.mlcontext.Script =
+Inputs:
+None
+
+Outputs:
+None
+
+scala> ml.execute(matMultScript)
+249.977 238.545 233.700 234.489 248.556 244.423 249.051 255.043 249.117 251.605
+249.226 248.680 245.532 238.258 254.451 249.827 260.957 251.273 250.577 257.571
+258.703 246.969 243.463 246.547 250.784 251.758 251.654 258.318 251.817 254.097
+248.788 242.960 230.920 244.026 249.159 247.998 251.330 254.718 248.013 255.706
+253.251 248.788 235.785 242.941 252.096 248.675 256.865 251.677 252.872 250.490
+256.087 245.035 234.124 238.307 248.630 252.522 251.122 251.577 249.171 247.974
+245.419 243.114 232.262 239.776 249.583 242.351 250.972 249.244 246.729 251.807
+250.081 242.367 230.334 240.955 248.332 240.730 246.940 250.396 244.107 249.729
+247.368 239.882 234.353 237.087 252.337 248.801 246.627 249.077 244.305 245.621
+252.827 257.352 239.546 246.529 258.916 255.612 260.480 254.805 252.695 257.531
+
+SystemML Statistics:
+Total elapsed time:		0.000 sec.
+Total compilation time:		0.000 sec.
+Total execution time:		0.000 sec.
+Number of compiled Spark inst:	0.
+Number of executed Spark inst:	0.
+CUDA/CuLibraries init time:	0.000/0.003 sec.
+Number of executed GPU inst:	8.
+GPU mem tx time  (alloc/dealloc/toDev/fromDev):	0.003/0.002/0.010/0.002 sec.
+GPU mem tx count (alloc/dealloc/toDev/fromDev/evict):	24/24/0/16/8/0.
+GPU conversion time  (sparseConv/sp2dense/dense2sp):	0.000/0.000/0.000 sec.
+GPU conversion count (sparseConv/sp2dense/dense2sp):	0/0/0.
+Cache hits (Mem, WB, FS, HDFS):	40/0/0/0.
+Cache writes (WB, FS, HDFS):	21/0/0.
+Cache times (ACQr/m, RLS, EXP):	0.002/0.002/0.003/0.000 sec.
+HOP DAGs recompiled (PRED, SB):	0/0.
+HOP DAGs recompile time:	0.000 sec.
+Spark ctx create time (lazy):	0.000 sec.
+Spark trans counts (par,bc,col):0/0/0.
+Spark trans times (par,bc,col):	0.000/0.000/0.000 secs.
+Total JIT compile time:		11.426 sec.
+Total JVM GC count:		20.
+Total JVM GC time:		1.078 sec.
+Heavy hitter instructions (name, time, count):
+-- 1) 	toString 	0.085 sec 	8
+-- 2) 	rand 	0.027 sec 	16
+-- 3) 	gpu_ba+* 	0.018 sec 	8
+-- 4) 	print 	0.006 sec 	8
+-- 5) 	createvar 	0.003 sec 	24
+-- 6) 	rmvar 	0.003 sec 	40
+
+res20: org.apache.sysml.api.mlcontext.MLResults =
+None
+{% endhighlight %}
+</div>
+
+</div>
+
+Note that GPU instructions show up prefixed with "gpu" in the statistics.
 
 ## Explain
 


[22/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1181] Remove old Spark MLContext API documentation

Posted by de...@apache.org.
[SYSTEMML-1181] Remove old Spark MLContext API documentation

Closes #377.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/bfb93b03
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/bfb93b03
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/bfb93b03

Branch: refs/heads/gh-pages
Commit: bfb93b03c71493ac210a2b7594eb2a3d3c15fcb5
Parents: cb6f845
Author: Felix Schueler <fe...@ibm.com>
Authored: Fri Feb 3 18:06:10 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Fri Feb 3 18:06:10 2017 -0800

----------------------------------------------------------------------
 spark-mlcontext-programming-guide.md | 1057 -----------------------------
 1 file changed, 1057 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/bfb93b03/spark-mlcontext-programming-guide.md
----------------------------------------------------------------------
diff --git a/spark-mlcontext-programming-guide.md b/spark-mlcontext-programming-guide.md
index 45c0091..e5df11f 100644
--- a/spark-mlcontext-programming-guide.md
+++ b/spark-mlcontext-programming-guide.md
@@ -34,10 +34,6 @@ The Spark `MLContext` API offers a programmatic interface for interacting with S
 such as Scala, Java, and Python. As a result, it offers a convenient way to interact with SystemML from the Spark
 Shell and from Notebooks such as Jupyter and Zeppelin.
 
-**NOTE: A new MLContext API has been redesigned for future SystemML releases. The old API is available
-in previous versions of SystemML but is deprecated and will be removed soon, so please migrate to the new API.**
-
-
 # Spark Shell Example
 
 ## Start Spark Shell with SystemML
@@ -1822,1059 +1818,6 @@ plt.title('PNMF Training Loss')
 
 ---
 
-# Spark Shell Example - OLD API
-
-### ** **NOTE: This API is old and has been deprecated.** **
-**Please use the [new MLContext API](spark-mlcontext-programming-guide#spark-shell-example) instead.**
-
-## Start Spark Shell with SystemML
-
-To use SystemML with the Spark Shell, the SystemML jar can be referenced using the Spark Shell's `--jars` option.
-Instructions to build the SystemML jar can be found in the [SystemML GitHub README](https://github.com/apache/incubator-systemml).
-
-{% highlight bash %}
-./bin/spark-shell --executor-memory 4G --driver-memory 4G --jars SystemML.jar
-{% endhighlight %}
-
-Here is an example of Spark Shell with SystemML and YARN.
-
-{% highlight bash %}
-./bin/spark-shell --master yarn-client --num-executors 3 --driver-memory 5G --executor-memory 5G --executor-cores 4 --jars SystemML.jar
-{% endhighlight %}
-
-
-## Create MLContext
-
-An `MLContext` object can be created by passing its constructor a reference to the `SparkContext`.
-
-<div class="codetabs">
-
-<div data-lang="Spark Shell" markdown="1">
-{% highlight scala %}
-scala>import org.apache.sysml.api.MLContext
-import org.apache.sysml.api.MLContext
-
-scala> val ml = new MLContext(sc)
-ml: org.apache.sysml.api.MLContext = org.apache.sysml.api.MLContext@33e38c6b
-{% endhighlight %}
-</div>
-
-<div data-lang="Statements" markdown="1">
-{% highlight scala %}
-import org.apache.sysml.api.MLContext
-val ml = new MLContext(sc)
-{% endhighlight %}
-</div>
-
-</div>
-
-
-## Create DataFrame
-
-For demonstration purposes, we'll create a `DataFrame` consisting of 100,000 rows and 1,000 columns
-of random `double`s.
-
-<div class="codetabs">
-
-<div data-lang="Spark Shell" markdown="1">
-{% highlight scala %}
-scala> import org.apache.spark.sql._
-import org.apache.spark.sql._
-
-scala> import org.apache.spark.sql.types.{StructType,StructField,DoubleType}
-import org.apache.spark.sql.types.{StructType, StructField, DoubleType}
-
-scala> import scala.util.Random
-import scala.util.Random
-
-scala> val numRows = 100000
-numRows: Int = 100000
-
-scala> val numCols = 1000
-numCols: Int = 1000
-
-scala> val data = sc.parallelize(0 to numRows-1).map { _ => Row.fromSeq(Seq.fill(numCols)(Random.nextDouble)) }
-data: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[1] at map at <console>:33
-
-scala> val schema = StructType((0 to numCols-1).map { i => StructField("C" + i, DoubleType, true) } )
-schema: org.apache.spark.sql.types.StructType = StructType(StructField(C0,DoubleType,true), StructField(C1,DoubleType,true), StructField(C2,DoubleType,true), StructField(C3,DoubleType,true), StructField(C4,DoubleType,true), StructField(C5,DoubleType,true), StructField(C6,DoubleType,true), StructField(C7,DoubleType,true), StructField(C8,DoubleType,true), StructField(C9,DoubleType,true), StructField(C10,DoubleType,true), StructField(C11,DoubleType,true), StructField(C12,DoubleType,true), StructField(C13,DoubleType,true), StructField(C14,DoubleType,true), StructField(C15,DoubleType,true), StructField(C16,DoubleType,true), StructField(C17,DoubleType,true), StructField(C18,DoubleType,true), StructField(C19,DoubleType,true), StructField(C20,DoubleType,true), StructField(C21,DoubleType,true), ...
-
-scala> val df = spark.createDataFrame(data, schema)
-df: org.apache.spark.sql.DataFrame = [C0: double, C1: double, C2: double, C3: double, C4: double, C5: double, C6: double, C7: double, C8: double, C9: double, C10: double, C11: double, C12: double, C13: double, C14: double, C15: double, C16: double, C17: double, C18: double, C19: double, C20: double, C21: double, C22: double, C23: double, C24: double, C25: double, C26: double, C27: double, C28: double, C29: double, C30: double, C31: double, C32: double, C33: double, C34: double, C35: double, C36: double, C37: double, C38: double, C39: double, C40: double, C41: double, C42: double, C43: double, C44: double, C45: double, C46: double, C47: double, C48: double, C49: double, C50: double, C51: double, C52: double, C53: double, C54: double, C55: double, C56: double, C57: double, C58: double, C5...
-
-{% endhighlight %}
-</div>
-
-<div data-lang="Statements" markdown="1">
-{% highlight scala %}
-import org.apache.spark.sql._
-import org.apache.spark.sql.types.{StructType,StructField,DoubleType}
-import scala.util.Random
-val numRows = 100000
-val numCols = 1000
-val data = sc.parallelize(0 to numRows-1).map { _ => Row.fromSeq(Seq.fill(numCols)(Random.nextDouble)) }
-val schema = StructType((0 to numCols-1).map { i => StructField("C" + i, DoubleType, true) } )
-val df = spark.createDataFrame(data, schema)
-{% endhighlight %}
-</div>
-
-</div>
-
-
-## Helper Methods
-
-For convenience, we'll create some helper methods. The SystemML output data is encapsulated in
-an `MLOutput` object. The `getScalar()` method extracts a scalar value from a `DataFrame` returned by
-`MLOutput`. The `getScalarDouble()` method returns such a value as a `Double`, and the
-`getScalarInt()` method returns such a value as an `Int`.
-
-<div class="codetabs">
-
-<div data-lang="Spark Shell" markdown="1">
-{% highlight scala %}
-scala> import org.apache.sysml.api.MLOutput
-import org.apache.sysml.api.MLOutput
-
-scala> def getScalar(outputs: MLOutput, symbol: String): Any =
-     | outputs.getDF(spark.sqlContext, symbol).first()(1)
-getScalar: (outputs: org.apache.sysml.api.MLOutput, symbol: String)Any
-
-scala> def getScalarDouble(outputs: MLOutput, symbol: String): Double =
-     | getScalar(outputs, symbol).asInstanceOf[Double]
-getScalarDouble: (outputs: org.apache.sysml.api.MLOutput, symbol: String)Double
-
-scala> def getScalarInt(outputs: MLOutput, symbol: String): Int =
-     | getScalarDouble(outputs, symbol).toInt
-getScalarInt: (outputs: org.apache.sysml.api.MLOutput, symbol: String)Int
-
-{% endhighlight %}
-</div>
-
-<div data-lang="Statements" markdown="1">
-{% highlight scala %}
-import org.apache.sysml.api.MLOutput
-def getScalar(outputs: MLOutput, symbol: String): Any =
-outputs.getDF(spark.sqlContext, symbol).first()(1)
-def getScalarDouble(outputs: MLOutput, symbol: String): Double =
-getScalar(outputs, symbol).asInstanceOf[Double]
-def getScalarInt(outputs: MLOutput, symbol: String): Int =
-getScalarDouble(outputs, symbol).toInt
-
-{% endhighlight %}
-</div>
-
-</div>
-
-
-## Convert DataFrame to Binary-Block Matrix
-
-SystemML is optimized to operate on a binary-block format for matrix representation. For large
-datasets, conversion from DataFrame to binary-block can require a significant amount of time.
-Explicit DataFrame to binary-block conversion allows algorithm performance to be measured separately
-from data conversion time.
-
-The SystemML binary-block matrix representation can be thought of as a two-dimensional array of blocks, where each block
-consists of a number of rows and columns. In this example, we specify a matrix consisting
-of blocks of size 1000x1000. The experimental `dataFrameToBinaryBlock()` method of `RDDConverterUtilsExt` is used
-to convert the `DataFrame df` to a SystemML binary-block matrix, which is represented by the datatype
-`JavaPairRDD[MatrixIndexes, MatrixBlock]`.
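-For the 100000 x 1000 matrix generated above, blocks of size 1000x1000 partition the matrix into
-a 100 x 1 grid of blocks (100000/1000 = 100 block rows and 1000/1000 = 1 block column).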
-
-<div class="codetabs">
-
-<div data-lang="Spark Shell" markdown="1">
-{% highlight scala %}
-scala> import org.apache.sysml.runtime.instructions.spark.utils.{RDDConverterUtilsExt => RDDConverterUtils}
-import org.apache.sysml.runtime.instructions.spark.utils.{RDDConverterUtilsExt=>RDDConverterUtils}
-
-scala> import org.apache.sysml.runtime.matrix.MatrixCharacteristics;
-import org.apache.sysml.runtime.matrix.MatrixCharacteristics
-
-scala> val numRowsPerBlock = 1000
-numRowsPerBlock: Int = 1000
-
-scala> val numColsPerBlock = 1000
-numColsPerBlock: Int = 1000
-
-scala> val mc = new MatrixCharacteristics(numRows, numCols, numRowsPerBlock, numColsPerBlock)
-mc: org.apache.sysml.runtime.matrix.MatrixCharacteristics = [100000 x 1000, nnz=-1, blocks (1000 x 1000)]
-
-scala> val sysMlMatrix = RDDConverterUtils.dataFrameToBinaryBlock(sc, df, mc, false)
-sysMlMatrix: org.apache.spark.api.java.JavaPairRDD[org.apache.sysml.runtime.matrix.data.MatrixIndexes,org.apache.sysml.runtime.matrix.data.MatrixBlock] = org.apache.spark.api.java.JavaPairRDD@2bce3248
-
-{% endhighlight %}
-</div>
-
-<div data-lang="Statements" markdown="1">
-{% highlight scala %}
-import org.apache.sysml.runtime.instructions.spark.utils.{RDDConverterUtilsExt => RDDConverterUtils}
-import org.apache.sysml.runtime.matrix.MatrixCharacteristics;
-val numRowsPerBlock = 1000
-val numColsPerBlock = 1000
-val mc = new MatrixCharacteristics(numRows, numCols, numRowsPerBlock, numColsPerBlock)
-val sysMlMatrix = RDDConverterUtils.dataFrameToBinaryBlock(sc, df, mc, false)
-
-{% endhighlight %}
-</div>
-
-</div>
-
-
-## DML Script
-
-For this example, we will utilize the following DML Script called `shape.dml` that reads in a matrix and outputs the number of rows and the
-number of columns, each represented as a matrix.
-
-{% highlight r %}
-X = read($Xin)
-m = matrix(nrow(X), rows=1, cols=1)
-n = matrix(ncol(X), rows=1, cols=1)
-write(m, $Mout)
-write(n, $Nout)
-{% endhighlight %}
-
-
-## Execute Script
-
-Let's execute our DML script, as shown in the example below. The call to `reset()` of `MLContext` is not necessary here, but this method should
-be called if you need to reset inputs and outputs or if you would like to call `execute()` with a different script.
-
-An example of registering the `DataFrame df` as an input to the `X` variable is shown but commented out. If a DataFrame is registered directly,
-it will implicitly be converted to SystemML's binary-block format. However, since we've already explicitly converted the DataFrame to the
-binary-block fixed variable `sysMlMatrix`, we will register this input to the `X` variable. We register the `m` and `n` variables
-as outputs.
-
-When SystemML is executed via `DMLScript` (such as in Standalone Mode), inputs are supplied as either command-line named arguments
-or positional arguments. These inputs are specified in DML scripts by prepending them with a `$`. Values are read from or written
-to files using `read`/`write` (DML) and `load`/`save` (PyDML) statements. When utilizing the `MLContext` API,
-inputs and outputs can be other data representations, such as `DataFrame`s. The input and output data are bound to DML variables.
-The named arguments in the `shape.dml` script do not have default values set for them, so we create a `Map` to map the required named
-arguments to blank `String`s so that the script can pass validation.
-
-The `shape.dml` script is executed by the call to `execute()`, where we supply the `Map` of required named arguments. The
-execution results are returned as the `MLOutput` fixed variable `outputs`. The number of rows is obtained by calling the `getScalarInt()`
-helper method with the `outputs` object and `"m"`. The number of columns is retrieved by calling `getScalarInt()` with
-`outputs` and `"n"`.
-
-<div class="codetabs">
-
-<div data-lang="Spark Shell" markdown="1">
-{% highlight scala %}
-scala> ml.reset()
-
-scala> //ml.registerInput("X", df) // implicit conversion of DataFrame to binary-block
-
-scala> ml.registerInput("X", sysMlMatrix, numRows, numCols)
-
-scala> ml.registerOutput("m")
-
-scala> ml.registerOutput("n")
-
-scala> val nargs = Map("Xin" -> " ", "Mout" -> " ", "Nout" -> " ")
-nargs: scala.collection.immutable.Map[String,String] = Map(Xin -> " ", Mout -> " ", Nout -> " ")
-
-scala> val outputs = ml.execute("shape.dml", nargs)
-15/10/12 16:29:15 WARN : Your hostname, derons-mbp.usca.ibm.com resolves to a loopback/non-reachable address: 127.0.0.1, but we couldn't find any external IP address!
-15/10/12 16:29:15 WARN OptimizerUtils: Auto-disable multi-threaded text read for 'text' and 'csv' due to thread contention on JRE < 1.8 (java.version=1.7.0_80).
-outputs: org.apache.sysml.api.MLOutput = org.apache.sysml.api.MLOutput@4d424743
-
-scala> val m = getScalarInt(outputs, "m")
-m: Int = 100000
-
-scala> val n = getScalarInt(outputs, "n")
-n: Int = 1000
-
-{% endhighlight %}
-</div>
-
-<div data-lang="Statements" markdown="1">
-{% highlight scala %}
-ml.reset()
-//ml.registerInput("X", df) // implicit conversion of DataFrame to binary-block
-ml.registerInput("X", sysMlMatrix, numRows, numCols)
-ml.registerOutput("m")
-ml.registerOutput("n")
-val nargs = Map("Xin" -> " ", "Mout" -> " ", "Nout" -> " ")
-val outputs = ml.execute("shape.dml", nargs)
-val m = getScalarInt(outputs, "m")
-val n = getScalarInt(outputs, "n")
-
-{% endhighlight %}
-</div>
-
-</div>
-
-
-## DML Script as String
-
-The `MLContext` API allows a DML script to be specified
-as a `String`. Here, we specify a DML script as a fixed `String` variable called `minMaxMeanScript`.
-This DML script computes the minimum, maximum, and mean values of a matrix.
-
-<div class="codetabs">
-
-<div data-lang="Spark Shell" markdown="1">
-{% highlight scala %}
-scala> val minMaxMeanScript: String =
-     | """
-     | Xin = read(" ")
-     | minOut = matrix(min(Xin), rows=1, cols=1)
-     | maxOut = matrix(max(Xin), rows=1, cols=1)
-     | meanOut = matrix(mean(Xin), rows=1, cols=1)
-     | write(minOut, " ")
-     | write(maxOut, " ")
-     | write(meanOut, " ")
-     | """
-minMaxMeanScript: String =
-"
-Xin = read(" ")
-minOut = matrix(min(Xin), rows=1, cols=1)
-maxOut = matrix(max(Xin), rows=1, cols=1)
-meanOut = matrix(mean(Xin), rows=1, cols=1)
-write(minOut, " ")
-write(maxOut, " ")
-write(meanOut, " ")
-"
-
-{% endhighlight %}
-</div>
-
-<div data-lang="Statements" markdown="1">
-{% highlight scala %}
-val minMaxMeanScript: String =
-"""
-Xin = read(" ")
-minOut = matrix(min(Xin), rows=1, cols=1)
-maxOut = matrix(max(Xin), rows=1, cols=1)
-meanOut = matrix(mean(Xin), rows=1, cols=1)
-write(minOut, " ")
-write(maxOut, " ")
-write(meanOut, " ")
-"""
-
-{% endhighlight %}
-</div>
-
-</div>
-
-## Scala Wrapper for DML
-
-We can create a Scala wrapper for our invocation of the `minMaxMeanScript` DML `String`. The `minMaxMean()` method
-takes a `JavaPairRDD[MatrixIndexes, MatrixBlock]` parameter, which is a SystemML binary-block matrix representation.
-It also takes a `rows` parameter indicating the number of rows in the matrix, a `cols` parameter indicating the number
-of columns in the matrix, and an `MLContext` parameter. The `minMaxMean()` method
-returns a tuple consisting of the minimum value in the matrix, the maximum value in the matrix, and the computed
-mean value of the matrix.
-
-<div class="codetabs">
-
-<div data-lang="Spark Shell" markdown="1">
-{% highlight scala %}
-scala> import org.apache.sysml.runtime.matrix.data.MatrixIndexes
-import org.apache.sysml.runtime.matrix.data.MatrixIndexes
-
-scala> import org.apache.sysml.runtime.matrix.data.MatrixBlock
-import org.apache.sysml.runtime.matrix.data.MatrixBlock
-
-scala> import org.apache.spark.api.java.JavaPairRDD
-import org.apache.spark.api.java.JavaPairRDD
-
-scala> def minMaxMean(mat: JavaPairRDD[MatrixIndexes, MatrixBlock], rows: Int, cols: Int, ml: MLContext): (Double, Double, Double) = {
-     | ml.reset()
-     | ml.registerInput("Xin", mat, rows, cols)
-     | ml.registerOutput("minOut")
-     | ml.registerOutput("maxOut")
-     | ml.registerOutput("meanOut")
-     | val outputs = ml.executeScript(minMaxMeanScript)
-     | val minOut = getScalarDouble(outputs, "minOut")
-     | val maxOut = getScalarDouble(outputs, "maxOut")
-     | val meanOut = getScalarDouble(outputs, "meanOut")
-     | (minOut, maxOut, meanOut)
-     | }
-minMaxMean: (mat: org.apache.spark.api.java.JavaPairRDD[org.apache.sysml.runtime.matrix.data.MatrixIndexes,org.apache.sysml.runtime.matrix.data.MatrixBlock], rows: Int, cols: Int, ml: org.apache.sysml.api.MLContext)(Double, Double, Double)
-
-{% endhighlight %}
-</div>
-
-<div data-lang="Statements" markdown="1">
-{% highlight scala %}
-import org.apache.sysml.runtime.matrix.data.MatrixIndexes
-import org.apache.sysml.runtime.matrix.data.MatrixBlock
-import org.apache.spark.api.java.JavaPairRDD
-def minMaxMean(mat: JavaPairRDD[MatrixIndexes, MatrixBlock], rows: Int, cols: Int, ml: MLContext): (Double, Double, Double) = {
-ml.reset()
-ml.registerInput("Xin", mat, rows, cols)
-ml.registerOutput("minOut")
-ml.registerOutput("maxOut")
-ml.registerOutput("meanOut")
-val outputs = ml.executeScript(minMaxMeanScript)
-val minOut = getScalarDouble(outputs, "minOut")
-val maxOut = getScalarDouble(outputs, "maxOut")
-val meanOut = getScalarDouble(outputs, "meanOut")
-(minOut, maxOut, meanOut)
-}
-
-{% endhighlight %}
-</div>
-
-</div>
-
-
-## Invoking DML via Scala Wrapper
-
-Here, we invoke `minMaxMeanScript` using our `minMaxMean()` Scala wrapper method. It returns a tuple
-consisting of the minimum value in the matrix, the maximum value in the matrix, and the mean value of the matrix.
-
-<div class="codetabs">
-
-<div data-lang="Spark Shell" markdown="1">
-{% highlight scala %}
-scala> val (min, max, mean) = minMaxMean(sysMlMatrix, numRows, numCols, ml)
-15/10/13 14:33:11 WARN OptimizerUtils: Auto-disable multi-threaded text read for 'text' and 'csv' due to thread contention on JRE < 1.8 (java.version=1.7.0_80).
-min: Double = 5.378949397005783E-9                                              
-max: Double = 0.9999999934660398
-mean: Double = 0.499988222338507
-
-{% endhighlight %}
-</div>
-
-<div data-lang="Statements" markdown="1">
-{% highlight scala %}
-val (min, max, mean) = minMaxMean(sysMlMatrix, numRows, numCols, ml)
-
-{% endhighlight %}
-</div>
-
-</div>
-
----
-
-# Zeppelin Notebook Example - Linear Regression Algorithm - OLD API
-
-### ** **NOTE: This API is old and has been deprecated.** **
-**Please use the [new MLContext API](spark-mlcontext-programming-guide#spark-shell-example) instead.**
-
-Next, we'll consider an example of a SystemML linear regression algorithm run from Spark through an Apache Zeppelin notebook.
-Instructions to clone and build Zeppelin can be found at the [GitHub Apache Zeppelin](https://github.com/apache/incubator-zeppelin)
-site. This example will also look at the Spark ML linear regression algorithm.
-
-This Zeppelin notebook example can be imported by choosing `Import note` -> `Add from URL` from the Zeppelin main page, then insert the following URL:
-
-    https://raw.githubusercontent.com/apache/incubator-systemml/master/samples/zeppelin-notebooks/2AZ2AQ12B/note.json
-
-Alternatively download <a href="https://raw.githubusercontent.com/apache/incubator-systemml/master/samples/zeppelin-notebooks/2AZ2AQ12B/note.json" download="note.json">note.json</a>, then import it by choosing `Import note` -> `Choose a JSON here` from the Zeppelin main page.
-
-A `conf/zeppelin-env.sh` file is created based on `conf/zeppelin-env.sh.template`. For
-this demonstration, it features `SPARK_HOME`, `SPARK_SUBMIT_OPTIONS`, and `ZEPPELIN_SPARK_USEHIVECONTEXT`
-environment variables:
-
-	export SPARK_HOME=/Users/example/spark-1.5.1-bin-hadoop2.6
-	export SPARK_SUBMIT_OPTIONS="--jars /Users/example/systemml/system-ml/target/SystemML.jar"
-	export ZEPPELIN_SPARK_USEHIVECONTEXT=false
-
-Start Zeppelin using the `zeppelin.sh` script:
-
-	bin/zeppelin.sh
-
-After opening Zeppelin in a browser, we see the "SystemML - Linear Regression" note in the list of available
-Zeppelin notes.
-
-![Zeppelin Notebook](img/spark-mlcontext-programming-guide/zeppelin-notebook.png "Zeppelin Notebook")
-
-If we go to the "SystemML - Linear Regression" note, we see that the note consists of several cells of code.
-
-![Zeppelin 'SystemML - Linear Regression' Note](img/spark-mlcontext-programming-guide/zeppelin-notebook-systemml-linear-regression.png "Zeppelin 'SystemML - Linear Regression' Note")
-
-Let's briefly consider these cells.
-
-## Trigger Spark Startup
-
-This cell triggers Spark initialization by referencing the `SparkContext` object `sc`. Information about these startup operations can be viewed in the
-console window in which `zeppelin.sh` is running.
-
-**Cell:**
-{% highlight scala %}
-// Trigger Spark Startup
-sc
-{% endhighlight %}
-
-**Output:**
-{% highlight scala %}
-res8: org.apache.spark.SparkContext = org.apache.spark.SparkContext@6ce70bf3
-{% endhighlight %}
-
-
-## Generate Linear Regression Test Data
-
-The Spark `LinearDataGenerator` is used to generate test data for the Spark ML and SystemML linear regression algorithms.
-
-**Cell:**
-{% highlight scala %}
-// Generate data
-import org.apache.spark.mllib.util.LinearDataGenerator
-import spark.implicits._
-
-val numRows = 10000
-val numCols = 1000
-val rawData = LinearDataGenerator.generateLinearRDD(sc, numRows, numCols, 1).toDF()
-
-// Repartition into a more parallelism-friendly number of partitions
-val data = rawData.repartition(64).cache()
-{% endhighlight %}
-
-**Output:**
-{% highlight scala %}
-import org.apache.spark.mllib.util.LinearDataGenerator
-numRows: Int = 10000
-numCols: Int = 1000
-rawData: org.apache.spark.sql.DataFrame = [label: double, features: vector]
-data: org.apache.spark.sql.DataFrame = [label: double, features: vector]
-{% endhighlight %}
-
-
-## Train using Spark ML Linear Regression Algorithm for Comparison
-
-For purposes of comparison, we can train a model using the Spark ML linear regression
-algorithm.
-
-**Cell:**
-{% highlight scala %}
-// Spark ML
-import org.apache.spark.ml.regression.LinearRegression
-
-// Model Settings
-val maxIters = 100
-val reg = 0
-val elasticNetParam = 0  // L2 reg
-
-// Fit the model
-val lr = new LinearRegression()
-  .setMaxIter(maxIters)
-  .setRegParam(reg)
-  .setElasticNetParam(elasticNetParam)
-val start = System.currentTimeMillis()
-val model = lr.fit(data)
-val trainingTime = (System.currentTimeMillis() - start).toDouble / 1000.0
-
-// Summarize the model over the training set and gather some metrics
-val trainingSummary = model.summary
-val r2 = trainingSummary.r2
-val iters = trainingSummary.totalIterations
-val trainingTimePerIter = trainingTime / iters
-{% endhighlight %}
-
-**Output:**
-{% highlight scala %}
-import org.apache.spark.ml.regression.LinearRegression
-maxIters: Int = 100
-reg: Int = 0
-elasticNetParam: Int = 0
-lr: org.apache.spark.ml.regression.LinearRegression = linReg_a7f51d676562
-start: Long = 1444672044647
-model: org.apache.spark.ml.regression.LinearRegressionModel = linReg_a7f51d676562
-trainingTime: Double = 12.985
-trainingSummary: org.apache.spark.ml.regression.LinearRegressionTrainingSummary = org.apache.spark.ml.regression.LinearRegressionTrainingSummary@227ba28b
-r2: Double = 0.9677118209276552
-iters: Int = 17
-trainingTimePerIter: Double = 0.7638235294117647
-{% endhighlight %}
-
-
-## Spark ML Linear Regression Summary Statistics
-
-Summary statistics for the Spark ML linear regression algorithm are displayed by this cell.
-
-**Cell:**
-{% highlight scala %}
-// Print statistics
-println(s"R2: ${r2}")
-println(s"Iterations: ${iters}")
-println(s"Training time per iter: ${trainingTimePerIter} seconds")
-{% endhighlight %}
-
-**Output:**
-{% highlight scala %}
-R2: 0.9677118209276552
-Iterations: 17
-Training time per iter: 0.7638235294117647 seconds
-{% endhighlight %}
-
-
-## SystemML Linear Regression Algorithm
-
-The `linearReg` fixed `String` variable is set to
-a linear regression algorithm written in DML, SystemML's Declarative Machine Learning language.
-
-
-
-**Cell:**
-{% highlight scala %}
-// SystemML kernels
-val linearReg =
-"""
-#
-# THIS SCRIPT SOLVES LINEAR REGRESSION USING THE CONJUGATE GRADIENT ALGORITHM
-#
-# INPUT PARAMETERS:
-# --------------------------------------------------------------------------------------------
-# NAME  TYPE   DEFAULT  MEANING
-# --------------------------------------------------------------------------------------------
-# X     String  ---     Matrix X of feature vectors
-# Y     String  ---     1-column Matrix Y of response values
-# icpt  Int      0      Intercept presence, shifting and rescaling the columns of X:
-#                       0 = no intercept, no shifting, no rescaling;
-#                       1 = add intercept, but neither shift nor rescale X;
-#                       2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
-# reg   Double 0.000001 Regularization constant (lambda) for L2-regularization; set to nonzero
-#                       for highly dependent/sparse/numerous features
-# tol   Double 0.000001 Tolerance (epsilon); conjugate gradient procedure terminates early if
-#                       L2 norm of the beta-residual is less than tolerance * its initial norm
-# maxi  Int      0      Maximum number of conjugate gradient iterations, 0 = no maximum
-# --------------------------------------------------------------------------------------------
-#
-# OUTPUT:
-# B Estimated regression parameters (the betas) to store
-#
-# Note: Matrix of regression parameters (the betas) and its size depend on icpt input value:
-#         OUTPUT SIZE:   OUTPUT CONTENTS:                HOW TO PREDICT Y FROM X AND B:
-# icpt=0: ncol(X)   x 1  Betas for X only                Y ~ X %*% B[1:ncol(X), 1], or just X %*% B
-# icpt=1: ncol(X)+1 x 1  Betas for X and intercept       Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
-# icpt=2: ncol(X)+1 x 2  Col.1: betas for X & intercept  Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
-#                        Col.2: betas for shifted/rescaled X and intercept
-#
-
-fileX = "";
-fileY = "";
-fileB = "";
-
-intercept_status = ifdef ($icpt, 0);     # $icpt=0;
-tolerance = ifdef ($tol, 0.000001);      # $tol=0.000001;
-max_iteration = ifdef ($maxi, 0);        # $maxi=0;
-regularization = ifdef ($reg, 0.000001); # $reg=0.000001;
-
-X = read (fileX);
-y = read (fileY);
-
-n = nrow (X);
-m = ncol (X);
-ones_n = matrix (1, rows = n, cols = 1);
-zero_cell = matrix (0, rows = 1, cols = 1);
-
-# Introduce the intercept, shift and rescale the columns of X if needed
-
-m_ext = m;
-if (intercept_status == 1 | intercept_status == 2)  # add the intercept column
-{
-    X = append (X, ones_n);
-    m_ext = ncol (X);
-}
-
-scale_lambda = matrix (1, rows = m_ext, cols = 1);
-if (intercept_status == 1 | intercept_status == 2)
-{
-    scale_lambda [m_ext, 1] = 0;
-}
-
-if (intercept_status == 2)  # scale-&-shift X columns to mean 0, variance 1
-{                           # Important assumption: X [, m_ext] = ones_n
-    avg_X_cols = t(colSums(X)) / n;
-    var_X_cols = (t(colSums (X ^ 2)) - n * (avg_X_cols ^ 2)) / (n - 1);
-    is_unsafe = ppred (var_X_cols, 0.0, "<=");
-    scale_X = 1.0 / sqrt (var_X_cols * (1 - is_unsafe) + is_unsafe);
-    scale_X [m_ext, 1] = 1;
-    shift_X = - avg_X_cols * scale_X;
-    shift_X [m_ext, 1] = 0;
-} else {
-    scale_X = matrix (1, rows = m_ext, cols = 1);
-    shift_X = matrix (0, rows = m_ext, cols = 1);
-}
-
-# Henceforth, if intercept_status == 2, we use "X %*% (SHIFT/SCALE TRANSFORM)"
-# instead of "X".  However, in order to preserve the sparsity of X,
-# we apply the transform associatively to some other part of the expression
-# in which it occurs.  To avoid materializing a large matrix, we rewrite it:
-#
-# ssX_A  = (SHIFT/SCALE TRANSFORM) %*% A    --- is rewritten as:
-# ssX_A  = diag (scale_X) %*% A;
-# ssX_A [m_ext, ] = ssX_A [m_ext, ] + t(shift_X) %*% A;
-#
-# tssX_A = t(SHIFT/SCALE TRANSFORM) %*% A   --- is rewritten as:
-# tssX_A = diag (scale_X) %*% A + shift_X %*% A [m_ext, ];
-
-lambda = scale_lambda * regularization;
-beta_unscaled = matrix (0, rows = m_ext, cols = 1);
-
-if (max_iteration == 0) {
-    max_iteration = m_ext;
-}
-i = 0;
-
-# BEGIN THE CONJUGATE GRADIENT ALGORITHM
-r = - t(X) %*% y;
-
-if (intercept_status == 2) {
-    r = scale_X * r + shift_X %*% r [m_ext, ];
-}
-
-p = - r;
-norm_r2 = sum (r ^ 2);
-norm_r2_initial = norm_r2;
-norm_r2_target = norm_r2_initial * tolerance ^ 2;
-
-while (i < max_iteration & norm_r2 > norm_r2_target)
-{
-    if (intercept_status == 2) {
-        ssX_p = scale_X * p;
-        ssX_p [m_ext, ] = ssX_p [m_ext, ] + t(shift_X) %*% p;
-    } else {
-        ssX_p = p;
-    }
-
-    q = t(X) %*% (X %*% ssX_p);
-
-    if (intercept_status == 2) {
-        q = scale_X * q + shift_X %*% q [m_ext, ];
-    }
-
-    q = q + lambda * p;
-    a = norm_r2 / sum (p * q);
-    beta_unscaled = beta_unscaled + a * p;
-    r = r + a * q;
-    old_norm_r2 = norm_r2;
-    norm_r2 = sum (r ^ 2);
-    p = -r + (norm_r2 / old_norm_r2) * p;
-    i = i + 1;
-}
-# END THE CONJUGATE GRADIENT ALGORITHM
-
-if (intercept_status == 2) {
-    beta = scale_X * beta_unscaled;
-    beta [m_ext, ] = beta [m_ext, ] + t(shift_X) %*% beta_unscaled;
-} else {
-    beta = beta_unscaled;
-}
-
-# Output statistics
-avg_tot = sum (y) / n;
-ss_tot = sum (y ^ 2);
-ss_avg_tot = ss_tot - n * avg_tot ^ 2;
-var_tot = ss_avg_tot / (n - 1);
-y_residual = y - X %*% beta;
-avg_res = sum (y_residual) / n;
-ss_res = sum (y_residual ^ 2);
-ss_avg_res = ss_res - n * avg_res ^ 2;
-
-R2_temp = 1 - ss_res / ss_avg_tot
-R2 = matrix(R2_temp, rows=1, cols=1)
-write(R2, "")
-
-totalIters = matrix(i, rows=1, cols=1)
-write(totalIters, "")
-
-# Prepare the output matrix
-if (intercept_status == 2) {
-    beta_out = append (beta, beta_unscaled);
-} else {
-    beta_out = beta;
-}
-
-write (beta_out, fileB);
-"""
-{% endhighlight %}
-
-**Output:**
-
-None
-
-
-## Helper Methods
-
-This cell contains helper methods to return `Double` and `Int` values from output generated by the `MLContext` API.
-
-**Cell:**
-{% highlight scala %}
-// Helper functions
-import org.apache.sysml.api.MLOutput
-
-def getScalar(outputs: MLOutput, symbol: String): Any =
-    outputs.getDF(spark.sqlContext, symbol).first()(1)
-
-def getScalarDouble(outputs: MLOutput, symbol: String): Double =
-    getScalar(outputs, symbol).asInstanceOf[Double]
-
-def getScalarInt(outputs: MLOutput, symbol: String): Int =
-    getScalarDouble(outputs, symbol).toInt
-{% endhighlight %}
-
-**Output:**
-{% highlight scala %}
-import org.apache.sysml.api.MLOutput
-getScalar: (outputs: org.apache.sysml.api.MLOutput, symbol: String)Any
-getScalarDouble: (outputs: org.apache.sysml.api.MLOutput, symbol: String)Double
-getScalarInt: (outputs: org.apache.sysml.api.MLOutput, symbol: String)Int
-{% endhighlight %}
-
-
-## Convert DataFrame to Binary-Block Format
-
-SystemML uses a binary-block format for matrix data representation. This cell
-explicitly converts the `DataFrame` `data` object to a binary-block `features` matrix
-and single-column `label` matrix, both represented by the
-`JavaPairRDD[MatrixIndexes, MatrixBlock]` datatype.
-
-
-**Cell:**
-{% highlight scala %}
-// Imports
-import org.apache.sysml.api.MLContext
-import org.apache.sysml.runtime.instructions.spark.utils.{RDDConverterUtilsExt => RDDConverterUtils}
-import org.apache.sysml.runtime.matrix.MatrixCharacteristics;
-
-// Create SystemML context
-val ml = new MLContext(sc)
-
-// Convert data to proper format
-val mcX = new MatrixCharacteristics(numRows, numCols, 1000, 1000)
-val mcY = new MatrixCharacteristics(numRows, 1, 1000, 1000)
-val X = RDDConverterUtils.vectorDataFrameToBinaryBlock(sc, data, mcX, false, "features")
-val y = RDDConverterUtils.dataFrameToBinaryBlock(sc, data.select("label"), mcY, false)
-// val y = data.select("label")
-
-// Cache
-val X2 = X.cache()
-val y2 = y.cache()
-val cnt1 = X2.count()
-val cnt2 = y2.count()
-{% endhighlight %}
-
-**Output:**
-{% highlight scala %}
-import org.apache.sysml.api.MLContext
-import org.apache.sysml.runtime.instructions.spark.utils.{RDDConverterUtilsExt=>RDDConverterUtils}
-import org.apache.sysml.runtime.matrix.MatrixCharacteristics
-ml: org.apache.sysml.api.MLContext = org.apache.sysml.api.MLContext@38d59245
-mcX: org.apache.sysml.runtime.matrix.MatrixCharacteristics = [10000 x 1000, nnz=-1, blocks (1000 x 1000)]
-mcY: org.apache.sysml.runtime.matrix.MatrixCharacteristics = [10000 x 1, nnz=-1, blocks (1000 x 1000)]
-X: org.apache.spark.api.java.JavaPairRDD[org.apache.sysml.runtime.matrix.data.MatrixIndexes,org.apache.sysml.runtime.matrix.data.MatrixBlock] = org.apache.spark.api.java.JavaPairRDD@b5a86e3
-y: org.apache.spark.api.java.JavaPairRDD[org.apache.sysml.runtime.matrix.data.MatrixIndexes,org.apache.sysml.runtime.matrix.data.MatrixBlock] = org.apache.spark.api.java.JavaPairRDD@56377665
-X2: org.apache.spark.api.java.JavaPairRDD[org.apache.sysml.runtime.matrix.data.MatrixIndexes,org.apache.sysml.runtime.matrix.data.MatrixBlock] = org.apache.spark.api.java.JavaPairRDD@650f29d2
-y2: org.apache.spark.api.java.JavaPairRDD[org.apache.sysml.runtime.matrix.data.MatrixIndexes,org.apache.sysml.runtime.matrix.data.MatrixBlock] = org.apache.spark.api.java.JavaPairRDD@334857a8
-cnt1: Long = 10
-cnt2: Long = 10
-{% endhighlight %}
-
-
-## Train using SystemML Linear Regression Algorithm
-
-Now, we can train our model using the SystemML linear regression algorithm. We register the features matrix `X` and the label matrix `y` as inputs. We register the `beta_out` matrix,
-`R2`, and `totalIters` as outputs.
-
-**Cell:**
-{% highlight scala %}
-// Register inputs & outputs
-ml.reset()  
-ml.registerInput("X", X, numRows, numCols)
-ml.registerInput("y", y, numRows, 1)
-// ml.registerInput("y", y)
-ml.registerOutput("beta_out")
-ml.registerOutput("R2")
-ml.registerOutput("totalIters")
-
-// Run the script
-val start = System.currentTimeMillis()
-val outputs = ml.executeScript(linearReg)
-val trainingTime = (System.currentTimeMillis() - start).toDouble / 1000.0
-
-// Get outputs
-val B = outputs.getDF(spark.sqlContext, "beta_out").sort("ID").drop("ID")
-val r2 = getScalarDouble(outputs, "R2")
-val iters = getScalarInt(outputs, "totalIters")
-val trainingTimePerIter = trainingTime / iters
-{% endhighlight %}
-
-**Output:**
-{% highlight scala %}
-start: Long = 1444672090620
-outputs: org.apache.sysml.api.MLOutput = org.apache.sysml.api.MLOutput@5d2c22d0
-trainingTime: Double = 1.176
-B: org.apache.spark.sql.DataFrame = [C1: double]
-r2: Double = 0.9677079547216473
-iters: Int = 12
-trainingTimePerIter: Double = 0.09799999999999999
-{% endhighlight %}
-
-
-## SystemML Linear Regression Summary Statistics
-
-SystemML linear regression summary statistics are displayed by this cell.
-
-**Cell:**
-{% highlight scala %}
-// Print statistics
-println(s"R2: ${r2}")
-println(s"Iterations: ${iters}")
-println(s"Training time per iter: ${trainingTimePerIter} seconds")
-B.describe().show()
-{% endhighlight %}
-
-**Output:**
-{% highlight scala %}
-R2: 0.9677079547216473
-Iterations: 12
-Training time per iter: 0.2334166666666667 seconds
-+-------+-------------------+
-|summary|                 C1|
-+-------+-------------------+
-|  count|               1000|
-|   mean| 0.0184500840658385|
-| stddev| 0.2764750319432085|
-|    min|-0.5426068958986378|
-|    max| 0.5225309861616542|
-+-------+-------------------+
-{% endhighlight %}
-
-
----
-
-# Jupyter (PySpark) Notebook Example - Poisson Nonnegative Matrix Factorization - OLD API
-
-### ** **NOTE: This API is old and has been deprecated.** **
-**Please use the [new MLContext API](spark-mlcontext-programming-guide#jupyter-pyspark-notebook-example---poisson-nonnegative-matrix-factorization) instead.**
-
-Here, we'll explore the use of SystemML via PySpark in a [Jupyter notebook](http://jupyter.org/).
-This Jupyter notebook example can be nicely viewed in a rendered state
-[on GitHub](https://github.com/apache/incubator-systemml/blob/master/samples/jupyter-notebooks/SystemML-PySpark-Recommendation-Demo.ipynb),
-and can be [downloaded here](https://raw.githubusercontent.com/apache/incubator-systemml/master/samples/jupyter-notebooks/SystemML-PySpark-Recommendation-Demo.ipynb) to a directory of your choice.
-
-From the directory with the downloaded notebook, start Jupyter with PySpark:
-
-    PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" $SPARK_HOME/bin/pyspark --master local[*] --driver-class-path $SYSTEMML_HOME/SystemML.jar
-
-This will open Jupyter in a browser:
-
-![Jupyter Notebook](img/spark-mlcontext-programming-guide/jupyter1.png "Jupyter Notebook")
-
-We can then open up the `SystemML-PySpark-Recommendation-Demo` notebook:
-
-![Jupyter Notebook](img/spark-mlcontext-programming-guide/jupyter2.png "Jupyter Notebook")
-
-## Set up the notebook and download the data
-
-{% highlight python %}
-%load_ext autoreload
-%autoreload 2
-%matplotlib inline
-
-# Add SystemML PySpark API file.
-sc.addPyFile("https://raw.githubusercontent.com/apache/incubator-systemml/3d5f9b11741f6d6ecc6af7cbaa1069cde32be838/src/main/java/org/apache/sysml/api/python/SystemML.py")
-
-import numpy as np
-import matplotlib.pyplot as plt
-plt.rcParams['figure.figsize'] = (10, 6)
-{% endhighlight %}
-
-{% highlight python %}
-%%sh
-# Download dataset
-curl -O http://snap.stanford.edu/data/amazon0601.txt.gz
-gunzip amazon0601.txt.gz
-{% endhighlight %}
-
-## Use PySpark to load the data in as a Spark DataFrame
-
-{% highlight python %}
-# Load data
-import pyspark.sql.functions as F
-dataPath = "amazon0601.txt"
-
-X_train = (sc.textFile(dataPath)
-    .filter(lambda l: not l.startswith("#"))
-    .map(lambda l: l.split("\t"))
-    .map(lambda prods: (int(prods[0]), int(prods[1]), 1.0))
-    .toDF(("prod_i", "prod_j", "x_ij"))
-    .filter("prod_i < 500 AND prod_j < 500") # Filter for memory constraints
-    .cache())
-
-max_prod_i = X_train.select(F.max("prod_i")).first()[0]
-max_prod_j = X_train.select(F.max("prod_j")).first()[0]
-numProducts = max(max_prod_i, max_prod_j) + 1 # 0-based indexing
-print("Total number of products: {}".format(numProducts))
-{% endhighlight %}
-
-## Create a SystemML MLContext object
-
-{% highlight python %}
-# Create SystemML MLContext
-from SystemML import MLContext
-ml = MLContext(sc)
-{% endhighlight %}
-
-## Define a kernel for Poisson nonnegative matrix factorization (PNMF) in DML
-
-{% highlight python %}
-# Define PNMF kernel in SystemML's DSL using the R-like syntax for PNMF
-pnmf = """
-# data & args
-X = read($X)
-X = X+1 # change product IDs to be 1-based, rather than 0-based
-V = table(X[,1], X[,2])
-size = ifdef($size, -1)
-if(size > -1) {
-    V = V[1:size,1:size]
-}
-max_iteration = as.integer($maxiter)
-rank = as.integer($rank)
-
-n = nrow(V)
-m = ncol(V)
-range = 0.01
-W = Rand(rows=n, cols=rank, min=0, max=range, pdf="uniform")
-H = Rand(rows=rank, cols=m, min=0, max=range, pdf="uniform")
-losses = matrix(0, rows=max_iteration, cols=1)
-
-# run PNMF
-i=1
-while(i <= max_iteration) {
-  # update params
-  H = (H * (t(W) %*% (V/(W%*%H))))/t(colSums(W))
-  W = (W * ((V/(W%*%H)) %*% t(H)))/t(rowSums(H))
-
-  # compute loss
-  losses[i,] = -1 * (sum(V*log(W%*%H)) - as.scalar(colSums(W)%*%rowSums(H)))
-  i = i + 1;
-}
-
-# write outputs
-write(losses, $lossout)
-write(W, $Wout)
-write(H, $Hout)
-"""
-{% endhighlight %}
-
-## Execute the algorithm
-
-{% highlight python %}
-# Run the PNMF script on SystemML with Spark
-ml.reset()
-outputs = ml.executeScript(pnmf, {"X": X_train, "maxiter": 100, "rank": 10}, ["W", "H", "losses"])
-{% endhighlight %}
-
-## Retrieve the losses during training and plot them
-
-{% highlight python %}
-# Plot training loss over time
-losses = outputs.getDF(spark.sqlContext, "losses")
-xy = losses.sort(losses.ID).map(lambda r: (r[0], r[1])).collect()
-x, y = zip(*xy)
-plt.plot(x, y)
-plt.xlabel('Iteration')
-plt.ylabel('Loss')
-plt.title('PNMF Training Loss')
-{% endhighlight %}
-
-![Jupyter Loss Graph](img/spark-mlcontext-programming-guide/jupyter_loss_graph.png "Jupyter Loss Graph")
-
----
-
 # Recommended Spark Configuration Settings
 
 For best performance, we recommend setting the following flags when running SystemML with Spark:



[43/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1393] Exclude alg ref and lang ref dirs from doc site

Posted by de...@apache.org.
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/GLM.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/GLM.tex b/Algorithms Reference/GLM.tex
deleted file mode 100644
index 8555a5b..0000000
--- a/Algorithms Reference/GLM.tex	
+++ /dev/null
@@ -1,431 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Generalized Linear Models (GLM)}
-\label{sec:GLM}
-
-\noindent{\bf Description}
-\smallskip
-
-Generalized Linear Models~\cite{Gill2000:GLM,McCullagh1989:GLM,Nelder1972:GLM}
-extend the methodology of linear and logistic regression to a variety of
-distributions commonly assumed as noise effects in the response variable.
-As before, we are given a collection
-of records $(x_1, y_1)$, \ldots, $(x_n, y_n)$ where $x_i$ is a numerical vector of
-explanatory (feature) variables of size~\mbox{$\dim x_i = m$}, and $y_i$ is the
-response (dependent) variable observed for this vector.  GLMs assume that some
-linear combination of the features in~$x_i$ determines the \emph{mean}~$\mu_i$
-of~$y_i$, while the observed $y_i$ is a random outcome of a noise distribution
-$\Prob[y\mid \mu_i]\,$\footnote{$\Prob[y\mid \mu_i]$ is given by a density function
-if $y$ is continuous.}
-with that mean~$\mu_i$:
-\begin{equation*}
-x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j} 
-\,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim \Prob[y\mid \mu_i]
-\end{equation*}
-
-In linear regression the response mean $\mu_i$ \emph{equals} some linear combination
-over~$x_i$, denoted above by~$\eta_i$.
-In logistic regression with $y\in\{0, 1\}$ (Bernoulli) the mean of~$y$ is the same
-as $\Prob[y=1]$ and equals $1/(1+e^{-\eta_i})$, the logistic function of~$\eta_i$.
-In GLM, $\mu_i$ and $\eta_i$ can be related via any given smooth monotone function
-called the \emph{link function}: $\eta_i = g(\mu_i)$.  The unknown linear combination
-parameters $\beta_j$ are assumed to be the same for all records.
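-For example, in Poisson regression with the canonical log link (see Table~\ref{table:commonGLMs})
-we have $\eta_i = g(\mu_i) = \log\mu_i$, hence $\mu_i = e^{\eta_i}$ and
-$y_i \sim \mathrm{Poisson}\big(e^{\eta_i}\big)$, whose variance equals its mean~$\mu_i$.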
-
-The goal of the regression is to estimate the parameters~$\beta_j$ from the observed
-data.  Once the~$\beta_j$'s are accurately estimated, we can make predictions
-about~$y$ for a new feature vector~$x$.  To do so, compute $\eta$ from~$x$ and use
-the inverted link function $\mu = g^{-1}(\eta)$ to compute the mean $\mu$ of~$y$;
-then use the distribution $\Prob[y\mid \mu]$ to make predictions about~$y$.
-Both $g(\mu)$ and $\Prob[y\mid \mu]$ are user-provided.  Our GLM script supports
-a standard set of distributions and link functions, see below for details.
-
-\smallskip
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}GLM.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} Y=}path/file
-{\tt{} B=}path/file
-{\tt{} fmt=}format
-{\tt{} O=}path/file
-{\tt{} Log=}path/file
-{\tt{} dfam=}int
-{\tt{} vpow=}double
-{\tt{} link=}int
-{\tt{} lpow=}double
-{\tt{} yneg=}double
-{\tt{} icpt=}int
-{\tt{} reg=}double
-{\tt{} tol=}double
-{\tt{} disp=}double
-{\tt{} moi=}int
-{\tt{} mii=}int
-
-}
-
-\smallskip
-\noindent{\bf Arguments}
-\begin{Description}
-\item[{\tt X}:]
-Location (on HDFS) to read the matrix of feature vectors; each row constitutes
-an example.
-\item[{\tt Y}:]
-Location to read the response matrix, which may have 1 or 2 columns
-\item[{\tt B}:]
-Location to store the estimated regression parameters (the $\beta_j$'s), with the
-intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available
-\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-see read/write functions in SystemML Language Reference for details.
-\item[{\tt O}:] (default:\mbox{ }{\tt " "})
-Location to write certain summary statistics described in Table~\ref{table:GLM:stats};
-by default it is standard output.
-\item[{\tt Log}:] (default:\mbox{ }{\tt " "})
-Location to store iteration-specific variables for monitoring and debugging purposes;
-see Table~\ref{table:GLM:log} for details.
-\item[{\tt dfam}:] (default:\mbox{ }{\tt 1})
-Distribution family code to specify $\Prob[y\mid \mu]$, see Table~\ref{table:commonGLMs}:\\
-{\tt 1} = power distributions with $\Var(y) = \mu^{\alpha}$;
-{\tt 2} = binomial or Bernoulli
-\item[{\tt vpow}:] (default:\mbox{ }{\tt 0.0})
-When {\tt dfam=1}, this provides the~$q$ in $\Var(y) = a\mu^q$, the power
-dependence of the variance of~$y$ on its mean.  In particular, use:\\
-{\tt 0.0} = Gaussian,
-{\tt 1.0} = Poisson,
-{\tt 2.0} = Gamma,
-{\tt 3.0} = inverse Gaussian
-\item[{\tt link}:] (default:\mbox{ }{\tt 0})
-Link function code to determine the link function~$\eta = g(\mu)$:\\
-{\tt 0} = canonical link (depends on the distribution family), see Table~\ref{table:commonGLMs};\\
-{\tt 1} = power functions,
-{\tt 2} = logit,
-{\tt 3} = probit,
-{\tt 4} = cloglog,
-{\tt 5} = cauchit
-\item[{\tt lpow}:] (default:\mbox{ }{\tt 1.0})
-When {\tt link=1}, this provides the~$s$ in $\eta = \mu^s$, the power link
-function; {\tt lpow=0.0} gives the log link $\eta = \log\mu$.  Common power links:\\
-{\tt -2.0} = $1/\mu^2$,
-{\tt -1.0} = reciprocal,
-{\tt 0.0} = log,
-{\tt 0.5} = sqrt,
-{\tt 1.0} = identity
-\item[{\tt yneg}:] (default:\mbox{ }{\tt 0.0})
-When {\tt dfam=2} and the response matrix $Y$ has 1~column,
-this specifies the $y$-value used for Bernoulli ``No'' label.
-All other $y$-values are treated as the ``Yes'' label.
-For example, {\tt yneg=-1.0} may be used when $y\in\{-1, 1\}$;
-either {\tt yneg=1.0} or {\tt yneg=2.0} may be used when $y\in\{1, 2\}$.
-\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
-Intercept and shifting/rescaling of the features in~$X$:\\
-{\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\
-{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
-{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
-\item[{\tt reg}:] (default:\mbox{ }{\tt 0.0})
-L2-regularization parameter (lambda)
-\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
-Tolerance (epsilon) used in the convergence criterion: we terminate the outer iterations
-when the deviance changes by less than this factor; see below for details
-\item[{\tt disp}:] (default:\mbox{ }{\tt 0.0})
-Dispersion parameter, or {\tt 0.0} to estimate it from data
-\item[{\tt moi}:] (default:\mbox{ }{\tt 200})
-Maximum number of outer (Fisher scoring) iterations
-\item[{\tt mii}:] (default:\mbox{ }{\tt 0})
-Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum
-limit provided
-\end{Description}
-
-
-\begin{table}[t]\small\centerline{%
-\begin{tabular}{|ll|}
-\hline
-Name & Meaning \\
-\hline
-{\tt TERMINATION\_CODE}  & A positive integer indicating success/failure as follows: \\
-                         & $1 = {}$Converged successfully;
-                           $2 = {}$Maximum \# of iterations reached; \\
-                         & $3 = {}$Input ({\tt X}, {\tt Y}) out of range;
-                           $4 = {}$Distribution/link not supported \\
-{\tt BETA\_MIN}          & Smallest beta value (regression coefficient), excluding the intercept \\
-{\tt BETA\_MIN\_INDEX}   & Column index for the smallest beta value \\
-{\tt BETA\_MAX}          & Largest beta value (regression coefficient), excluding the intercept \\
-{\tt BETA\_MAX\_INDEX}   & Column index for the largest beta value \\
-{\tt INTERCEPT}          & Intercept value, or NaN if there is no intercept (if {\tt icpt=0}) \\
-{\tt DISPERSION}         & Dispersion used to scale deviance, provided in {\tt disp} input argument \\
-                         & or estimated (same as {\tt DISPERSION\_EST}) if {\tt disp} argument is${} \leq 0$ \\
-{\tt DISPERSION\_EST}    & Dispersion estimated from the dataset \\
-{\tt DEVIANCE\_UNSCALED} & Deviance from the saturated model, assuming dispersion${} = 1.0$ \\
-{\tt DEVIANCE\_SCALED}   & Deviance from the saturated model, scaled by {\tt DISPERSION} value \\
-\hline
-\end{tabular}}
-\caption{Besides~$\beta$, GLM regression script computes a few summary statistics listed above.
-They are provided in CSV format, one comma-separated name-value pair per each line.}
-\label{table:GLM:stats}
-\end{table}
-
-
-
-
-
-
-\begin{table}[t]\small\centerline{%
-\begin{tabular}{|ll|}
-\hline
-Name & Meaning \\
-\hline
-{\tt NUM\_CG\_ITERS}     & Number of inner (Conj.\ Gradient) iterations in this outer iteration \\
-{\tt IS\_TRUST\_REACHED} & $1 = {}$trust region boundary was reached, $0 = {}$otherwise \\
-{\tt POINT\_STEP\_NORM}  & L2-norm of iteration step from old point ($\beta$-vector) to new point \\
-{\tt OBJECTIVE}          & The loss function we minimize (negative partial log-likelihood) \\
-{\tt OBJ\_DROP\_REAL}    & Reduction in the objective during this iteration, actual value \\
-{\tt OBJ\_DROP\_PRED}    & Reduction in the objective predicted by a quadratic approximation \\
-{\tt OBJ\_DROP\_RATIO}   & Actual-to-predicted reduction ratio, used to update the trust region \\
-{\tt GRADIENT\_NORM}     & L2-norm of the loss function gradient (omitted if point is rejected) \\
-{\tt LINEAR\_TERM\_MIN}  & The minimum value of $X \pxp \beta$, used to check for overflows \\
-{\tt LINEAR\_TERM\_MAX}  & The maximum value of $X \pxp \beta$, used to check for overflows \\
-{\tt IS\_POINT\_UPDATED} & $1 = {}$new point accepted; $0 = {}$new point rejected, old point restored \\
-{\tt TRUST\_DELTA}       & Updated trust region size, the ``delta'' \\
-\hline
-\end{tabular}}
-\caption{
-The {\tt Log} file for GLM regression contains the above \mbox{per-}iteration
-variables in CSV format, each line containing triple (Name, Iteration\#, Value) with Iteration\#
-being~0 for initial values.}
-\label{table:GLM:log}
-\end{table}
-
-\begin{table}[t]\hfil
-\begin{tabular}{|ccccccc|}
-\hline
-\multicolumn{4}{|c}{INPUT PARAMETERS}              & Distribution  & Link      & Cano- \\
-{\tt dfam} & {\tt vpow} & {\tt link} & {\tt\ lpow} & family        & function  & nical?\\
-\hline
-{\tt 1}    & {\tt 0.0}  & {\tt 1}    & {\tt -1.0}  & Gaussian      & inverse   &       \\
-{\tt 1}    & {\tt 0.0}  & {\tt 1}    & {\tt\ 0.0}  & Gaussian      & log       &       \\
-{\tt 1}    & {\tt 0.0}  & {\tt 1}    & {\tt\ 1.0}  & Gaussian      & identity  & Yes   \\
-{\tt 1}    & {\tt 1.0}  & {\tt 1}    & {\tt\ 0.0}  & Poisson       & log       & Yes   \\
-{\tt 1}    & {\tt 1.0}  & {\tt 1}    & {\tt\ 0.5}  & Poisson       & sq.root   &       \\
-{\tt 1}    & {\tt 1.0}  & {\tt 1}    & {\tt\ 1.0}  & Poisson       & identity  &       \\
-{\tt 1}    & {\tt 2.0}  & {\tt 1}    & {\tt -1.0}  & Gamma         & inverse   & Yes   \\
-{\tt 1}    & {\tt 2.0}  & {\tt 1}    & {\tt\ 0.0}  & Gamma         & log       &       \\
-{\tt 1}    & {\tt 2.0}  & {\tt 1}    & {\tt\ 1.0}  & Gamma         & identity  &       \\
-{\tt 1}    & {\tt 3.0}  & {\tt 1}    & {\tt -2.0}  & Inverse Gauss & $1/\mu^2$ & Yes   \\
-{\tt 1}    & {\tt 3.0}  & {\tt 1}    & {\tt -1.0}  & Inverse Gauss & inverse   &       \\
-{\tt 1}    & {\tt 3.0}  & {\tt 1}    & {\tt\ 0.0}  & Inverse Gauss & log       &       \\
-{\tt 1}    & {\tt 3.0}  & {\tt 1}    & {\tt\ 1.0}  & Inverse Gauss & identity  &       \\
-\hline
-{\tt 2}    & {\tt  *}   & {\tt 1}    & {\tt\ 0.0}  & Binomial      & log       &       \\
-{\tt 2}    & {\tt  *}   & {\tt 1}    & {\tt\ 0.5}  & Binomial      & sq.root   &       \\
-{\tt 2}    & {\tt  *}   & {\tt 2}    & {\tt\  *}   & Binomial      & logit     & Yes   \\
-{\tt 2}    & {\tt  *}   & {\tt 3}    & {\tt\  *}   & Binomial      & probit    &       \\
-{\tt 2}    & {\tt  *}   & {\tt 4}    & {\tt\  *}   & Binomial      & cloglog   &       \\
-{\tt 2}    & {\tt  *}   & {\tt 5}    & {\tt\  *}   & Binomial      & cauchit   &       \\
-\hline
-\end{tabular}\hfil
-\caption{Common GLM distribution families and link functions.
-(Here ``{\tt *}'' stands for ``any value.'')}
-\label{table:commonGLMs}
-\end{table}
-
-\noindent{\bf Details}
-\smallskip
-
-In GLM, the noise distribution $\Prob[y\mid \mu]$ of the response variable~$y$
-given its mean~$\mu$ is restricted to have the \emph{exponential family} form
-\begin{equation}
-Y \sim\, \Prob[y\mid \mu] \,=\, \exp\left(\frac{y\theta - b(\theta)}{a}
-+ c(y, a)\right),\,\,\textrm{where}\,\,\,\mu = \E(Y) = b'(\theta).
-\label{eqn:GLM}
-\end{equation}
-Changing the mean in such a distribution simply multiplies all \mbox{$\Prob[y\mid \mu]$}
-by~$e^{\,y\hspace{0.2pt}\theta/a}$ and rescales them so that they again integrate to~1.
-Parameter $\theta$ is called \emph{canonical}, and the function $\theta = b'^{\,-1}(\mu)$
-that relates it to the mean is called the~\emph{canonical link}; constant~$a$ is called
-\emph{dispersion} and rescales the variance of~$y$.  Many common distributions can be put
-into this form, see Table~\ref{table:commonGLMs}.  The canonical parameter~$\theta$
-is often chosen to coincide with~$\eta$, the linear combination of the regression features;
-other choices for~$\eta$ are possible too.
-
-Rather than specifying the canonical link, GLM distributions are commonly defined
-by their variance $\Var(y)$ as the function of the mean~$\mu$.  It can be shown
-from Eq.~(\ref{eqn:GLM}) that $\Var(y) = a\,b''(\theta) = a\,b''(b'^{\,-1}(\mu))$.
-For example, for the Bernoulli distribution $\Var(y) = \mu(1-\mu)$, for the Poisson
-distribution \mbox{$\Var(y) = \mu$}, and for the Gaussian distribution
-$\Var(y) = a\cdot 1 = \sigma^2$.
-It turns out that for many common distributions $\Var(y) = a\mu^q$, a power function.
-We support all distributions where $\Var(y) = a\mu^q$, as well as the Bernoulli and
-the binomial distributions.
-
-For distributions with $\Var(y) = a\mu^q$ the canonical link is also a power function,
-namely $\theta = \mu^{1-q}/(1-q)$, except for the Poisson ($q = 1$) whose canonical link is
-$\theta = \log\mu$.  We support all power link functions in the form $\eta = \mu^s$,
-dropping any constant factor, with $\eta = \log\mu$ for $s=0$.  The binomial distribution
-has its own family of link functions, which includes logit (the canonical link),
-probit, cloglog, and cauchit (see Table~\ref{table:binomial_links}); we support these
-only for the binomial and Bernoulli distributions.  Links and distributions are specified
-via four input parameters: {\tt dfam}, {\tt vpow}, {\tt link}, and {\tt lpow} (see
-Table~\ref{table:commonGLMs}).
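-For instance, {\tt dfam=1 vpow=2.0 link=1 lpow=-1.0} selects the Gamma family with its
-canonical inverse link: with $q = 2$, the formula $\theta = \mu^{1-q}/(1-q)$ reduces to
-$\theta = -1/\mu$, i.e.\ the reciprocal link up to a constant factor.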
-
-\begin{table}[t]\hfil
-\begin{tabular}{|cc|cc|}
-\hline
-Name & Link function & Name & Link function \\
-\hline
-Logit   & $\displaystyle \eta = \log\big(\mu / (1 - \mu)\big)^{\mathstrut}$ &
-Cloglog & $\displaystyle \eta = \log \big(\!- \log(1 - \mu)\big)^{\mathstrut}$ \\
-Probit  & $\displaystyle \mu  = \frac{1}{\sqrt{2\pi}}\int\nolimits_{-\infty_{\mathstrut}}^{\,\eta\mathstrut}
-          \!\!\!\!\! e^{-\frac{t^2}{2}} dt$ & 
-Cauchit & $\displaystyle \eta = \tan\pi(\mu - 1/2)$ \\
-\hline
-\end{tabular}\hfil
-\caption{The supported non-power link functions for the Bernoulli and the binomial
-distributions.  (Here $\mu$~is the Bernoulli mean.)}
-\label{table:binomial_links}
-\end{table}
-
-The observed response values are provided to the regression script as matrix~$Y$
-having 1 or 2 columns.  If a power distribution family is selected ({\tt dfam=1}),
-matrix $Y$ must have 1~column that provides $y_i$ for each~$x_i$ in the corresponding
-row of matrix~$X$.  When {\tt dfam=2} and $Y$ has 1~column, we assume the Bernoulli
-distribution for $y_i\in\{y_{\mathrm{neg}}, y_{\mathrm{pos}}\}$ with $y_{\mathrm{neg}}$
-from the input parameter {\tt yneg} and with $y_{\mathrm{pos}} \neq y_{\mathrm{neg}}$.  
-When {\tt dfam=2} and $Y$ has 2~columns, we assume the
-binomial distribution; for each row~$i$ in~$X$, cells $Y[i, 1]$ and $Y[i, 2]$ provide
-the positive and the negative binomial counts respectively.  Internally we convert
-the 1-column Bernoulli into the 2-column binomial with 0-versus-1 counts.
-
-We estimate the regression parameters via L2-regularized negative log-likelihood
-minimization:
-\begin{equation*}
-f(\beta; X, Y) \,\,=\,\, -\sum\nolimits_{i=1}^n \big(y_i\theta_i - b(\theta_i)\big)
-\,+\,(\lambda/2) \sum\nolimits_{j=1}^m \beta_j^2\,\,\to\,\,\min
-\end{equation*}
-where $\theta_i$ and $b(\theta_i)$ are from~(\ref{eqn:GLM}); note that $a$
-and $c(y, a)$ are constant w.r.t.~$\beta$ and can be ignored here.
-The canonical parameter $\theta_i$ depends on both $\beta$ and~$x_i$:
-\begin{equation*}
-\theta_i \,\,=\,\, b'^{\,-1}(\mu_i) \,\,=\,\, b'^{\,-1}\big(g^{-1}(\eta_i)\big) \,\,=\,\,
-\big(b'^{\,-1}\circ g^{-1}\big)\left(\beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}\right)
-\end{equation*}
-The user-provided (via {\tt reg}) regularization coefficient $\lambda\geq 0$ can be used
-to mitigate overfitting and degeneracy in the data.  Note that the intercept is never
-regularized.
-
-Our iterative minimizer for $f(\beta; X, Y)$ uses the Fisher scoring approximation
-to the difference $\varDelta f(z; \beta) = f(\beta + z; X, Y) \,-\, f(\beta; X, Y)$,
-recomputed at each iteration:
-\begin{gather*}
-\varDelta f(z; \beta) \,\,\,\approx\,\,\, 1/2 \cdot z^T A z \,+\, G^T z,
-\,\,\,\,\textrm{where}\,\,\,\, A \,=\, X^T\!\diag(w) X \,+\, \lambda I\\
-\textrm{and}\,\,\,\,G \,=\, - X^T u \,+\, \lambda\beta,
-\,\,\,\textrm{with $n\,{\times}\,1$ vectors $w$ and $u$ given by}\\
-\forall\,i = 1\ldots n: \,\,\,\,
-w_i = \big[v(\mu_i)\,g'(\mu_i)^2\big]^{-1}
-\!\!\!\!\!\!,\,\,\,\,\,\,\,\,\,
-u_i = (y_i - \mu_i)\big[v(\mu_i)\,g'(\mu_i)\big]^{-1}
-\!\!\!\!\!\!.\,\,\,\,
-\end{gather*}
-Here $v(\mu_i)=\Var(y_i)/a$, the variance of $y_i$ as the function of the mean, and
-$g'(\mu_i) = d \eta_i/d \mu_i$ is the link function derivative.  The Fisher scoring
-approximation is minimized by trust-region conjugate gradient iterations (called the
-\emph{inner} iterations, with the Fisher scoring iterations as the \emph{outer}
-iterations), which approximately solve the following problem:
-\begin{equation*}
-1/2 \cdot z^T A z \,+\, G^T z \,\,\to\,\,\min\,\,\,\,\textrm{subject to}\,\,\,\,
-\|z\|_2 \leq \delta
-\end{equation*}
-The conjugate gradient algorithm closely follows Algorithm~7.2 on page~171
-of~\cite{Nocedal2006:Optimization}.
-The trust region size $\delta$ is initialized as $0.5\sqrt{m}\,/ \max\nolimits_i \|x_i\|_2$
-and updated as described in~\cite{Nocedal2006:Optimization}.
-The user can specify the maximum number of the outer and the inner iterations with
-input parameters {\tt moi} and {\tt mii}, respectively.  The Fisher scoring algorithm
-terminates successfully if $2|\varDelta f(z; \beta)| < (D_1(\beta) + 0.1)\hspace{0.5pt}\eps$
-where $\eps > 0$ is a tolerance supplied by the user via {\tt tol}, and $D_1(\beta)$ is
-the unit-dispersion deviance estimated as
-\begin{equation*}
-D_1(\beta) \,\,=\,\, 2 \cdot \big(\log\Prob[Y \mid \!
-\begin{smallmatrix}\textrm{saturated}\\\textrm{model}\end{smallmatrix}, a\,{=}\,1]
-\,\,-\,\,\log\Prob[Y \mid X, \beta, a\,{=}\,1]\,\big)
-\end{equation*}
-The deviance estimate is also produced as part of the output.  Once the Fisher scoring
-algorithm terminates, if requested by the user, we estimate the dispersion~$a$ from
-Eq.~\ref{eqn:GLM} using Pearson residuals
-\begin{equation}
-\hat{a} \,\,=\,\, \frac{1}{n-m}\cdot \sum_{i=1}^n \frac{(y_i - \mu_i)^2}{v(\mu_i)}
-\label{eqn:dispersion}
-\end{equation}
-and use it to adjust our deviance estimate: $D_{\hat{a}}(\beta) = D_1(\beta)/\hat{a}$.
-If input argument {\tt disp} is {\tt 0.0} we estimate $\hat{a}$, otherwise we use its
-value as~$a$.  Note that in~(\ref{eqn:dispersion}) $m$~counts the intercept
-($m \leftarrow m+1$) if it is present.
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-The estimated regression parameters (the $\hat{\beta}_j$'s) are populated into
-a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}''
-input argument.  What this matrix contains, and its size, depends on the input
-argument {\tt icpt}, which specifies the user's intercept and rescaling choice:
-\begin{Description}
-\item[{\tt icpt=0}:] No intercept, matrix~$B$ has size $m\,{\times}\,1$, with
-$B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to~$m = {}$ncol$(X)$.
-\item[{\tt icpt=1}:] There is intercept, but no shifting/rescaling of~$X$; matrix~$B$
-has size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from 1 to~$m$,
-and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept coefficient.
-\item[{\tt icpt=2}:] There is intercept, and the features in~$X$ are shifted to
-mean${} = 0$ and rescaled to variance${} = 1$; then there are two versions of
-the~$\hat{\beta}_j$'s, one for the original features and another for the
-shifted/rescaled features.  Now matrix~$B$ has size $(m\,{+}\,1) \times 2$, with
-$B[\cdot, 1]$ for the original features and $B[\cdot, 2]$ for the shifted/rescaled
-features, in the above format.  Note that $B[\cdot, 2]$ are iteratively estimated
-and $B[\cdot, 1]$ are obtained from $B[\cdot, 2]$ by complementary shifting and
-rescaling.
-\end{Description}
-Our script also estimates the dispersion $\hat{a}$ (or takes it from the user's input)
-and the deviances $D_1(\hat{\beta})$ and $D_{\hat{a}}(\hat{\beta})$, see
-Table~\ref{table:GLM:stats} for details.  A log file with variables monitoring
-progress through the iterations can also be made available, see Table~\ref{table:GLM:log}.
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-\hml -f GLM.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
-  B=/user/biadmin/B.mtx fmt=csv dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.01 tol=0.00000001
-  disp=1.0 moi=100 mii=10 O=/user/biadmin/stats.csv Log=/user/biadmin/log.csv
-
-}
-
-\smallskip
-\noindent{\bf See Also}
-\smallskip
-
-In case of binary classification problems, consider using L2-SVM or binary logistic
-regression; for multiclass classification, use multiclass~SVM or multinomial logistic
-regression.  For the special cases of linear regression and logistic regression, it
-may be more efficient to use the corresponding specialized scripts instead of~GLM.

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/GLMpredict.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/GLMpredict.tex b/Algorithms Reference/GLMpredict.tex
deleted file mode 100644
index ceb249d..0000000
--- a/Algorithms Reference/GLMpredict.tex	
+++ /dev/null
@@ -1,474 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Regression Scoring and Prediction}
-
-\noindent{\bf Description}
-\smallskip
-
-Script {\tt GLM-predict.dml} is intended to cover all linear model based regressions,
-including linear regression, binomial and multinomial logistic regression, and GLM
-regressions (Poisson, gamma, binomial with probit link~etc.).  Having just one scoring
-script for all these regressions simplifies maintenance and enhancement while ensuring
-compatible interpretations for output statistics.
-
-The script performs two functions, prediction and scoring.  To perform prediction,
-the script takes two matrix inputs: a collection of records $X$ (without the response
-attribute) and the estimated regression parameters~$B$, also known as~$\beta$.  
-To perform scoring, in addition to $X$ and~$B$, the script takes the matrix of actual
-response values~$Y$ that are compared to the predictions made with $X$ and~$B$.  Of course
-there are other, non-matrix, input arguments that specify the model and the output
-format, see below for the full list.
-
-We assume that our test/scoring dataset is given by $n\,{\times}\,m$-matrix $X$ of
-numerical feature vectors, where each row~$x_i$ represents one feature vector of one
-record; we have \mbox{$\dim x_i = m$}.  Each record also includes the response
-variable~$y_i$ that may be numerical, single-label categorical, or multi-label categorical.
-A single-label categorical $y_i$ is an integer category label, one label per record;
-a multi-label $y_i$ is a vector of integer counts, one count for each possible label,
-which represents multiple single-label events (observations) for the same~$x_i$.  Internally
-we convert single-label categoricals into multi-label categoricals by replacing each
-label~$l$ with an indicator vector~$(0,\ldots,0,1_l,0,\ldots,0)$.  In prediction-only
-tasks the actual $y_i$'s are not needed by the script, but they are needed for scoring.
-
-To perform prediction, the script matrix-multiplies $X$ and $B$, adding the intercept
-if available, then applies the inverse of the model's link function.  
-All GLMs assume that the linear combination of the features in~$x_i$ and the betas
-in~$B$ determines the means~$\mu_i$ of the~$y_i$'s (in numerical or multi-label
-categorical form) with $\dim \mu_i = \dim y_i$.  The observed $y_i$ is assumed to follow
-a specified GLM family distribution $\Prob[y\mid \mu_i]$ with mean(s)~$\mu_i$:
-\begin{equation*}
-x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j} 
-\,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim \Prob[y\mid \mu_i]
-\end{equation*}
-If $y_i$ is numerical, the predicted mean $\mu_i$ is a real number.  Then our script's
-output matrix $M$ is the $n\,{\times}\,1$-vector of these means~$\mu_i$.
-Note that $\mu_i$ predicts the mean of $y_i$, not the actual~$y_i$.  For example,
-in the Poisson distribution, the mean is usually fractional, but the actual~$y_i$ is
-always an integer.
-
-If $y_i$ is categorical, i.e.\ a vector of label counts for record~$i$, then $\mu_i$
-is a vector of non-negative real numbers, one number $\mu_{i,l}$ per label~$l$.
-In this case we divide the $\mu_{i,l}$ by their sum $\sum_l \mu_{i,l}$ to obtain
-predicted label probabilities~\mbox{$p_{i,l}\in [0, 1]$}.  The output matrix $M$ is
-the $n \times (k\,{+}\,1)$-matrix of these probabilities, where $n$ is the number of
-records and $k\,{+}\,1$ is the number of categories\footnote{We use $k+1$ because
-there are $k$ non-baseline categories and one baseline category, with regression
-parameters $B$ having $k$~columns.}.  Note again that we do not predict the labels
-themselves, nor their actual counts per record, but we predict the labels' probabilities. 
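-
-The following Python/NumPy sketch (illustrative only, not the {\tt GLM-predict.dml}
-implementation; the function names are hypothetical) shows the two prediction paths
-just described: the linear predictor followed by a caller-supplied inverse link for a
-numerical response, and row-wise normalization into label probabilities for the
-multinomial case.
-\begin{verbatim}
-import numpy as np
-
-def glm_predict_means(X, B, inv_link=np.exp):
-    # Numerical response (sketch): eta = X*beta (+ intercept), mu = g^{-1}(eta).
-    # B has m rows, or m+1 rows with the intercept in the last row.
-    n, m = X.shape
-    eta = X @ B[:m, 0] + (B[m, 0] if B.shape[0] == m + 1 else 0.0)
-    return inv_link(eta)    # e.g. identity for lpow=1.0, np.exp for the log link
-
-def multinomial_predict_probs(X, B):
-    # Multinomial logit (sketch): B has k columns, one per non-baseline category.
-    n, m = X.shape
-    eta = X @ B[:m, :] + (B[m, :] if B.shape[0] == m + 1 else 0.0)
-    scores = np.exp(np.column_stack([eta, np.zeros(n)]))  # baseline score = 1
-    return scores / scores.sum(axis=1, keepdims=True)     # rows sum to 1
-\end{verbatim}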
-
-Going from predicted probabilities to predicted labels, in the single-label categorical
-case, requires extra information such as the cost of false positive versus
-false negative errors.  For example, if there are 5 categories and we \emph{accurately}
-predicted their probabilities as $(0.1, 0.3, 0.15, 0.2, 0.25)$, just picking the
-highest-probability label would be wrong 70\% of the time, whereas picking the
-lowest-probability label might be right if, say, it represents a diagnosis of cancer
-or another rare and serious outcome.  Hence, we keep this step outside the scope of
-{\tt GLM-predict.dml} for now.
-
-\smallskip
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}GLM-predict.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} Y=}path/file
-{\tt{} B=}path/file
-{\tt{} M=}path/file
-{\tt{} O=}path/file
-{\tt{} dfam=}int
-{\tt{} vpow=}double
-{\tt{} link=}int
-{\tt{} lpow=}double
-{\tt{} disp=}double
-{\tt{} fmt=}format
-
-}
-
-\smallskip
-\noindent{\bf Arguments}
-\begin{Description}
-\item[{\tt X}:]
-Location (on HDFS) to read the $n\,{\times}\,m$-matrix $X$ of feature vectors, each row
-constitutes one feature vector (one record)
-\item[{\tt Y}:] (default:\mbox{ }{\tt " "})
-Location to read the response matrix $Y$ needed for scoring (but optional for prediction),
-with the following dimensions: \\
-    $n \:{\times}\: 1$: acceptable for all distributions ({\tt dfam=1} or {\tt 2} or {\tt 3}) \\
-    $n \:{\times}\: 2$: for binomial ({\tt dfam=2}) if given by (\#pos, \#neg) counts \\
-    $n \:{\times}\: k\,{+}\,1$: for multinomial ({\tt dfam=3}) if given by category counts
-\item[{\tt M}:] (default:\mbox{ }{\tt " "})
-Location to write, if requested, the matrix of predicted response means (for {\tt dfam=1}) or
-probabilities (for {\tt dfam=2} or {\tt 3}):\\
-    $n \:{\times}\: 1$: for power-type distributions ({\tt dfam=1}) \\
-    $n \:{\times}\: 2$: for binomial distribution ({\tt dfam=2}), col\#~2 is the ``No'' probability \\
-    $n \:{\times}\: k\,{+}\,1$: for multinomial logit ({\tt dfam=3}), col\#~$k\,{+}\,1$ is for the baseline
-\item[{\tt B}:]
-Location to read matrix $B$ of the \mbox{betas}, i.e.\ estimated GLM regression parameters,
-with the intercept at row\#~$m\,{+}\,1$ if available:\\
-    $\dim(B) \,=\, m \:{\times}\: k$: do not add intercept \\
-    $\dim(B) \,=\, (m\,{+}\,1) \:{\times}\: k$: add intercept as given by the last $B$-row \\
-    if $k > 1$, use only $B${\tt [, 1]} unless it is Multinomial Logit ({\tt dfam=3})
-\item[{\tt O}:] (default:\mbox{ }{\tt " "})
-Location to store the CSV-file with goodness-of-fit statistics defined in
-Table~\ref{table:GLMpred:stats}, the default is to print them to the standard output
-\item[{\tt dfam}:] (default:\mbox{ }{\tt 1})
-GLM distribution family code to specify the type of distribution $\Prob[y\,|\,\mu]$
-that we assume: \\
-{\tt 1} = power distributions with $\Var(y) = \mu^{\alpha}$, see Table~\ref{table:commonGLMs};\\
-{\tt 2} = binomial; 
-{\tt 3} = multinomial logit
-\item[{\tt vpow}:] (default:\mbox{ }{\tt 0.0})
-Power for variance defined as (mean)${}^{\textrm{power}}$ (ignored if {\tt dfam}$\,{\neq}\,1$):
-when {\tt dfam=1}, this provides the~$q$ in $\Var(y) = a\mu^q$, the power
-dependence of the variance of~$y$ on its mean.  In particular, use:\\
-{\tt 0.0} = Gaussian,
-{\tt 1.0} = Poisson,
-{\tt 2.0} = Gamma,
-{\tt 3.0} = inverse Gaussian
-\item[{\tt link}:] (default:\mbox{ }{\tt 0})
-Link function code to determine the link function~$\eta = g(\mu)$, ignored for
-multinomial logit ({\tt dfam=3}):\\
-{\tt 0} = canonical link (depends on the distribution family), see Table~\ref{table:commonGLMs};\\
-{\tt 1} = power functions,
-{\tt 2} = logit,
-{\tt 3} = probit,
-{\tt 4} = cloglog,
-{\tt 5} = cauchit
-\item[{\tt lpow}:] (default:\mbox{ }{\tt 1.0})
-Power for link function defined as (mean)${}^{\textrm{power}}$ (ignored if {\tt link}$\,{\neq}\,1$):
-when {\tt link=1}, this provides the~$s$ in $\eta = \mu^s$, the power link
-function; {\tt lpow=0.0} gives the log link $\eta = \log\mu$.  Common power links:\\
-{\tt -2.0} = $1/\mu^2$,
-{\tt -1.0} = reciprocal,
-{\tt 0.0} = log,
-{\tt 0.5} = sqrt,
-{\tt 1.0} = identity
-\item[{\tt disp}:] (default:\mbox{ }{\tt 1.0})
-Dispersion value, when available; must be positive
-\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-Matrix {\tt M} file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-see read/write functions in SystemML Language Reference for details.
-\end{Description}
-
-
-\begin{table}[t]\small\centerline{%
-\begin{tabular}{|lccl|}
-\hline
-Name & \hspace{-0.6em}CID\hspace{-0.5em} & \hspace{-0.3em}Disp?\hspace{-0.6em} & Meaning \\
-\hline
-{\tt LOGLHOOD\_Z}          &   & + & Log-likelihood $Z$-score (in st.\ dev.'s from the mean) \\
-{\tt LOGLHOOD\_Z\_PVAL}    &   & + & Log-likelihood $Z$-score p-value, two-sided \\
-{\tt PEARSON\_X2}          &   & + & Pearson residual $X^2$-statistic \\
-{\tt PEARSON\_X2\_BY\_DF}  &   & + & Pearson $X^2$ divided by degrees of freedom \\
-{\tt PEARSON\_X2\_PVAL}    &   & + & Pearson $X^2$ p-value \\
-{\tt DEVIANCE\_G2}         &   & + & Deviance from the saturated model $G^2$-statistic \\
-{\tt DEVIANCE\_G2\_BY\_DF} &   & + & Deviance $G^2$ divided by degrees of freedom \\
-{\tt DEVIANCE\_G2\_PVAL}   &   & + & Deviance $G^2$ p-value \\
-{\tt AVG\_TOT\_Y}          & + &   & $Y$-column average for an individual response value \\
-{\tt STDEV\_TOT\_Y}        & + &   & $Y$-column st.\ dev.\ for an individual response value \\
-{\tt AVG\_RES\_Y}          & + &   & $Y$-column residual average of $Y - \mathop{\mathrm{pred.}\,\mathrm{mean}}(Y|X)$ \\
-{\tt STDEV\_RES\_Y}        & + &   & $Y$-column residual st.\ dev.\ of $Y - \mathop{\mathrm{pred.}\,\mathrm{mean}}(Y|X)$ \\
-{\tt PRED\_STDEV\_RES}     & + & + & Model-predicted $Y$-column residual st.\ deviation\\
-{\tt PLAIN\_R2}            & + &   & Plain $R^2$ of $Y$-column residual with bias included \\
-{\tt ADJUSTED\_R2}         & + &   & Adjusted $R^2$ of $Y$-column residual w.\ bias included \\
-{\tt PLAIN\_R2\_NOBIAS}    & + &   & Plain $R^2$ of $Y$-column residual, bias subtracted \\
-{\tt ADJUSTED\_R2\_NOBIAS} & + &   & Adjusted $R^2$ of $Y$-column residual, bias subtracted \\
-\hline
-\end{tabular}}
-\caption{The above goodness-of-fit statistics are provided in CSV format, one per line, with four
-columns: (Name, [CID], [Disp?], Value).  The columns are: 
-``Name'' is the string identifier for the statistic, see the table;
-``CID'' is an optional integer value that specifies the $Y$-column index for \mbox{per-}column statistics
-(note that a bi-/multinomial one-column {\tt Y}-input is converted into multi-column);
-``Disp?'' is an optional Boolean value ({\tt TRUE} or {\tt FALSE}) that tells us
-whether or not scaling by the input dispersion parameter {\tt disp} has been applied to this
-statistic;
-``Value''  is the value of the statistic.}
-\label{table:GLMpred:stats}
-\end{table}
-
-\noindent{\bf Details}
-\smallskip
-
-The output matrix $M$ of predicted means (or probabilities) is computed by matrix-multiplying $X$
-with the first column of~$B$ or with the whole~$B$ in the multinomial case, adding the intercept
-if available (conceptually, appending an extra column of ones to~$X$); then applying the inverse
-of the model's link function.  The difference between ``means'' and ``probabilities'' in the
-categorical case becomes significant when there are ${\geq}\,2$ observations per record
-(with the multi-label records) or when the labels such as $-1$ and~$1$ are viewed and averaged
-as numerical response values (with the single-label records).  To avoid any \mbox{mix-up} or
-information loss, we separately return the predicted probability of each category label for each
-record.
-
-When the ``actual'' response values $Y$ are available, the summary statistics are computed
-and written out as described in Table~\ref{table:GLMpred:stats}.  Below we discuss each of
-these statistics in detail.  Note that in the categorical case (binomial and multinomial)
-$Y$ is internally represented as the matrix of observation counts for each label in each record,
-rather than just the label~ID for each record.  The input~$Y$ may already be a matrix of counts,
-in which case it is used as-is.  But if $Y$ is given as a vector of response labels, each
-response label is converted into an indicator vector $(0,\ldots,0,1_l,0,\ldots,0)$ where~$l$
-is the label~ID for this record.  All negative (e.g.~$-1$) or zero label~IDs are converted to
-the $1 + {}$maximum label~ID.  The largest label~ID is viewed as the ``baseline'' as explained
-in the section on Multinomial Logistic Regression.  We assume that there are $k\geq 1$
-non-baseline categories and one (last) baseline category.
-
-We also estimate residual variances for each response value, although we do not output them,
-but use them only inside the summary statistics, scaled and unscaled by the input dispersion
-parameter {\tt disp}, as described below.
-
-\smallskip
-{\tt LOGLHOOD\_Z} and {\tt LOGLHOOD\_Z\_PVAL} statistics measure how far the log-likelihood
-of~$Y$ deviates from its expected value according to the model.  The script implements them
-only for the binomial and the multinomial distributions, returning NaN for all other distributions.
-Pearson's~$X^2$ and deviance~$G^2$ often perform poorly with bi- and multinomial distributions
-due to low cell counts, hence we need this extra goodness-of-fit measure.  To compute these
-statistics, we use:
-\begin{Itemize}
-\item the $n\times (k\,{+}\,1)$-matrix~$Y$ of multi-label response counts, in which $y_{i,j}$
-is the number of times label~$j$ was observed in record~$i$;
-\item the model-estimated probability matrix~$P$ of the same dimensions that satisfies
-$\sum_{j=1}^{k+1} p_{i,j} = 1$ for all~$i=1,\ldots,n$ and where $p_{i,j}$ is the model
-probability of observing label~$j$ in record~$i$;
-\item the $n\,{\times}\,1$-vector $N$ where $N_i$ is the aggregated count of observations
-in record~$i$ (all $N_i = 1$ if each record has only one response label).
-\end{Itemize}
-We start by computing the multinomial log-likelihood of $Y$ given~$P$ and~$N$, as well as
-the expected log-likelihood given a random~$Y$ and the variance of this log-likelihood if
-$Y$ indeed follows the proposed distribution:
-\begin{align*}
-\ell (Y) \,\,&=\,\, \log \Prob[Y \,|\, P, N] \,\,=\,\, \sum_{i=1}^{n} \,\sum_{j=1}^{k+1}  \,y_{i,j}\log p_{i,j} \\
-\E_Y \ell (Y)  \,\,&=\,\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\mu_{i,j} \log p_{i,j} 
-    \,\,=\,\, \sum_{i=1}^{n}\, N_i \,\sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j} \\
-\Var_Y \ell (Y) \,&=\, \sum_{i=1}^{n} \,N_i \left(\sum_{j=1}^{k+1} \,p_{i,j} \big(\log p_{i,j}\big)^2
-    - \Bigg( \sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j}\Bigg) ^ {\!\!2\,} \right)
-\end{align*}
-Then we compute the $Z$-score as the difference between the actual and the expected
-log-likelihood $\ell(Y)$ divided by its expected standard deviation, and its two-sided
-p-value in the Normal distribution assumption ($\ell(Y)$ should approach normality due
-to the Central Limit Theorem):
-\begin{equation*}
-Z   \,=\, \frac {\ell(Y) - \E_Y \ell(Y)}{\sqrt{\Var_Y \ell(Y)}};\quad
-\mathop{\textrm{p-value}}(Z) \,=\, \Prob \Big[\,\big|\mathop{\textrm{Normal}}(0,1)\big| \, > \, |Z|\,\Big]
-\end{equation*}
-A low p-value would indicate ``underfitting'' if $Z\ll 0$ or ``overfitting'' if $Z\gg 0$.  Here
-``overfitting'' means that higher-probability labels occur more often than their probabilities
-suggest.
-
-We also apply the dispersion input ({\tt disp}) to compute the ``scaled'' version of the $Z$-score
-and its p-value.  Since $\ell(Y)$ is a linear function of~$Y$, multiplying the GLM-predicted
-variance of~$Y$ by {\tt disp} results in multiplying $\Var_Y \ell(Y)$ by the same {\tt disp}.  This, in turn,
-translates into dividing the $Z$-score by the square root of the dispersion:
-\begin{equation*}
-Z_{\texttt{disp}}  \,=\, \big(\ell(Y) \,-\, \E_Y \ell(Y)\big) \,\big/\, \sqrt{\texttt{disp}\cdot\Var_Y \ell(Y)}
-\,=\, Z / \sqrt{\texttt{disp}}
-\end{equation*}
-Finally, we recalculate the p-value with this new $Z$-score.
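-
-As a hedged illustration (not the {\tt GLM-predict.dml} code; SciPy is used only for the
-normal tail probability), the $Z$-score computation above can be sketched as follows.
-\begin{verbatim}
-import numpy as np
-from scipy.stats import norm
-
-def loglikelihood_zscore(Y, P, N, disp=1.0, eps=1e-12):
-    # Y : (n, k+1) matrix of label counts per record
-    # P : (n, k+1) model probabilities, each row sums to 1
-    # N : (n,) total observation counts per record (row sums of Y)
-    logP = np.log(P + eps)                       # guard against log(0)
-    ll      = np.sum(Y * logP)                   # actual log-likelihood l(Y)
-    mean_ll = np.sum(N * np.sum(P * logP, axis=1))
-    var_ll  = np.sum(N * (np.sum(P * logP ** 2, axis=1)
-                          - np.sum(P * logP, axis=1) ** 2))
-    z = (ll - mean_ll) / np.sqrt(disp * var_ll)  # disp = 1.0 gives the unscaled Z
-    return z, 2.0 * norm.sf(abs(z))              # two-sided p-value
-\end{verbatim}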
-
-\smallskip
-{\tt PEARSON\_X2}, {\tt PEARSON\_X2\_BY\_DF}, and {\tt PEARSON\_X2\_PVAL}:
-Pearson's residual $X^2$-statistic is a commonly used goodness-of-fit measure for linear models~\cite{McCullagh1989:GLM}.
-The idea is to measure how well the model-predicted means and variances match the actual behavior
-of response values.  For each record $i$, we estimate the mean $\mu_i$ and the variance $v_i$
-(or $\texttt{disp}\cdot v_i$) and use them to normalize the residual: 
-$r_i = (y_i - \mu_i) / \sqrt{v_i}$.  These normalized residuals are then squared, aggregated
-by summation, and tested against an appropriate $\chi^2$~distribution.  The computation of~$X^2$
-is slightly different for categorical data (bi- and multinomial) than it is for numerical data,
-since $y_i$ has multiple correlated dimensions~\cite{McCullagh1989:GLM}:
-\begin{equation*}
-X^2\,\textrm{(numer.)} \,=\,  \sum_{i=1}^{n}\, \frac{(y_i - \mu_i)^2}{v_i};\quad
-X^2\,\textrm{(categ.)} \,=\,  \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\frac{(y_{i,j} - N_i 
-\hspace{0.5pt} p_{i,j})^2}{N_i \hspace{0.5pt} p_{i,j}}
-\end{equation*}
-The number of degrees of freedom~\#d.f.\ for the $\chi^2$~distribution is $n - m$ for numerical data and
-$(n - m)k$ for categorical data, where $k = \mathop{\texttt{ncol}}(Y) - 1$.  Given the dispersion
-parameter {\tt disp}, the $X^2$ statistic is scaled by division: \mbox{$X^2_{\texttt{disp}} = X^2 / \texttt{disp}$}.
-If the dispersion is accurate, $X^2 / \texttt{disp}$ should be close to~\#d.f.  In fact, $X^2 / \textrm{\#d.f.}$
-over the \emph{training} data is the dispersion estimator used in our {\tt GLM.dml} script, 
-see~(\ref{eqn:dispersion}).  Here we provide $X^2 / \textrm{\#d.f.}$ and $X^2_{\texttt{disp}} / \textrm{\#d.f.}$
-as {\tt PEARSON\_X2\_BY\_DF} to enable dispersion comparison between the training data and
-the test data.
-
-NOTE: For categorical data, both Pearson's $X^2$ and the deviance $G^2$ are unreliable (i.e.\ do not
-approach the $\chi^2$~distribution) unless the predicted means of multi-label counts
-$\mu_{i,j} = N_i \hspace{0.5pt} p_{i,j}$ are fairly large: all ${\geq}\,1$ and 80\% are
-at least~$5$~\cite{Cochran1954:chisq}.  They should not be used for ``one label per record'' categoricals.
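-
-For illustration, a Python/NumPy sketch of both variants of Pearson's $X^2$
-(hypothetical helpers, not the script's code; SciPy supplies the $\chi^2$ tail
-probability) follows.
-\begin{verbatim}
-import numpy as np
-from scipy.stats import chi2
-
-def pearson_x2_numeric(y, mu, v, m, disp=1.0):
-    # Numerical response (sketch); v = model-estimated variances per record.
-    x2 = np.sum((y - mu) ** 2 / v) / disp
-    df = y.shape[0] - m
-    return x2, x2 / df, chi2.sf(x2, df)
-
-def pearson_x2_categorical(Y, P, N, m, disp=1.0):
-    # Bi-/multinomial counts (sketch); Y, P are (n, k+1), N = row sums of Y.
-    mu = N[:, None] * P
-    x2 = np.sum((Y - mu) ** 2 / mu) / disp
-    df = (Y.shape[0] - m) * (Y.shape[1] - 1)
-    return x2, x2 / df, chi2.sf(x2, df)
-\end{verbatim}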
-
-\smallskip
-{\tt DEVIANCE\_G2}, {\tt DEVIANCE\_G2\_BY\_DF}, and {\tt DEVIANCE\_G2\_PVAL}:
-Deviance $G^2$ is the log of the likelihood ratio between the ``saturated'' model and the
-linear model being tested for the given dataset, multiplied by two:
-\begin{equation}
-G^2 \,=\, 2 \,\log \frac{\Prob[Y \mid \textrm{saturated model}\hspace{0.5pt}]}%
-{\Prob[Y \mid \textrm{tested linear model}\hspace{0.5pt}]}
-\label{eqn:GLMpred:deviance}
-\end{equation}
-The ``saturated'' model sets the mean $\mu_i^{\mathrm{sat}}$ to equal~$y_i$ for every record
-(for categorical data, $p^{\mathrm{sat}}_{i,j} = y_{i,j} / N_i$), which represents the ``perfect fit.''
-For records with $y_{i,j} \in \{0, N_i\}$ or otherwise at a boundary, by continuity we set
-$0 \log 0 = 0$.  The GLM~likelihood functions defined in~(\ref{eqn:GLM}) become simplified
-in ratio~(\ref{eqn:GLMpred:deviance}) due to canceling out the term $c(y, a)$ since it is the same
-in both models.
-
-The log of a likelihood ratio between two nested models, times two, is known to approach
-a $\chi^2$ distribution as $n\to\infty$ if both models have fixed parameter spaces.  
-But this is not the case for the ``saturated'' model: it adds more parameters with each record.  
-In practice, however, $\chi^2$ distributions are used to compute the p-value of~$G^2$~\cite{McCullagh1989:GLM}.  
-The number of degrees of freedom~\#d.f.\ and the treatment of dispersion are the same as for
-Pearson's~$X^2$, see above.
-
-\Paragraph{Column-wise statistics.}  The rest of the statistics are computed separately
-for each column of~$Y$.  As explained above, $Y$~has two or more columns in bi- and multinomial case,
-either at input or after conversion.  Moreover, each $y_{i,j}$ in record~$i$ with $N_i \geq 2$ is
-counted as $N_i$ separate observations $y_{i,j,l}$ of 0 or~1 (where $l=1,\ldots,N_i$) with
-$y_{i,j}$~ones and $N_i-y_{i,j}$ zeros.
-For power distributions, including linear regression, $Y$~has only one column and all
-$N_i = 1$, so the statistics are computed for all~$Y$ with each record counted once.
-Below we denote $N = \sum_{i=1}^n N_i \,\geq n$.
-Here is the total average and the residual average (residual bias) of~$y_{i,j,l}$ for each $Y$-column:
-\begin{equation*}
-\texttt{AVG\_TOT\_Y}_j   \,=\, \frac{1}{N} \sum_{i=1}^n  y_{i,j}; \quad
-\texttt{AVG\_RES\_Y}_j   \,=\, \frac{1}{N} \sum_{i=1}^n \, (y_{i,j} - \mu_{i,j})
-\end{equation*}
-Dividing by~$N$ (rather than~$n$) gives the averages for~$y_{i,j,l}$ (rather than~$y_{i,j}$).
-The total variance, and the standard deviation, for individual observations~$y_{i,j,l}$ is
-estimated from the total variance for response values~$y_{i,j}$ using the independence assumption:
-$\Var y_{i,j} = \Var \sum_{l=1}^{N_i} y_{i,j,l} = \sum_{l=1}^{N_i} \Var y_{i,j,l}$.
-This allows us to estimate the sum of squares for~$y_{i,j,l}$ via the sum of squares for~$y_{i,j}$:
-\begin{equation*}
-\texttt{STDEV\_TOT\_Y}_j \,=\, 
-\Bigg[\frac{1}{N-1} \sum_{i=1}^n  \Big( y_{i,j} -  \frac{N_i}{N} \sum_{i'=1}^n  y_{i'\!,j}\Big)^2\Bigg]^{1/2}
-\end{equation*}
-Analogously, we estimate the standard deviation of the residual $y_{i,j,l} - \mu_{i,j,l}$:
-\begin{equation*}
-\texttt{STDEV\_RES\_Y}_j \,=\, 
-\Bigg[\frac{1}{N-m'} \,\sum_{i=1}^n  \Big( y_{i,j} - \mu_{i,j} -  \frac{N_i}{N} \sum_{i'=1}^n  (y_{i'\!,j} - \mu_{i'\!,j})\Big)^2\Bigg]^{1/2}
-\end{equation*}
-Here $m'=m$ if $m$ includes the intercept as a feature and $m'=m+1$ if it does not.
-The estimated standard deviations can be compared to the model-predicted residual standard deviation
-computed from the predicted means by the GLM variance formula and scaled by the dispersion:
-\begin{equation*}
-\texttt{PRED\_STDEV\_RES}_j \,=\, \Big[\frac{\texttt{disp}}{N} \, \sum_{i=1}^n \, v(\mu_{i,j})\Big]^{1/2}
-\end{equation*}
-We also compute the $R^2$ statistics for each column of~$Y$, see Table~\ref{table:GLMpred:R2} for details.
-We compute two versions of~$R^2$: in one version the residual sum-of-squares (RSS) includes any bias in
-the residual that might be present (due to the lack of, or inaccuracy in, the intercept); in the other
-version of~RSS the bias is subtracted by ``centering'' the residual.  In both cases we subtract the bias in the total
-sum-of-squares (in the denominator), and $m'$ equals $m$~with the intercept or $m+1$ without the intercept.
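-
-A NumPy sketch of these per-column statistics, following the formulas above and
-Table~\ref{table:GLMpred:R2}, is given below for illustration; the names are
-hypothetical, and the flag {\tt intercept\_in\_m} records whether the intercept is
-already counted in $m$ (i.e.\ whether $m'=m$).
-\begin{verbatim}
-import numpy as np
-
-def glm_column_stats(Y, MU, N, m, intercept_in_m=True):
-    # Y, MU : (n, c) observed values/counts and predicted means, per column j
-    # N     : (n,) per-record observation counts N_i (all ones for power families)
-    Ntot    = N.sum()
-    m_prime = m if intercept_in_m else m + 1
-    w       = (N / Ntot)[:, None]                       # weights N_i / N
-    res     = Y - MU
-    ss_tot  = np.sum((Y - w * Y.sum(axis=0)) ** 2, axis=0)
-    rss     = np.sum(res ** 2, axis=0)                  # bias included
-    rss_nb  = np.sum((res - w * res.sum(axis=0)) ** 2, axis=0)  # bias subtracted
-    return {
-        "AVG_TOT_Y":          Y.sum(axis=0) / Ntot,
-        "AVG_RES_Y":          res.sum(axis=0) / Ntot,
-        "STDEV_TOT_Y":        np.sqrt(ss_tot / (Ntot - 1)),
-        "STDEV_RES_Y":        np.sqrt(rss_nb / (Ntot - m_prime)),
-        "PLAIN_R2":           1 - rss / ss_tot,
-        "ADJUSTED_R2":        1 - (Ntot - 1) / (Ntot - m) * rss / ss_tot,
-        "PLAIN_R2_NOBIAS":    1 - rss_nb / ss_tot,
-        "ADJUSTED_R2_NOBIAS": 1 - (Ntot - 1) / (Ntot - m_prime) * rss_nb / ss_tot,
-    }
-\end{verbatim}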
-
-\begin{table}[t]\small\centerline{%
-\begin{tabular}{|c|c|}
-\multicolumn{2}{c}{$R^2$ where the residual sum-of-squares includes the bias contribution:} \\
-\hline
-\multicolumn{1}{|l|}{\tt PLAIN\_R2${}_j \,\,= {}$} & \multicolumn{1}{l|}{\tt ADJUSTED\_R2${}_j \,\,= {}$} \\
-$ \displaystyle 1 - 
-\frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}%
-{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n  \! y_{i'\!,j} \Big)^{\! 2}} $ & 
-$ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m}}  \, 
-\frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}%
-{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n  \! y_{i'\!,j} \Big)^{\! 2}} $ \\
-\hline
-\multicolumn{2}{c}{} \\
-\multicolumn{2}{c}{$R^2$ where the residual sum-of-squares is centered so that the bias is subtracted:} \\
-\hline
-\multicolumn{1}{|l|}{\tt PLAIN\_R2\_NOBIAS${}_j \,\,= {}$} & \multicolumn{1}{l|}{\tt ADJUSTED\_R2\_NOBIAS${}_j \,\,= {}$} \\
-$ \displaystyle 1 - 
-\frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\!
-    \sum\limits_{i'=1}^n  (y_{i'\!,j} \,{-}\, \mu_{i'\!,j}) \Big)^{\! 2}}%
-{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n  \! y_{i'\!,j} \Big)^{\! 2}} $ &
-$ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m'}} \, 
-\frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! 
-    \sum\limits_{i'=1}^n  (y_{i'\!,j} \,{-}\, \mu_{i'\!,j}) \Big)^{\! 2}}%
-{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n  \! y_{i'\!,j} \Big)^{\! 2}} $ \\
-\hline
-\end{tabular}}
-\caption{The $R^2$ statistics we compute in {\tt GLM-predict.dml}}
-\label{table:GLMpred:R2}
-\end{table}
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-The matrix of predicted means (if the response is numerical) or probabilities (if the response
-is categorical), see ``Description'' subsection above for more information.  Given {\tt Y}, we
-return some statistics in CSV format as described in Table~\ref{table:GLMpred:stats} and in the
-above text.
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-Note that in the examples below the value for ``{\tt disp}'' input argument
-is set arbitrarily.  The correct dispersion value should be computed from the training
-data during model estimation, or omitted if unknown (which sets it to~{\tt 1.0}).
-
-\smallskip\noindent
-Linear regression example:
-\par\hangindent=\parindent\noindent{\tt
-\hml -f GLM-predict.dml -nvargs dfam=1 vpow=0.0 link=1 lpow=1.0 disp=5.67
-  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
-  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
-
-}\smallskip\noindent
-Linear regression example, prediction only (no {\tt Y} given):
-\par\hangindent=\parindent\noindent{\tt
-\hml -f GLM-predict.dml -nvargs dfam=1 vpow=0.0 link=1 lpow=1.0
-  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
-
-}\smallskip\noindent
-Binomial logistic regression example:
-\par\hangindent=\parindent\noindent{\tt
-\hml -f GLM-predict.dml -nvargs dfam=2 link=2 disp=3.0004464
-  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
-  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
-
-}\smallskip\noindent
-Binomial probit regression example:
-\par\hangindent=\parindent\noindent{\tt
-\hml -f GLM-predict.dml -nvargs dfam=2 link=3 disp=3.0004464
-  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
-  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
-
-}\smallskip\noindent
-Multinomial logistic regression example:
-\par\hangindent=\parindent\noindent{\tt
-\hml -f GLM-predict.dml -nvargs dfam=3 
-  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
-  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
-
-}\smallskip\noindent
-Poisson regression with the log link example:
-\par\hangindent=\parindent\noindent{\tt
-\hml -f GLM-predict.dml -nvargs dfam=1 vpow=1.0 link=1 lpow=0.0 disp=3.45
-  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
-  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
-
-}\smallskip\noindent
-Gamma regression with the inverse (reciprocal) link example:
-\par\hangindent=\parindent\noindent{\tt
-\hml -f GLM-predict.dml -nvargs dfam=1 vpow=2.0 link=1 lpow=-1.0 disp=1.99118
-  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
-  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
-
-}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/KaplanMeier.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/KaplanMeier.tex b/Algorithms Reference/KaplanMeier.tex
deleted file mode 100644
index 6ea6fbc..0000000
--- a/Algorithms Reference/KaplanMeier.tex	
+++ /dev/null
@@ -1,289 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Kaplan-Meier Survival Analysis}
-\label{sec:kaplan-meier}
-
-\noindent{\bf Description}
-\smallskip
-
-
-Survival analysis examines the time needed for a particular event of interest to occur.
-In medical research, for example, the prototypical such event is the death of a patient, but the methodology can be applied to other application areas, e.g., the completion of a task by an individual in a psychological experiment or the failure of electrical components in engineering.
-The Kaplan-Meier (or product-limit) method is a simple non-parametric approach for estimating survival probabilities from both censored and uncensored survival times.\\
-
- 
-
-\smallskip
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}KM.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} TE=}path/file
-{\tt{} GI=}path/file
-{\tt{} SI=}path/file
-{\tt{} O=}path/file
-{\tt{} M=}path/file
-{\tt{} T=}path/file
-{\tt{} alpha=}double
-{\tt{} etype=}greenwood$\mid$peto
-{\tt{} ctype=}plain$\mid$log$\mid$log-log
-{\tt{} ttype=}none$\mid$log-rank$\mid$wilcoxon
-{\tt{} fmt=}format
-
-}
-
-\smallskip
-\noindent{\bf Arguments}
-\begin{Description}
-\item[{\tt X}:]
-Location (on HDFS) to read the input matrix of the survival data containing: 
-\begin{Itemize}
-	\item timestamps,
-	\item whether event occurred (1) or data is censored (0),
-	\item a number of factors (i.e., categorical features) for grouping and/or stratifying
-\end{Itemize}
-\item[{\tt TE}:]
-Location (on HDFS) to read the 1-column matrix $TE$ that contains the column indices of the input matrix $X$ corresponding to timestamps (first entry) and event information (second entry) 
-\item[{\tt GI}:]
-Location (on HDFS) to read the 1-column matrix $GI$ that contains the column indices of the input matrix $X$ corresponding to the factors (i.e., categorical features) to be used for grouping
-\item[{\tt SI}:]
-Location (on HDFS) to read the 1-column matrix $SI$ that contains the column indices of the input matrix $X$ corresponding to the factors (i.e., categorical features) to be used for stratifying
-\item[{\tt O}:]
-Location (on HDFS) to write the matrix containing the results of the Kaplan-Meier analysis $KM$
-\item[{\tt M}:]
-Location (on HDFS) to write matrix $M$ containing the following statistics: total number of events, the median, and its confidence intervals; if survival data for multiple groups and strata are provided, each row of $M$ contains the above statistics per group and stratum.
-\item[{\tt T}:]
-If survival data from multiple groups is available and {\tt ttype=log-rank} or {\tt ttype=wilcoxon}, location (on HDFS) to write the two matrices that contain the result of the (stratified) test for comparing these groups; see below for details.
-\item[{\tt alpha}:](default:\mbox{ }{\tt 0.05})
-Parameter to compute $100(1-\alpha)\%$ confidence intervals for the survivor function and its median 
-\item[{\tt etype}:](default:\mbox{ }{\tt "greenwood"})
-Parameter to specify the error type according to "greenwood" or "peto"
-\item[{\tt ctype}:](default:\mbox{ }{\tt "log"})
-Parameter to modify the confidence interval; "plain" keeps the lower and upper bound of the confidence interval unmodified, "log" corresponds to the log transformation, and "log-log" corresponds to the complementary log-log transformation
-\item[{\tt ttype}:](default:\mbox{ }{\tt "none"})
-If survival data for multiple groups is available, this parameter specifies which test to perform for comparing
-survival data across multiple groups: "none", "log-rank", or "wilcoxon"
-\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-see read/write functions in SystemML Language Reference for details.
-\end{Description}
-
-
-\noindent{\bf Details}
-\smallskip
-
-The Kaplan-Meier estimate is a non-parametric maximum likelihood estimate (MLE) of the survival function $S(t)$, i.e., the probability of survival from the time origin to a given future time. 
-As an illustration, suppose that there are $n$ individuals with observed survival times $t_1,t_2,\ldots,t_n$, out of which there are $r\leq n$ distinct death times $t_{(1)}\leq t_{(2)}\leq \ldots \leq t_{(r)}$; we may have $r<n$ because some of the observations are censored, in the sense that the end-point of interest has not been observed for those individuals, and because more than one individual may have the same survival time.
-Let $S(t_j)$ denote the probability of survival until time $t_j$, $d_j$ be the number of events at time $t_j$, and $n_j$ denote the number of individuals at risk (i.e., those who die at time $t_j$ or later).
-Assuming that the events occur independently, the Kaplan-Meier method estimates the probability of surviving beyond time $t$ as
-\begin{equation*}
-\hat{S}(t) = \prod_{j=1}^{k} \left( \frac{n_j-d_j}{n_j} \right),
-\end{equation*}   
-for $t_{(k)}\leq t<t_{(k+1)}$, $k=1,2,\ldots,r$, with $\hat{S}(t)=1$ for $t<t_{(1)}$ and $t_{(r+1)}=\infty$.
-Note that the value of $\hat{S}(t)$ is constant between times of event and therefore
-the estimate is a step function with jumps at observed event times.
-If there are no censored data this estimator would simply reduce to the empirical survivor function defined as $\frac{n_j}{n}$. Thus, the Kaplan-Meier estimate can be seen as the generalization of the empirical survivor function that handles censored observations.
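-
-As an illustrative sketch (not the {\tt KM.dml} implementation; input handling is
-simplified to one group and one stratum), the estimator can be written in Python/NumPy
-as follows.
-\begin{verbatim}
-import numpy as np
-
-def kaplan_meier(time, event):
-    # time  : observed survival or censoring times (1-D array)
-    # event : 1 if the event occurred, 0 if the observation is censored
-    time, event = np.asarray(time), np.asarray(event)
-    t_events = np.unique(time[event == 1])        # distinct death times t_(j)
-    s_hat, surv = 1.0, []
-    for t in t_events:
-        n_j = np.sum(time >= t)                   # individuals at risk at t_(j)
-        d_j = np.sum((time == t) & (event == 1))  # events at t_(j)
-        s_hat *= (n_j - d_j) / n_j
-        surv.append(s_hat)
-    return t_events, np.array(surv)               # step values S_hat(t_(j))
-\end{verbatim}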
-
-The methodology used in our {\tt KM.dml} script closely follows~\cite[Sec.~2]{collett2003:kaplanmeier}.
-For completeness we briefly discuss the equations used in our implementation.
-
-% standard error of the survivor function
-\textbf{Standard error of the survivor function.}
-The standard error of the estimated survivor function (controlled by parameter {\tt etype}) can be calculated as  
-\begin{equation*}
-\text{se} \{\hat{S}(t)\} \approx \hat{S}(t) {\bigg\{ \sum_{j=1}^{k} \frac{d_j}{n_j(n_j - d_j)}\bigg\}}^{1/2},
-\end{equation*}
-for $t_{(k)}\leq t<t_{(k+1)}$.
-This equation is known as the {\it Greenwood's} formula.
-An alternative approach is to apply {\it Peto's} expression %~\cite{PetoPABCHMMPS1979:kaplanmeier}
-\begin{equation*}
-\text{se}\{\hat{S}(t)\}=\frac{\hat{S}(t)\sqrt{1-\hat{S}(t)}}{\sqrt{n_k}},
-\end{equation*}
-for $t_{(k)}\leq t<t_{(k+1)}$. 
-%Note that this estimate is known to be conservative producing larger standard errors than they ought to be. The Greenwood estimate is therefore recommended for general use. 
-Once the standard error of $\hat{S}$ has been found, we compute the following types of confidence intervals (controlled by parameter {\tt ctype}):
-The ``plain'' $100(1-\alpha)\%$ confidence interval for $S(t)$ is computed using 
-\begin{equation*}
-\hat{S}(t)\pm z_{\alpha/2} \text{se}\{\hat{S}(t)\}, 
-\end{equation*} 
-where $z_{\alpha/2}$ is the upper $\alpha/2$-point of the standard normal distribution. 
-Alternatively, we can apply the ``log'' transformation using 
-\begin{equation*}
-\hat{S}(t)^{\exp[\pm z_{\alpha/2} \text{se}\{\hat{S}(t)\}/\hat{S}(t)]}
-\end{equation*}
-or the ``log-log'' transformation using 
-\begin{equation*}
-\hat{S}(t)^{\exp [\pm z_{\alpha/2} \text{se} \{\log [-\log \hat{S}(t)]\}]}.
-\end{equation*}
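-
-A hedged Python/NumPy sketch of the Greenwood standard error together with the
-``plain'' and ``log'' intervals (building on the {\tt kaplan\_meier} sketch above and
-using SciPy only for the normal quantile) is:
-\begin{verbatim}
-import numpy as np
-from scipy.stats import norm
-
-def greenwood_ci(t_events, surv, time, event, alpha=0.05, ctype="plain"):
-    # t_events, surv : output of the kaplan_meier sketch above
-    z = norm.ppf(1.0 - alpha / 2.0)
-    running, se, lower, upper = 0.0, [], [], []
-    for t, s in zip(t_events, surv):
-        n_j = np.sum(time >= t)
-        d_j = np.sum((time == t) & (event == 1))
-        running += d_j / (n_j * (n_j - d_j))      # sum inside Greenwood's formula
-        se_t = s * np.sqrt(running)               # (degenerate if n_j == d_j)
-        se.append(se_t)
-        if ctype == "plain":
-            lo, hi = s - z * se_t, s + z * se_t
-        else:                                     # "log" transformation
-            lo, hi = s ** np.exp(z * se_t / s), s ** np.exp(-z * se_t / s)
-        lower.append(lo)
-        upper.append(hi)
-    return np.array(se), np.array(lower), np.array(upper)
-\end{verbatim}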
-
-% standard error of the median of survival times
-\textbf{Median, its standard error and confidence interval.}
-Denote by $\hat{t}(50)$ the estimated median of $\hat{S}$, i.e.,
-$\hat{t}(50)=\min \{ t_i \mid \hat{S}(t_i) < 0.5\}$,
-where $t_i$ is the observed survival time for individual $i$.
-The standard error of $\hat{t}(50)$ is given by
-\begin{equation*}
-\text{se}\{ \hat{t}(50) \} = \frac{1}{\hat{f}\{\hat{t}(50)\}} \text{se}[\hat{S}\{ \hat{t}(50) \}],
-\end{equation*}
-where $\hat{f}\{ \hat{t}(50) \}$ can be found from
-\begin{equation*}
-\hat{f}\{ \hat{t}(50) \} = \frac{\hat{S}\{ \hat{u}(50) \} -\hat{S}\{ \hat{l}(50) \} }{\hat{l}(50) - \hat{u}(50)}. 
-\end{equation*}
-Above, $\hat{u}(50)$ is the largest survival time for which $\hat{S}$ exceeds $0.5+\epsilon$, i.e., $\hat{u}(50)=\max \bigl\{ t_{(j)} \mid \hat{S}(t_{(j)}) \geq 0.5+\epsilon \bigr\}$,
-and $\hat{l}(50)$ is the smallest survival time for which $\hat{S}$ is less than $0.5-\epsilon$,
-i.e., $\hat{l}(50)=\min \bigl\{ t_{(j)} \mid \hat{S}(t_{(j)}) \leq 0.5-\epsilon \bigr\}$,
-for small $\epsilon$.
-
-
-% comparing two or more groups of data
-\textbf{Log-rank test and Wilcoxon test.}
-Our implementation supports comparison of survival data from several groups using two non-parametric procedures (controlled by parameter {\tt ttype}): the {\it log-rank test} and the {\it Wilcoxon test} (also known as the {\it Breslow test}). 
-Assume that the survival times in $g\geq 2$ groups of survival data are to be compared. 
-Consider the {\it null hypothesis} that there is no difference in the survival times of the individuals in different groups. One way to examine the null hypothesis is to compare the observed numbers of deaths with the numbers expected under the null hypothesis.
-In both tests we define the $U$-statistics ($U_{L}$ for the log-rank test and $U_{W}$ for the Wilcoxon test) to compare the observed and the expected number of deaths in $1,2,\ldots,g-1$ groups as follows:
-\begin{align*}
-U_{Lk} &= \sum_{j=1}^{r}\left( d_{kj} - \frac{n_{kj}d_j}{n_j} \right), \\
-U_{Wk} &= \sum_{j=1}^{r}n_j\left( d_{kj} - \frac{n_{kj}d_j}{n_j} \right),
-\end{align*}
-where $d_{kj}$ is the number of deaths at time $t_{(j)}$ in group $k$,
-$n_{kj}$ is the number of individuals at risk at time $t_{(j)}$ in group $k$, and 
-$k=1,2,\ldots,g-1$ to form the vectors $U_L$ and $U_W$ with $(g-1)$ components.
-The covariance between $U_{Lk}$ and $U_{Lk'}$ (the variance when $k=k'$) is computed as
-\begin{equation*}
-V_{Lkk'}=\sum_{j=1}^{r} \frac{n_{kj}d_j(n_j-d_j)}{n_j(n_j-1)} \left( \delta_{kk'}-\frac{n_{k'j}}{n_j} \right),
-\end{equation*}
-for $k,k'=1,2,\ldots,g-1$, with
-\begin{equation*}
-\delta_{kk'} = 
-\begin{cases}
-1 & \text{if } k=k'\\
-0 & \text{otherwise.}
-\end{cases}
-\end{equation*}
-These terms are combined in a {\it variance-covariance} matrix $V_L$ (referred to as the $V$-statistic).
-Similarly, the variance-covariance matrix for the Wilcoxon test $V_W$ is a matrix where the entry at position $(k,k')$ is given by
-\begin{equation*}
-V_{Wkk'}=\sum_{j=1}^{r} n_j^2 \frac{n_{kj}d_j(n_j-d_j)}{n_j(n_j-1)} \left( \delta_{kk'}-\frac{n_{k'j}}{n_j} \right).
-\end{equation*}
-
-Under the null hypothesis of no group differences, the test statistics $U_L^\top V_L^{-1} U_L$ for the log-rank test and  $U_W^\top V_W^{-1} U_W$ for the Wilcoxon test have a Chi-squared distribution on $(g-1)$ degrees of freedom.
-Our {\tt KM.dml} script also provides a stratified version of the log-rank or Wilcoxon test if requested.
-In this case, the values of the $U$- and $V$- statistics are computed for each stratum and then combined over all strata.
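-
-For illustration only (an unstratified sketch, not the {\tt KM.dml} code), the $U$- and
-$V$-statistics and the resulting Chi-squared test can be computed as below; SciPy
-supplies the tail probability.
-\begin{verbatim}
-import numpy as np
-from scipy.stats import chi2
-
-def logrank_test(time, event, group, wilcoxon=False):
-    # Unstratified g-group test (sketch); the last group acts as the reference.
-    time, event, group = map(np.asarray, (time, event, group))
-    groups = np.unique(group)
-    g = len(groups)
-    U = np.zeros(g - 1)
-    V = np.zeros((g - 1, g - 1))
-    for t in np.unique(time[event == 1]):        # distinct death times t_(j)
-        at_risk = time >= t
-        died = (time == t) & (event == 1)
-        n_j, d_j = at_risk.sum(), died.sum()
-        if n_j < 2:
-            continue
-        n_kj = np.array([np.sum(at_risk & (group == grp)) for grp in groups[:-1]])
-        d_kj = np.array([np.sum(died & (group == grp)) for grp in groups[:-1]])
-        w = n_j if wilcoxon else 1.0             # Wilcoxon weights each time by n_j
-        U += w * (d_kj - n_kj * d_j / n_j)
-        c = d_j * (n_j - d_j) / (n_j * (n_j - 1.0))
-        V += w ** 2 * c * (np.diag(n_kj) - np.outer(n_kj, n_kj) / n_j)
-    stat = U @ np.linalg.solve(V, U)             # U' V^{-1} U
-    return stat, g - 1, chi2.sf(stat, g - 1)     # statistic, d.f., p-value
-\end{verbatim}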
-
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-  
-Below we list the results of the survival analysis computed by {\tt KM.dml}.
-The calculated statistics are stored in matrix $KM$ with the following schema:
-\begin{itemize}
-	\item Column 1: timestamps 
-	\item Column 2: number of individuals at risk
-	\item Column 3: number of events
-	\item Column 4: Kaplan-Meier estimate of the survivor function $\hat{S}$ 
-	\item Column 5: standard error of $\hat{S}$
-	\item Column 6: lower bound of $100(1-\alpha)\%$ confidence interval for $\hat{S}$
-	\item Column 7: upper bound of $100(1-\alpha)\%$ confidence interval for $\hat{S}$
-\end{itemize}
-Note that if survival data for multiple groups and/or strata is available, each collection of 7 columns in $KM$ stores the results per group and/or per stratum. 
-In this case $KM$ has $7g+7s$ columns, where $g\geq 1$ and $s\geq 1$ denote the number of groups and strata, respectively. 
-
-
-Additionally, {\tt KM.dml} stores the following statistics in the 1-row matrix $M$ whose number of columns depends on the number of groups ($g$) and strata ($s$) in the data. Below $k$ denotes the number of factors used for grouping and $l$ denotes the number of factors used for stratifying. 
-\begin{itemize}
-	\item Columns 1 to $k$: unique combination of values in the $k$ factors used for grouping 
-	\item Columns $k+1$ to $k+l$: unique combination of values in the $l$ factors used for stratifying  
-	\item Column $k+l+1$: total number of records 
-	\item Column $k+l+2$: total number of events
-    \item Column $k+l+3$: median of $\hat{S}$
-    \item Column $k+l+4$: lower bound of $100(1-\alpha)\%$ confidence interval for the median of $\hat{S}$
-    \item Column $k+l+5$: upper bound of $100(1-\alpha)\%$ confidence interval for the median of $\hat{S}$. 
-\end{itemize}
-If there is only 1 group and 1 stratum available, $M$ will be a 1-row matrix with 5 columns where
-\begin{itemize}
-	\item Column 1: total number of records
-	\item Column 2: total number of events
-	\item Column 3: median of $\hat{S}$
-	\item Column 4: lower bound of $100(1-\alpha)\%$ confidence interval for the median of $\hat{S}$
-	\item Column 5: upper bound of $100(1-\alpha)\%$ confidence interval for the median of $\hat{S}$.
-\end{itemize} 
-
-If a comparison of the survival data across multiple groups needs to be performed, {\tt KM.dml} computes two matrices $T$ and $T\_GROUPS\_OE$ that contain a summary of the test. The 1-row matrix $T$ stores the following statistics: 
-\begin{itemize}
-	\item Column 1: number of groups in the survival data
- 	\item Column 2: degrees of freedom for the Chi-squared distributed test statistic
-	\item Column 3: value of test statistic
-	\item Column 4: $P$-value.
-\end{itemize}
-Matrix $T\_GROUPS\_OE$ contains the following statistics for each of $g$ groups:
-\begin{itemize}
-	\item Column 1: number of events
-	\item Column 2: number of observed death times ($O$)
-	\item Column 3: number of expected death times ($E$)
-	\item Column 4: $(O-E)^2/E$
-	\item Column 5: $(O-E)^2/V$.
-\end{itemize}
-
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-	\hml -f KM.dml -nvargs X=/user/biadmin/X.mtx TE=/user/biadmin/TE
-	GI=/user/biadmin/GI SI=/user/biadmin/SI O=/user/biadmin/kaplan-meier.csv
-	M=/user/biadmin/model.csv alpha=0.01 etype=greenwood ctype=plain fmt=csv
-	
-}\smallskip
-
-{\hangindent=\parindent\noindent\tt
-	\hml -f KM.dml -nvargs X=/user/biadmin/X.mtx TE=/user/biadmin/TE
-	GI=/user/biadmin/GI SI=/user/biadmin/SI O=/user/biadmin/kaplan-meier.csv
-	M=/user/biadmin/model.csv T=/user/biadmin/test.csv alpha=0.01 etype=peto 
-	ctype=log ttype=log-rank fmt=csv
-	
-}
-
-%
-%\smallskip
-%\noindent{\bf References}
-%\begin{itemize}
-%	\item
-%	R.~Peto, M.C.~Pike, P.~Armitage, N.E.~Breslow, D.R.~Cox, S.V.~Howard, N.~Mantel, K.~McPherson, J.~Peto, and P.G.~Smith.
-%	\newblock Design and analysis of randomized clinical trials requiring prolonged observation of each patient.
-%	\newblock {\em British Journal of Cancer}, 35:1--39, 1979.
-%\end{itemize}
-
-%@book{collett2003:kaplanmeier,
-%	title={Modelling Survival Data in Medical Research, Second Edition},
-%	author={Collett, D.},
-%	isbn={9781584883258},
-%	lccn={2003040945},
-%	series={Chapman \& Hall/CRC Texts in Statistical Science},
-%	year={2003},
-%	publisher={Taylor \& Francis}
-%}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/Kmeans.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/Kmeans.tex b/Algorithms Reference/Kmeans.tex
deleted file mode 100644
index 2b5492c..0000000
--- a/Algorithms Reference/Kmeans.tex	
+++ /dev/null
@@ -1,371 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{K-Means Clustering}
-
-\noindent{\bf Description}
-\smallskip
-
-Given a collection of $n$ records with a pairwise similarity measure,
-the goal of clustering is to assign a category label to each record so that
-similar records tend to get the same label.  In contrast to multinomial
-logistic regression, clustering is an \emph{unsupervised}\/ learning problem
-with neither category assignments nor label interpretations given in advance.
-In $k$-means clustering, the records $x_1, x_2, \ldots, x_n$ are numerical
-feature vectors of $\dim x_i = m$ with the squared Euclidean distance 
-$\|x_i - x_{i'}\|_2^2$ as the similarity measure.  We want to partition
-$\{x_1, \ldots, x_n\}$ into $k$ clusters $\{S_1, \ldots, S_k\}$ so that
-the aggregated squared distance from records to their cluster means is
-minimized:
-\begin{equation}
-\textrm{WCSS}\,\,=\,\, \sum_{i=1}^n \,\big\|x_i - \mean(S_j: x_i\in S_j)\big\|_2^2 \,\,\to\,\,\min
-\label{eqn:WCSS}
-\end{equation}
-The aggregated distance measure in~(\ref{eqn:WCSS}) is called the
-\emph{within-cluster sum of squares}~(WCSS).  It can be viewed as a measure
-of residual variance that remains in the data after the clustering assignment,
-conceptually similar to the residual sum of squares~(RSS) in linear regression.
-However, unlike for the RSS, the minimization of~(\ref{eqn:WCSS}) is an NP-hard 
-problem~\cite{AloiseDHP2009:kmeans}.
-
-Rather than searching for the global optimum in~(\ref{eqn:WCSS}), a heuristic algorithm
-called Lloyd's algorithm is typically used.  This iterative algorithm maintains
-and updates a set of $k$~\emph{centroids} $\{c_1, \ldots, c_k\}$, one centroid per cluster.
-It defines each cluster $S_j$ as the set of all records closer to~$c_j$ than
-to any other centroid.  Each iteration of the algorithm reduces the WCSS in two steps:
-\begin{Enumerate}
-\item Assign each record to the closest centroid, making $\mean(S_j)\neq c_j$;
-\label{step:kmeans:recluster}
-\item Reset each centroid to its cluster's mean: $c_j := \mean(S_j)$.
-\label{step:kmeans:recenter}
-\end{Enumerate}
-After Step~\ref{step:kmeans:recluster} the centroids are generally different from the cluster
-means, so we can compute another ``within-cluster sum of squares'' based on the centroids:
-\begin{equation}
-\textrm{WCSS\_C}\,\,=\,\, \sum_{i=1}^n \,\big\|x_i - \mathop{\textrm{centroid}}(S_j: x_i\in S_j)\big\|_2^2
-\label{eqn:WCSS:C}
-\end{equation}
-This WCSS\_C after Step~\ref{step:kmeans:recluster} is less than the means-based WCSS
-before Step~\ref{step:kmeans:recluster} (or equal if convergence is achieved), and in
-Step~\ref{step:kmeans:recenter} the WCSS cannot exceed the WCSS\_C for \emph{the same}
-clustering; hence the WCSS reduction.
-
-Exact convergence is reached when each record becomes closer to its
-cluster's mean than to any other cluster's mean, so there are no more re-assignments
-and the centroids coincide with the means.  In practice, iterations may be stopped
-when the reduction in WCSS (or in WCSS\_C) falls below a minimum threshold, or upon
-reaching the maximum number of iterations.  The initialization of the centroids is also
-an important part of the algorithm.  The smallest WCSS obtained by the algorithm is not
-the global minimum and varies depending on the initial centroids.  We implement multiple
-parallel runs with different initial centroids and report the best result.
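-
-The following Python/NumPy sketch (illustrative; it uses a naive random initialization
-rather than the sampling-based {\tt k-means++} initialization described below) captures
-the multi-run Lloyd's algorithm and the WCSS-based stopping criterion.
-\begin{verbatim}
-import numpy as np
-
-def lloyd_kmeans(X, k, runs=10, max_iter=1000, tol=1e-6, seed=0):
-    # Several independent runs of Lloyd's algorithm; returns the best centroids/WCSS_C.
-    rng = np.random.default_rng(seed)
-    best_C, best_wcss = None, np.inf
-    for _ in range(runs):
-        C = X[rng.choice(X.shape[0], k, replace=False)]    # naive init (not k-means++)
-        wcss_old = np.inf
-        for _ in range(max_iter):
-            d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
-            labels = d2.argmin(axis=1)
-            wcss = d2[np.arange(X.shape[0]), labels].sum() # centroid-based WCSS_C
-            if wcss_old - wcss < tol * wcss:               # convergence criterion
-                break
-            wcss_old = wcss
-            # keep the old centroid if a cluster runs empty ("runaway" centroid)
-            C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
-                          for j in range(k)])
-        if wcss < best_wcss:
-            best_C, best_wcss = C, wcss
-    return best_C, best_wcss
-\end{verbatim}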
-
-\Paragraph{Scoring} 
-Our scoring script evaluates the clustering output by comparing it with a known category
-assignment.  Since cluster labels have no prior correspondence to the categories, we
-cannot count ``correct'' and ``wrong'' cluster assignments.  Instead, we quantify them in
-two ways:
-\begin{Enumerate}
-\item Count how many same-category and different-category pairs of records end up in the
-same cluster or in different clusters;
-\item For each category, count the prevalence of its most common cluster; for each
-cluster, count the prevalence of its most common category.
-\end{Enumerate}
-The number of categories and the number of clusters ($k$) do not have to be equal.  
-A same-category pair of records clustered into the same cluster is viewed as a
-``true positive,'' a different-category pair clustered together is a ``false positive,''
-a same-category pair clustered apart is a ``false negative''~etc.
-
-
-\smallskip
-\noindent{\bf Usage: K-means Script}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}Kmeans.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} C=}path/file
-{\tt{} k=}int
-{\tt{} runs=}int
-{\tt{} maxi=}int
-{\tt{} tol=}double
-{\tt{} samp=}int
-{\tt{} isY=}int
-{\tt{} Y=}path/file
-{\tt{} fmt=}format
-{\tt{} verb=}int
-
-}
-
-\smallskip
-\noindent{\bf Usage: K-means Scoring/Prediction}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}Kmeans-predict.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} C=}path/file
-{\tt{} spY=}path/file
-{\tt{} prY=}path/file
-{\tt{} fmt=}format
-{\tt{} O=}path/file
-
-}
-
-\smallskip
-\noindent{\bf Arguments}
-\begin{Description}
-\item[{\tt X}:]
-Location to read matrix $X$ with the input data records as rows
-\item[{\tt C}:] (default:\mbox{ }{\tt "C.mtx"})
-Location to store the output matrix with the best available cluster centroids as rows
-\item[{\tt k}:]
-Number of clusters (and centroids)
-\item[{\tt runs}:] (default:\mbox{ }{\tt 10})
-Number of parallel runs, each run with different initial centroids
-\item[{\tt maxi}:] (default:\mbox{ }{\tt 1000})
-Maximum number of iterations per run
-\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
-Tolerance (epsilon) for single-iteration WCSS\_C change ratio
-\item[{\tt samp}:] (default:\mbox{ }{\tt 50})
-Average number of records per centroid in data samples used in the centroid
-initialization procedure
-\item[{\tt Y}:] (default:\mbox{ }{\tt "Y.mtx"})
-Location to store the one-column matrix $Y$ with the best available mapping of
-records to clusters (defined by the output centroids)
-\item[{\tt isY}:] (default:\mbox{ }{\tt 0})
-{\tt 0} = do not write matrix~$Y$,  {\tt 1} = write~$Y$
-\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-see read/write functions in SystemML Language Reference for details.
-\item[{\tt verb}:] (default:\mbox{ }{\tt 0})
-{\tt 0} = do not print per-iteration statistics for each run, {\tt 1} = print them
-(the ``verbose'' option)
-\end{Description}
-\smallskip
-\noindent{\bf Arguments --- Scoring/Prediction}
-\begin{Description}
-\item[{\tt X}:] (default:\mbox{ }{\tt " "})
-Location to read matrix $X$ with the input data records as rows,
-optional when {\tt prY} input is provided
-\item[{\tt C}:] (default:\mbox{ }{\tt " "})
-Location to read matrix $C$ with cluster centroids as rows, optional
-when {\tt prY} input is provided; NOTE: if both {\tt X} and {\tt C} are
-provided, {\tt prY} is an output, not an input
-\item[{\tt spY}:] (default:\mbox{ }{\tt " "})
-Location to read a one-column matrix with the externally specified ``true''
-assignment of records (rows) to categories, optional for prediction without
-scoring
-\item[{\tt prY}:] (default:\mbox{ }{\tt " "})
-Location to read (or write, if {\tt X} and {\tt C} are present) a
-column-vector with the predicted assignment of rows to clusters;
-NOTE: No prior correspondence is assumed between the predicted
-cluster labels and the externally specified categories
-\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-Matrix file output format for {\tt prY}, such as {\tt text}, {\tt mm},
-or {\tt csv}; see read/write functions in SystemML Language Reference
-for details
-\item[{\tt O}:] (default:\mbox{ }{\tt " "})
-Location to write the output statistics defined in 
-Table~\ref{table:kmeans:predict:stats}, by default print them to the
-standard output
-\end{Description}
-
-
-\begin{table}[t]\small\centerline{%
-\begin{tabular}{|lcl|}
-\hline
-Name & CID & Meaning \\
-\hline
-{\tt TSS}             &     & Total Sum of Squares (from the total mean) \\
-{\tt WCSS\_M}         &     & Within-Cluster  Sum of Squares (means as centers) \\
-{\tt WCSS\_M\_PC}     &     & Within-Cluster  Sum of Squares (means), in \% of TSS \\
-{\tt BCSS\_M}         &     & Between-Cluster Sum of Squares (means as centers) \\
-{\tt BCSS\_M\_PC}     &     & Between-Cluster Sum of Squares (means), in \% of TSS \\
-\hline
-{\tt WCSS\_C}         &     & Within-Cluster  Sum of Squares (centroids as centers) \\
-{\tt WCSS\_C\_PC}     &     & Within-Cluster  Sum of Squares (centroids), \% of TSS \\
-{\tt BCSS\_C}         &     & Between-Cluster Sum of Squares (centroids as centers) \\
-{\tt BCSS\_C\_PC}     &     & Between-Cluster Sum of Squares (centroids), \% of TSS \\
-\hline
-{\tt TRUE\_SAME\_CT}  &     & Same-category pairs predicted as Same-cluster, count \\
-{\tt TRUE\_SAME\_PC}  &     & Same-category pairs predicted as Same-cluster, \% \\
-{\tt TRUE\_DIFF\_CT}  &     & Diff-category pairs predicted as Diff-cluster, count \\
-{\tt TRUE\_DIFF\_PC}  &     & Diff-category pairs predicted as Diff-cluster, \% \\
-{\tt FALSE\_SAME\_CT} &     & Diff-category pairs predicted as Same-cluster, count \\
-{\tt FALSE\_SAME\_PC} &     & Diff-category pairs predicted as Same-cluster, \% \\
-{\tt FALSE\_DIFF\_CT} &     & Same-category pairs predicted as Diff-cluster, count \\
-{\tt FALSE\_DIFF\_PC} &     & Same-category pairs predicted as Diff-cluster, \% \\
-\hline
-{\tt SPEC\_TO\_PRED}  & $+$ & For specified category, the best predicted cluster id \\
-{\tt SPEC\_FULL\_CT}  & $+$ & For specified category, its full count \\
-{\tt SPEC\_MATCH\_CT} & $+$ & For specified category, best-cluster matching count \\
-{\tt SPEC\_MATCH\_PC} & $+$ & For specified category, \% of matching to full count \\
-{\tt PRED\_TO\_SPEC}  & $+$ & For predicted cluster, the best specified category id \\
-{\tt PRED\_FULL\_CT}  & $+$ & For predicted cluster, its full count \\
-{\tt PRED\_MATCH\_CT} & $+$ & For predicted cluster, best-category matching count \\
-{\tt PRED\_MATCH\_PC} & $+$ & For predicted cluster, \% of matching to full count \\
-\hline
-\end{tabular}}
-\caption{The {\tt O}-file for {\tt Kmeans-predict} provides the output statistics
-in CSV format, one per line, in the following format: (NAME, [CID], VALUE).  Note:
-the 1st group statistics are given if {\tt X} input is available;
-the 2nd group statistics are given if {\tt X} and {\tt C} inputs are available;
-the 3rd and 4th group statistics are given if {\tt spY} input is available;
-only the 4th group statistics contain a nonempty CID value;
-when present, CID contains either the specified category label or the
-predicted cluster label.}
-\label{table:kmeans:predict:stats}
-\end{table}
-
-
-\noindent{\bf Details}
-\smallskip
-
-Our clustering script proceeds in 3~stages: centroid initialization,
-parallel $k$-means iterations, and the best-available output generation.
-Centroids are initialized at random from the input records (the rows of~$X$),
-biased towards being chosen far apart from each other.  The initialization
-method is based on the {\tt k-means++} heuristic from~\cite{ArthurVassilvitskii2007:kmeans},
-with one important difference: to reduce the number of passes through~$X$,
-we take a small sample of $X$ and run the {\tt k-means++} heuristic over
-this sample.  Here is, conceptually, our centroid initialization algorithm
-for one clustering run:
-\begin{Enumerate}
-\item Sample the rows of~$X$ uniformly at random, picking each row with probability
-$p = ks / n$ where
-\begin{Itemize}
-\item $k$~is the number of centroids, 
-\item $n$~is the number of records, and
-\item $s$~is the {\tt samp} input parameter.
-\end{Itemize}
-If $ks \geq n$, the entire $X$ is used in place of its sample.
-\item Choose the first centroid uniformly at random from the sampled rows.
-\item Choose each subsequent centroid from the sampled rows, at random, with
-probability proportional to the squared Euclidean distance between the row and
-the nearest already-chosen centroid.
-\end{Enumerate}
-The sampling of $X$ and the selection of centroids are performed independently
-and in parallel for each run of the $k$-means algorithm.  When we sample the
-rows of~$X$, rather than tossing a random coin for each row, we compute the
-number of rows to skip until the next sampled row as $\lceil \log(u) / \log(1 - p) \rceil$
-where $u\in (0, 1)$ is uniformly random.  This time-saving trick works because
-\begin{equation*}
-\Prob [k-1 < \log_{1-p}(u) < k] \,\,=\,\, p(1-p)^{k-1} \,\,=\,\,
-\Prob [\textrm{skip $k-1$ rows}]
-\end{equation*}
-However, it requires us to estimate the maximum sample size, which we set
-near~$ks + 10\sqrt{ks}$ to make it generous enough.
-
-Once we selected the initial centroid sets, we start the $k$-means iterations
-independently in parallel for all clustering runs.  The number of clustering runs
-is given as the {\tt runs} input parameter.  Each iteration of each clustering run
-performs the following steps:
-\begin{Itemize}
-\item Compute the centroid-dependent part of squared Euclidean distances from
-all records (rows of~$X$) to each of the $k$~centroids using matrix product;
-\item Take the minimum of the above for each record;
-\item Update the current within-cluster sum of squares (WCSS) value, with centroids
-substituted instead of the means for efficiency;
-\item Check the convergence criterion:\hfil
-$\textrm{WCSS}_{\mathrm{old}} - \textrm{WCSS}_{\mathrm{new}} < \eps\cdot\textrm{WCSS}_{\mathrm{new}}$\linebreak
-as well as the number of iterations limit;
-\item Find the closest centroid for each record, sharing equally any records with multiple
-closest centroids;
-\item Compute the number of records closest to each centroid, checking for ``runaway''
-centroids with no records left (in which case the run fails);
-\item Compute the new centroids by averaging the records in their clusters.
-\end{Itemize}
-When a termination condition is satisfied, we store the centroids and the WCSS value
-and exit this run.  A run has to satisfy the WCSS convergence criterion to be considered
-successful.  Upon the termination of all runs, we select the smallest WCSS value among
-the successful runs, and write out this run's centroids.  If requested, we also compute
-the cluster assignment of all records in~$X$, using integers from 1 to~$k$ as the cluster
-labels.  The scoring script can then be used to compare the cluster assignment with
-an externally specified category assignment.
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-We output the $k$ centroids for the best available clustering, i.~e.\ whose WCSS
-is the smallest of all successful runs.
-The centroids are written as the rows of the $k\,{\times}\,m$-matrix into the output
-file whose path/name was provided as the ``{\tt C}'' input argument.  If the input
-parameter ``{\tt isY}'' was set to~{\tt 1}, we also output the one-column matrix with
-the cluster assignment for all the records.  This assignment is written into the
-file whose path/name was provided as the ``{\tt Y}'' input argument.
-The best WCSS value, as well as some information about the performance of the other
-runs, is printed during the script execution.  The scoring script {\tt Kmeans-predict}
-prints all its results in a self-explanatory manner, as defined in
-Table~\ref{table:kmeans:predict:stats}.
-
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-\hml -f Kmeans.dml -nvargs X=/user/biadmin/X.mtx k=5 C=/user/biadmin/centroids.mtx fmt=csv
-
-}
-
-{\hangindent=\parindent\noindent\tt
-\hml -f Kmeans.dml -nvargs X=/user/biadmin/X.mtx k=5 runs=100 maxi=5000 
-tol=0.00000001 samp=20 C=/user/biadmin/centroids.mtx isY=1 Y=/user/biadmin/Yout.mtx verb=1
-
-}
-\noindent To predict {\tt Y} given {\tt X} and {\tt C}:
-
-{\hangindent=\parindent\noindent\tt
-\hml -f Kmeans-predict.dml -nvargs X=/user/biadmin/X.mtx
-         C=/user/biadmin/C.mtx prY=/user/biadmin/PredY.mtx O=/user/biadmin/stats.csv
-
-}
-\noindent To compare ``actual'' labels {\tt spY} with ``predicted'' labels given {\tt X} and {\tt C}:
-
-{\hangindent=\parindent\noindent\tt
-\hml -f Kmeans-predict.dml -nvargs X=/user/biadmin/X.mtx
-         C=/user/biadmin/C.mtx spY=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
-
-}
-\noindent To compare ``actual'' labels {\tt spY} with given ``predicted'' labels {\tt prY}:
-
-{\hangindent=\parindent\noindent\tt
-\hml -f Kmeans-predict.dml -nvargs spY=/user/biadmin/Y.mtx prY=/user/biadmin/PredY.mtx O=/user/biadmin/stats.csv
-
-}
-
-\smallskip
-\noindent{\bf References}
-\begin{itemize}
-\item
-D.~Aloise, A.~Deshpande, P.~Hansen, and P.~Popat.
-\newblock {NP}-hardness of {E}uclidean sum-of-squares clustering.
-\newblock {\em Machine Learning}, 75(2):245--248, May 2009.
-\item
-D.~Arthur and S.~Vassilvitskii.
-\newblock {\tt k-means++}: The advantages of careful seeding.
-\newblock In {\em Proceedings of the 18th Annual {ACM-SIAM} Symposium on
-  Discrete Algorithms ({SODA}~2007)}, pages 1027--1035, New Orleans~{LA},
-  {USA}, January 7--9 2007.
-\end{itemize}


[02/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1139] Updated the beginner's guide

Posted by de...@apache.org.
[SYSTEMML-1139] Updated the beginner's guide

The updated documentation reflect the installation steps as per commit
https://github.com/apache/incubator-systemml/commit/d225cbdc90e4d5f8e464182c237f5e4900467a38

Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/fa88464b
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/fa88464b
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/fa88464b

Branch: refs/heads/gh-pages
Commit: fa88464bab650ea0df736a2887391ce2847115c6
Parents: 313b1db
Author: Niketan Pansare <np...@us.ibm.com>
Authored: Wed Dec 7 14:49:02 2016 -0800
Committer: Niketan Pansare <np...@us.ibm.com>
Committed: Wed Dec 7 14:49:02 2016 -0800

----------------------------------------------------------------------
 beginners-guide-python.md | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/fa88464b/beginners-guide-python.md
----------------------------------------------------------------------
diff --git a/beginners-guide-python.md b/beginners-guide-python.md
index d0598aa..c919f3f 100644
--- a/beginners-guide-python.md
+++ b/beginners-guide-python.md
@@ -75,8 +75,8 @@ We are working towards uploading the python package on pypi. Until then, please
 ```bash
 git checkout https://github.com/apache/incubator-systemml.git
 cd incubator-systemml
-mvn post-integration-test -P distribution -DskipTests
-pip install src/main/python/dist/systemml-incubating-0.12.0.dev1.tar.gz
+mvn clean package -P distribution
+pip install target/systemml-0.12.0-incubating-SNAPSHOT-python.tgz
 ```
 
 The above commands will install Python package and place the corresponding Java binaries (along with algorithms) into the installed location.
@@ -214,10 +214,10 @@ digits = datasets.load_digits()
 X_digits = digits.data
 y_digits = digits.target 
 n_samples = len(X_digits)
-X_train = X_digits[:.9 * n_samples]
-y_train = y_digits[:.9 * n_samples]
-X_test = X_digits[.9 * n_samples:]
-y_test = y_digits[.9 * n_samples:]
+X_train = X_digits[:int(.9 * n_samples)]
+y_train = y_digits[:int(.9 * n_samples)]
+X_test = X_digits[int(.9 * n_samples):]
+y_test = y_digits[int(.9 * n_samples):]
 logistic = LogisticRegression(sqlCtx)
 print('LogisticRegression score: %f' % logistic.fit(X_train, y_train).score(X_test, y_test))
 ```
@@ -245,13 +245,13 @@ X_digits = digits.data
 y_digits = digits.target
 n_samples = len(X_digits)
 # Split the data into training/testing sets and convert to PySpark DataFrame
-df_train = sml.convertToLabeledDF(sqlContext, X_digits[:.9 * n_samples], y_digits[:.9 * n_samples])
-X_test = sqlCtx.createDataFrame(pd.DataFrame(X_digits[.9 * n_samples:]))
+df_train = sml.convertToLabeledDF(sqlContext, X_digits[:int(.9 * n_samples)], y_digits[:int(.9 * n_samples)])
+X_test = sqlCtx.createDataFrame(pd.DataFrame(X_digits[int(.9 * n_samples):]))
 logistic = LogisticRegression(sqlCtx)
 logistic.fit(df_train)
 y_predicted = logistic.predict(X_test)
 y_predicted = y_predicted.select('prediction').toPandas().as_matrix().flatten()
-y_test = y_digits[.9 * n_samples:]
+y_test = y_digits[int(.9 * n_samples):]
 print('LogisticRegression score: %f' % accuracy_score(y_test, y_predicted))
 ```
 
@@ -331,8 +331,8 @@ X_digits = digits.data
 y_digits = digits.target + 1
 n_samples = len(X_digits)
 # Split the data into training/testing sets and convert to PySpark DataFrame
-X_df = sqlCtx.createDataFrame(pd.DataFrame(X_digits[:.9 * n_samples]))
-y_df = sqlCtx.createDataFrame(pd.DataFrame(y_digits[:.9 * n_samples]))
+X_df = sqlCtx.createDataFrame(pd.DataFrame(X_digits[:int(.9 * n_samples)]))
+y_df = sqlCtx.createDataFrame(pd.DataFrame(y_digits[:int(.9 * n_samples)]))
 ml = sml.MLContext(sc)
 # Get the path of MultiLogReg.dml
 scriptPath = os.path.join(imp.find_module("systemml")[1], 'systemml-java', 'scripts', 'algorithms', 'MultiLogReg.dml')


[04/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1154] Transform examples missing quotes

Posted by de...@apache.org.
[SYSTEMML-1154] Transform examples missing quotes

The transformencode, transformdecode, and transformapply examples in the DML
Language Reference need quotes on the jspec read functions' first argument.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/a9695eb8
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/a9695eb8
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/a9695eb8

Branch: refs/heads/gh-pages
Commit: a9695eb86fae7f82375d1ee5aace6eea351209cb
Parents: 8b91758
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Thu Dec 15 16:35:32 2016 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Thu Dec 15 16:35:32 2016 -0800

----------------------------------------------------------------------
 dml-language-reference.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/a9695eb8/dml-language-reference.md
----------------------------------------------------------------------
diff --git a/dml-language-reference.md b/dml-language-reference.md
index 7437423..eefdc44 100644
--- a/dml-language-reference.md
+++ b/dml-language-reference.md
@@ -1694,7 +1694,7 @@ This example replaces values in specific columns to create a recoded matrix with
 The following DML utilizes the `transformencode()` function.
 
     F1 = read("/user/ml/homes.csv", data_type="frame", format="csv");
-    jspec = read(/user/ml/homes.tfspec_recode2.json, data_type="scalar", value_type="string");
+    jspec = read("/user/ml/homes.tfspec_recode2.json", data_type="scalar", value_type="string");
     [X, M] = transformencode(target=F1, spec=jspec);
     print(toString(X));
     if(1==1){}
@@ -1780,7 +1780,7 @@ The <code>transformdecode()</code> function can be used to transform a <code>mat
 The next example takes the outputs from the [transformencode](dml-language-reference.html#transformencode) example and reconstructs the original data using the same transformation specification. 
 
     F1 = read("/user/ml/homes.csv", data_type="frame", format="csv");
-    jspec = read(/user/ml/homes.tfspec_recode2.json, data_type="scalar", value_type="string");
+    jspec = read("/user/ml/homes.tfspec_recode2.json", data_type="scalar", value_type="string");
     [X, M] = transformencode(target=F1, spec=jspec);
     F2 = transformdecode(target=X, spec=jspec, meta=M);
     print(toString(F2));
@@ -1823,7 +1823,7 @@ The following example uses <code>transformapply()</code> with the input matrix a
     }
     
     F1 = read("/user/ml/homes.csv", data_type="frame", format="csv");
-    jspec = read(/user/ml/homes.tfspec_bin2.json, data_type="scalar", value_type="string");
+    jspec = read("/user/ml/homes.tfspec_bin2.json", data_type="scalar", value_type="string");
     [X, M] = transformencode(target=F1, spec=jspec);
     X2 = transformapply(target=F1, spec=jspec, meta=M);
     print(toString(X2));


[31/50] [abbrv] incubator-systemml git commit: Updated document to correspond to the currently released artifacts

Posted by de...@apache.org.
Updated document to correspond to the currently released artifacts

Closes #403


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/5c4e27c7
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/5c4e27c7
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/5c4e27c7

Branch: refs/heads/gh-pages
Commit: 5c4e27c701da1084d1e47d7ad049f9570033e7ae
Parents: 0fb74b9
Author: Nakul Jindal <na...@gmail.com>
Authored: Tue Feb 21 14:56:58 2017 -0800
Committer: Nakul Jindal <na...@gmail.com>
Committed: Thu Feb 23 13:20:27 2017 -0800

----------------------------------------------------------------------
 release-process.md | 146 ++++++++++++++++++++----------------------------
 1 file changed, 62 insertions(+), 84 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/5c4e27c7/release-process.md
----------------------------------------------------------------------
diff --git a/release-process.md b/release-process.md
index 1cc5c9f..a75a281 100644
--- a/release-process.md
+++ b/release-process.md
@@ -102,86 +102,64 @@ The build artifacts should be downloaded from [https://dist.apache.org/repos/dis
 this OS X example.
 
 	# download artifacts
-	wget -r -nH -nd -np -R index.html* https://dist.apache.org/repos/dist/dev/incubator/systemml/0.11.0-incubating-rc1/
+	wget -r -nH -nd -np -R 'index.html*' https://dist.apache.org/repos/dist/dev/incubator/systemml/0.13.0-incubating-rc1/
 
 	# verify standalone tgz works
-	tar -xvzf systemml-0.11.0-incubating-standalone.tgz
-	cd systemml-0.11.0-incubating-standalone
+	tar -xvzf systemml-0.13.0-incubating-bin.tgz
+	cd systemml-0.13.0-incubating-bin
 	echo "print('hello world');" > hello.dml
 	./runStandaloneSystemML.sh hello.dml
 	cd ..
 
-	# verify main jar works
-	mkdir lib
-	cp -R systemml-0.11.0-incubating-standalone/lib/* lib/
-	rm lib/systemml-0.11.0-incubating.jar
-	java -cp ./lib/*:systemml-0.11.0-incubating.jar org.apache.sysml.api.DMLScript -s "print('hello world');"
-
-	# verify src works
-	tar -xvzf systemml-0.11.0-incubating-src.tgz
-	cd systemml-0.11.0-incubating-src
-	mvn clean package -P distribution
-	cd target/
-	java -cp ./lib/*:systemml-0.11.0-incubating.jar org.apache.sysml.api.DMLScript -s "print('hello world');"
-	java -cp ./lib/*:SystemML.jar org.apache.sysml.api.DMLScript -s "print('hello world');"
-	cd ..
+	# verify standalon zip works
+	rm -rf systemml-0.13.0-incubating-bin
+	unzip systemml-0.13.0-incubating-bin.zip
+	cd systemml-0.13.0-incubating-bin
+	echo "print('hello world');" > hello.dml
+	./runStandaloneSystemML.sh hello.dml
 	cd ..
 
-	# verify distrib tgz works
-	tar -xvzf systemml-0.11.0-incubating.tgz
-	cd systemml-0.11.0-incubating
-	java -cp ../lib/*:SystemML.jar org.apache.sysml.api.DMLScript -s "print('hello world');"
-
-	# verify spark batch mode
-	export SPARK_HOME=/Users/deroneriksson/spark-1.5.1-bin-hadoop2.6
-	$SPARK_HOME/bin/spark-submit SystemML.jar -s "print('hello world');" -exec hybrid_spark
-
-	# verify hadoop batch mode
-	hadoop jar SystemML.jar -s "print('hello world');"
-
-
-Here is an example of doing a basic
-sanity check on OS X after building the artifacts manually.
-
-	# build distribution artifacts
-	mvn clean package -P distribution
-
-	cd target
-
-	# verify main jar works
-	java -cp ./lib/*:systemml-0.11.0-incubating.jar org.apache.sysml.api.DMLScript -s "print('hello world');"
-
-	# verify SystemML.jar works
-	java -cp ./lib/*:SystemML.jar org.apache.sysml.api.DMLScript -s "print('hello world');"
-
 	# verify src works
-	tar -xvzf systemml-0.11.0-incubating-src.tgz
-	cd systemml-0.11.0-incubating-src
+	tar -xvzf systemml-0.13.0-incubating-src.tgz
+	cd systemml-0.13.0-incubating-src
 	mvn clean package -P distribution
 	cd target/
-	java -cp ./lib/*:systemml-0.11.0-incubating.jar org.apache.sysml.api.DMLScript -s "print('hello world');"
-	java -cp ./lib/*:SystemML.jar org.apache.sysml.api.DMLScript -s "print('hello world');"
-	cd ..
-	cd ..
-
-	# verify standalone tgz works
-	tar -xvzf systemml-0.11.0-incubating-standalone.tgz
-	cd systemml-0.11.0-incubating-standalone
-	echo "print('hello world');" > hello.dml
-	./runStandaloneSystemML.sh hello.dml
-	cd ..
-
-	# verify distrib tgz works
-	tar -xvzf systemml-0.11.0-incubating.tgz
-	cd systemml-0.11.0-incubating
-	java -cp ../lib/*:SystemML.jar org.apache.sysml.api.DMLScript -s "print('hello world');"
+	java -cp "./lib/*:systemml-0.13.0-incubating.jar" org.apache.sysml.api.DMLScript -s "print('hello world');"
+	java -cp "./lib/*:SystemML.jar" org.apache.sysml.api.DMLScript -s "print('hello world');"
+	cd ../..
 
 	# verify spark batch mode
-	export SPARK_HOME=/Users/deroneriksson/spark-1.5.1-bin-hadoop2.6
-	$SPARK_HOME/bin/spark-submit SystemML.jar -s "print('hello world');" -exec hybrid_spark
+	export SPARK_HOME=~/spark-2.1.0-bin-hadoop2.7
+	cd systemml-0.13.0-incubating-bin/target/lib
+	$SPARK_HOME/bin/spark-submit systemml-0.13.0-incubating.jar -s "print('hello world');" -exec hybrid_spark
 
 	# verify hadoop batch mode
-	hadoop jar SystemML.jar -s "print('hello world');"
+	hadoop jar systemml-0.13.0-incubating.jar -s "print('hello world');"
+
+
+	# verify python artifact
+	# install numpy, pandas, scipy & set SPARK_HOME
+	pip install numpy
+	pip install pandas
+	pip install scipy
+	export SPARK_HOME=~/spark-2.1.0-bin-hadoop2.7
+	# get into the pyspark prompt
+	cd systemml-0.13.0
+	$SPARK_HOME/bin/pyspark --driver-class-path systemml-java/systemml-0.13.0-incubating.jar
+	# Use this program at the prompt:
+	import systemml as sml
+	import numpy as np
+	m1 = sml.matrix(np.ones((3,3)) + 2)
+	m2 = sml.matrix(np.ones((3,3)) + 3)
+	m2 = m1 * (m2 + m1)
+	m4 = 1.0 - m2
+	m4.sum(axis=1).toNumPy()
+
+	# This should be printed
+	# array([[-60.],
+	#       [-60.],
+	#       [-60.]])
+
 
 
 ## Python Tests
@@ -229,8 +207,8 @@ The project should be built using the `src` (tgz and zip) artifacts.
 In addition, the test suite should be run using an `src` artifact and
 the tests should pass.
 
-	tar -xvzf systemml-0.11.0-incubating-src.tgz
-	cd systemml-0.11.0-incubating-src
+	tar -xvzf systemml-0.13.0-incubating-src.tgz
+	cd systemml-0.13.0-incubating-src
 	mvn clean package -P distribution
 	mvn verify
 
@@ -246,13 +224,14 @@ standalone distributions.
 Here is an example based on the [Standalone Guide](http://apache.github.io/incubator-systemml/standalone-guide.html)
 demonstrating the execution of an algorithm (on OS X).
 
-	$ tar -xvzf systemml-0.11.0-incubating-standalone.tgz
-	$ cd systemml-0.11.0-incubating-standalone
-	$ wget -P data/ http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data
-	$ echo '{"rows": 306, "cols": 4, "format": "csv"}' > data/haberman.data.mtd
-	$ echo '1,1,1,2' > data/types.csv
-	$ echo '{"rows": 1, "cols": 4, "format": "csv"}' > data/types.csv.mtd
-	$ ./runStandaloneSystemML.sh scripts/algorithms/Univar-Stats.dml -nvargs X=data/haberman.data TYPES=data/types.csv STATS=data/univarOut.mtx CONSOLE_OUTPUT=TRUE
+	tar -xvzf systemml-0.13.0-incubating-bin.tgz
+	cd systemml-0.13.0-incubating-bin
+	wget -P data/ http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data
+	echo '{"rows": 306, "cols": 4, "format": "csv"}' > data/haberman.data.mtd
+	echo '1,1,1,2' > data/types.csv
+	echo '{"rows": 1, "cols": 4, "format": "csv"}' > data/types.csv.mtd
+	./runStandaloneSystemML.sh scripts/algorithms/Univar-Stats.dml -nvargs X=data/haberman.data TYPES=data/types.csv STATS=data/univarOut.mtx CONSOLE_OUTPUT=TRUE
+	cd ..
 
 
 ## Single-Node Spark
@@ -263,13 +242,13 @@ Verify that SystemML runs algorithms on Spark locally.
 
 Here is an example of running the `Univar-Stats.dml` algorithm on random generated data.
 
-	$ tar -xvzf systemml-0.11.0-incubating.tgz
-	$ cd systemml-0.11.0-incubating
-	$ export SPARK_HOME=/Users/deroneriksson/spark-1.5.1-bin-hadoop2.6
-	$ $SPARK_HOME/bin/spark-submit SystemML.jar -f scripts/datagen/genRandData4Univariate.dml -exec hybrid_spark -args 1000000 100 10 1 2 3 4 uni.mtx
-	$ echo '1' > uni-types.csv
-	$ echo '{"rows": 1, "cols": 1, "format": "csv"}' > uni-types.csv.mtd
-	$ $SPARK_HOME/bin/spark-submit SystemML.jar -f scripts/algorithms/Univar-Stats.dml -exec hybrid_spark -nvargs X=uni.mtx TYPES=uni-types.csv STATS=uni-stats.txt CONSOLE_OUTPUT=TRUE
+	cd systemml-0.13.0-incubating-bin/lib
+	export SPARK_HOME=~/spark-2.1.0-bin-hadoop2.7
+	$SPARK_HOME/bin/spark-submit systemml-0.13.0-incubating.jar -f ../scripts/datagen/genRandData4Univariate.dml -exec hybrid_spark -args 1000000 100 10 1 2 3 4 uni.mtx
+	echo '1' > uni-types.csv
+	echo '{"rows": 1, "cols": 1, "format": "csv"}' > uni-types.csv.mtd
+	$SPARK_HOME/bin/spark-submit systemml-0.13.0-incubating.jar -f ../scripts/algorithms/Univar-Stats.dml -exec hybrid_spark -nvargs X=uni.mtx TYPES=uni-types.csv STATS=uni-stats.txt CONSOLE_OUTPUT=TRUE
+	cd ..
 
 
 ## Single-Node Hadoop
@@ -280,7 +259,8 @@ Verify that SystemML runs algorithms on Hadoop locally.
 
 Based on the "Single-Node Spark" setup above, the `Univar-Stats.dml` algorithm could be run as follows:
 
-	$ hadoop jar SystemML.jar -f scripts/algorithms/Univar-Stats.dml -nvargs X=uni.mtx TYPES=uni-types.csv STATS=uni-stats.txt CONSOLE_OUTPUT=TRUE
+	cd systemml-0.13.0-incubating-bin/lib
+	hadoop jar systemml-0.13.0-incubating.jar -f ../scripts/algorithms/Univar-Stats.dml -nvargs X=uni.mtx TYPES=uni-types.csv STATS=uni-stats.txt CONSOLE_OUTPUT=TRUE
 
 
 ## Notebooks
@@ -313,5 +293,3 @@ has been approved.
 
 To be written. (What steps need to be done? How is the release deployed to the central maven repo? What updates need to
 happen to the main website, such as updating the Downloads page? Where do the release notes for the release go?)
-
-


[28/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1259] Replace append with cbind for matrices

Posted by de...@apache.org.
[SYSTEMML-1259] Replace append with cbind for matrices

Replace matrix append calls with cbind calls.

Closes #391.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/ba2819bc
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/ba2819bc
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/ba2819bc

Branch: refs/heads/gh-pages
Commit: ba2819bce02500a374c7e7fe957bb678efebf277
Parents: 0f92f40
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Tue Feb 14 16:14:16 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Tue Feb 14 16:14:16 2017 -0800

----------------------------------------------------------------------
 dml-language-reference.md | 1 -
 1 file changed, 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/ba2819bc/dml-language-reference.md
----------------------------------------------------------------------
diff --git a/dml-language-reference.md b/dml-language-reference.md
index 22ec0d9..fca2b9b 100644
--- a/dml-language-reference.md
+++ b/dml-language-reference.md
@@ -639,7 +639,6 @@ The builtin function `sum` operates on a matrix (say A of dimensionality (m x n)
 
 Function | Description | Parameters | Example
 -------- | ----------- | ---------- | -------
-append() | Adds the second argument as additional columns to the first argument (note that the first argument is not over-written). Append is meant to be used in situations where one cannot use left-indexing. <br/> **NOTE: append() has been replaced by cbind(), so its use is discouraged.** | Input: (X &lt;matrix&gt;, Y &lt;matrix&gt;) <br/>Output: &lt;matrix&gt; <br/> X and Y are matrices (with possibly multiple columns), where the number of rows in X and Y must be the same. Output is a matrix with exactly the same number of rows as X and Y. Let n1 and n2 denote the number of columns of matrix X and Y, respectively. The returned matrix has n1+n2 columns, where the first n1 columns contain X and the last n2 columns contain Y. | A = matrix(1, rows=2,cols=5) <br/> B = matrix(1, rows=2,cols=3) <br/> C = append(A,B) <br/> print("Dimensions of C: " + nrow(C) + " X " + ncol(C)) <br/> The output of above example is: <br/> Dimensions of C: 2 X 8
 cbind() | Column-wise matrix concatenation. Concatenates the second matrix as additional columns to the first matrix | Input: (X &lt;matrix&gt;, Y &lt;matrix&gt;) <br/>Output: &lt;matrix&gt; <br/> X and Y are matrices, where the number of rows in X and the number of rows in Y are the same. | A = matrix(1, rows=2,cols=3) <br/> B = matrix(2, rows=2,cols=3) <br/> C = cbind(A,B) <br/> print("Dimensions of C: " + nrow(C) + " X " + ncol(C)) <br/> Output: <br/> Dimensions of C: 2 X 6
 matrix() | Matrix constructor (assigning all the cells to numeric literals). | Input: (&lt;init&gt;, rows=&lt;value&gt;, cols=&lt;value&gt;) <br/> init: numeric literal; <br/> rows/cols: number of rows/cols (expression) <br/> Output: matrix | # 10x10 matrix initialized to 0 <br/> A = matrix (0, rows=10, cols=10)
  | Matrix constructor (reshaping an existing matrix). | Input: (&lt;existing matrix&gt;, rows=&lt;value&gt;, cols=&lt;value&gt;, byrow=TRUE) <br/> Output: matrix | A = matrix (0, rows=10, cols=10) <br/> B = matrix (A, rows=100, cols=1)


[07/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1185] SystemML Breast Cancer Project

Posted by de...@apache.org.
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/cc6f3c7e/img/projects/breast_cancer/approach.svg
----------------------------------------------------------------------
diff --git a/img/projects/breast_cancer/approach.svg b/img/projects/breast_cancer/approach.svg
new file mode 100644
index 0000000..3c57460
--- /dev/null
+++ b/img/projects/breast_cancer/approach.svg
@@ -0,0 +1,4 @@
+<?xml version="1.0" standalone="yes"?>
+
+<svg version="1.1" viewBox="0.0 0.0 960.0 540.0" fill="none" stroke="none" stroke-linecap="square" stroke-miterlimit="10" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"><clipPath id="g18f7cf5d33_0_17.0"><path d="m0 0l960.0 0l0 540.0l-960.0 0l0 -540.0z" clip-rule="nonzero"></path></clipPath><g clip-path="url(#g18f7cf5d33_0_17.0)"><path fill="#000000" d="m0 0l960.0 0l0 540.0l-960.0 0z" fill-rule="nonzero"></path><path fill="#000000" fill-opacity="0.0" d="m254.28084 371.81628l701.798 0l0 164.0315l-701.798 0z" fill-rule="nonzero"></path><g transform="matrix(0.9320026246719161 0.0 0.0 0.9319971128608923 254.28084015748033 371.81627296587925)"><clipPath id="g18f7cf5d33_0_17.1"><path d="m0 0l753.0 0l0 176.0l-753.0 0z" clip-rule="nonzero"></path></clipPath><image clip-path="url(#g18f7cf5d33_0_17.1)" fill="#000" width="753.0" height="176.0" x="0.0" y="0.0" preserveAspectRatio="none" xlink:href="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAvEAAACwCAYAAACGozEBA
 ACAAElEQVR42uzdCbhFVV33cSttsEG0UrMSxSm1RFMTtVJDMURt0CS0Ii1DKxXFFFFSE0SBRFI0RSwEp6RBRTE0isCcAgu1wVBU0IwmSa1sOu/z2e/7u+9it8+955x77j3D/a/nWc/eZ5999rD2Pnt/12/91n9dZ7Qi6bzzzht98pOf3LX9XX755aOPfOQjo0qLSxdddNHoiiuu2Jh/9rOfPXrjG984+u///u+x1+m000671vebJdv6n//5n8Hv/vqv/3p01FFHjb74xS/WhahUqVKlSpUqLV26zsoc6HWuM3rmM5+5a/uzryOPPLLukAWlf/mXf+muOSD/uZ/7uW7++7//+7spuB53nd75zneObnGLW0xUQcj2++nqq6/uvttnn326+UqVKlWqVKlSpYL4GdO+++47OuGEE+a6TYB21VVXDX5nX7tZaah07QSggXbmqeZJ//RP/7TpdQLgYH6z626dcRD/G7/xG90+53k/VapUqVKlSpUq7WmIZ38IgCVTbfvLo9ZmeeCK3eLEE0/s7BLt+n2YCxz+53/+Zwd0UWZ9Nv+Wt7ylWy+KMYCz7WwPCGZ/jiUq8gc+8IG667ZILDTtNUnZDYH5EMQre+sPpVy/rNO/7u6TXEP3iftn//3331jmOg/dE/376b/+678G7zvJ9sxnXfsYuneGjksrgylrUbsNtp+h4xr3v2jLVW4rRpUqVapUqVKlgvgdg/jDDjusAxdQA2iy3DzYM//gBz94Y3m81cDHOoEiwATAxsGh78BO1Fu/D3BJoAgwXXbZZd0y6wfoAH7WBf1sIWXP2Dops1YJb0HadCuIz7UdUtlt1zb+/d//feMatcm+3DcBW/eZe8y9Fmgeuif699O4+y4Q7zv3jH2Nu3eGzknrRNb3W8eYYxg6rnH/i6yjUqmVo1oQKlWqVKlSpYL4XYH4QEfgqL88MATkxsEUoJvUThN12O+zzQA6qMo8UI+H2zFYPolHu9J4iE+Kjx
 1YTwLxQL1NrXK9WStMe82ifLetKVm/vSf699NmEG+9tkVm3L3TP6e2TNp73Xaz7f5xbfW/aL+rVKlSpUqVKhXE7zjE9+GovzwgF/tEa2toYSrrj4P4/J56KQfYbCMQ1AJifNGmQKmFt0qzQXyrSqflYxaIdy1ti6oeJXqoZaTdf1p62us67p5o76dA8iT33bh7Z+icNqsgDB3XuP+F7fsuFZOh1qhKlSpVqlSpUkH8rkE8tRa8ga9YL2JniWe4hS7KJ0tCP5Rg4DCqpmTdeIpjx4g3PhYHlgXfsVUApoL47UN8ytl1ck1bS820dpqk3C9DISbb/cdaA6xVANwDH/vYxwbvif79tNl910L8uHtnHMSPa10aulfH/S9yb+Y46x6tVKlSpUqVCuJ3DOJPPfXULe00yWAFPElAqv0uMMXSMM5SEDhMx9V0Fsy68Re3ENh2Tkxnw4L46VMfwtMJdagT5hDEt30WtoL4IdBv7TT9jq2gftw90b+fxt13Qy1AQ/dOv0xSsRj3Hxg6riuvvHLwfxF4r46tlSpVqlSpUkH8QlPAhqq5KwX3/zzx446l0vbLNyEmpy3XnYgCtFvXdN772ep/UfdqpUqVKlWqVBC/0NS3GOxkSgfLKP2V5p9aa8g0CbzPEuN9XdNu/i8qVapUqVKlSgXxU6d00uuH5tuJxGs8FMu70nzTuPCfmyVe8H6H1r2cdvN/UalSpUqVKlUqiK9UqVKlSpUqVapUqVJBfKVlTSwfRjhlT/qP//iPykuUXRPXZrNIP5UqVapUqVKlgvhKeyyxewDFf/u3f+ssH6Kk/MM//MPo7//+7ysvMLsGroVr4tq4RtUJtlKlSpUqVSqIr1SpSxRekPiP//iPg6OpVl58dm1co1LjK1WqVKlSpYL4SpW6lFFUP/nJTxYwL2l2bVyjisZUqVKlSpUqFcRXqtQlvmu2jb/8y7/cgMZKS/Jw+H/Xw7VxjVyrSpUqVapUqVJBfKVKHRj
 yX//5n/95QfySQrxr4xoVxFeqVKlSpUoF8ZUqbUC8jpR/9md/VhC/pBDv2rhGBfGVKlWqVKlSQXylSgXxBfE7noYG/rLsi1/84tiOuv3l0w4etlXfAfuuSD+zX8dcv6FybcveetOU81Ydt21raL+Vdia5Hv1r4nP/OvevT3vN53297Lf6BlVaGYiv+N0V97sgviB+VSH+sMMO60bwTbriiitG++yzz//quHvZZZdtrHPqqad2y/xH8gz0WZ+ASdKDH/zgbkTncWn//ffvtrfZOpWunYzanOeC8m2vnety1VVXXWvZs5/97G7dD3zgAxM/T4wObd1x0Kdjd7Y/7SjSlaZPQFxZu7b5H/o/9/+7+R/l+vX/00arPvHEEyfa5y1ucYsuj0tvfOMbN7ZfkboqLT3EV/zuivtdEF8Qv6oQH7BLuuiiizbO5y1veUv3/Tvf+c7upd0CdR/iJetNcl8GJMZBQ47J9mZRCPeqeh9wDkS5lsrStbMctKucAbdAfkDbOoH6ra6b344rYzBoHyqCswBctbxMl1S+XQ/Ju6qtoLkGrrV1AtQB+lTc5DCM+auvvnpLgM+zYbNKhf1vta1KlZYC4ofid3/t137t6BnPeMboRS960eg1r3nN6Mwzz+wegKeffnr3gP293/u90Tve8Y7RH/7hH3b57W9/e/fCMn/BBReM3v3ud48uvvji0Z3vfOfR9a9//W6bN7/5zUe3v/3tR/e6171GN73pTUdf9VVfNfryL//y0bHHHjt6wxve0O3/b//2b0d/8zd/s5Evv/zy7iEt57PpX/3VX43+4i/+osvm8/nSSy/t/vSibPhsXWHzPv3pT3cvg4997GOjj370oxvr57cf+tCHNtaXs0/rtscjf/zjH++yB8ynPvWpbruf+cxnun383d/9XQdBn/vc50bXXHPN6POf//zoC1/4wrWyZQD7n//5n7tz/uxnP9tty/7+9E//tHtxnX/++V2ZK+u3vvWtoze9
 6U2jV73qVaMnPelJo6c85Smjpz3taaNf+ZVfGT396U8f/dIv/dLoOc95TrfsiCOOGH3lV37lUsb9ngbiq2Vod3Ouh/vPf/Bf//VfV6aFB3Cddtpp14IAy4Zgqj3mIYjPC9x/cFzyfEl5Zb/9BO7797dtb3VMUZVnUe/72570+kzyu2mv9Sz3BqhKmQW2h5T6dh+tGh8Vf9y+CRu+p/JudT/1K2dD125o2SzXbpJ7Yjev3W5WQlRwXRPvQykKvPdhP8Xa0ram9a+532sBG5f8X7f6f+cZ0D+GScvWMbTHOEuZj3tWbPVfm/Q+rUromkH8UPxuEP/Lv/zLo1/7tV/rYPKMM87oIPKVr3zl6LWvfW0Hl29+85tHb3vb20bnnXfe6Pd///c7ePfnAPLvf//7Rx/84Ae7B9pDH/rQbpu3vOUtR3e6051GBx988Og7vuM7OpAH8Y985CNHxx13XKcgg+EWnM23EA/C5YA8WM8UhAPEAHmg/BOf+ESX/SkD8Fm/rQDIlgfeTfO97fi9rCIRiFdmIB7Ag3GQCpwBvDIF7B5UyeDIFMxbx4sF+KsMOC9/fi8C5QjeU7YqOcpexeqYY47prs0LX/jCjfy85z2vg/tHP/rRo+td73pLGfd7UoivlqHdz7kef/RHf9Td90B+FVp4KGWt+hYFPE3zm6UhiJeoguOa2gMdYAAsDCnxngspT0CYbWZZfhMAyXp83W0rQvaTfbVKIsiVLG+3TUhprQDWHQKNlFMsR6btvqOMDh1jICfZMaS8Uz6tBWYaFb4P5LYdwBuCl5xzC18pm35KOTnWca0jqXy1xzJ07bzXsiw2rmmuXe7b9lo5pkVeu1Sisu/dSMo474Fcz9YWt1XKfy3llAr20PVNBd+5bea1b8vH9lP5a5f5bWvX885uKyHJni1thSDX3fG6DxxLfpPK5dD9NlQZiV0vZZYKStvKMHSf9i1plqf82n27FyutCMQPxe/++q//+g6sX/ayl
 3U326//+q93GUj6TJ3/rd/6rU4h/t3f/d3Ru971rtGFF17Yqe8gnqIHin13/PHHd9u87W1vO7rnPe85etSjHjW6973vPbrjHe/YQfxd7nKXDuRBMBgOxLcQHZAHugA6inqWZ11/4g9/+MOjj3zkI9eC71QIAut9Fd/vfVYB8Nm67fp+D7RVBlKZ6Kvx/qCBnAB8IB68Ax8PUlPLrAPkU3m58soru+MQ5s9D4T3vec9Geao4KXtl6bqccMIJoxe/+MXdMtmyJz7xiV3ZfsVXfMVSxv2eFOJrZNca2XVa9TaViUDQJJWLcRCfl2H/3NoXvJe4F7AX4JDy5WXoJe+ZECuA/3ugw3MKwDjeeLC9sPOS90L3TPG8cDypDAQE83LPizzb88yI0ui31g3UDYGg/bfgbXlsCo5l6BhbWPa8Yj8JOAWCnadrM6kdIQDXKpgtmA7ZZIa87X3LVH/7bXbc4yppykz5DV27PL+UVY4xz/9Jr13K32fb9ZtA96KuneN0P5n3292ooLfe9BxbgHja30vKbZyS31auk4cqfKkIuHY5poBung0pO2Xl2uUap/xt1/XIfZfjybVrt+W6uOZ+M+5ZMa5FwfZy3zge559KQfbdv09zDLab5Y4/90mOPZXjSisC8f343awuL3jBCzbA/aUvfeno5S9/+eiss84a/eZv/mZ3oU3B/Ote97pOMWavkVlBvEwA8Xvf+97Rq1/96m6b97jHPUY/8iM/0gHn4x//+NGhhx46+qZv+qbO/sFmY11wDGj91k0GxluVPJaaKOJDsJ9p1pWBfbaZ7QX2A/OBePuJwp9Kge1ln36T72XLwH1sNP7YsdGYb8HdfLLvY6tR/h68KgS2lwqIqeuiMgTkXYOXvOQlG5Ury1yDZz3rWaMjjzyyU+JVjJYx7vekEF8ju9bIrpOmvMxyHHnxTaIAj4P4vIj7EN+qtEMd7vrHlRd7QAmQZT5Qah+peATMW0jvb6v/fX/dgK39tK
 raOBAMqDmHFoZa8OgfY78jYta371a1nKZTaI6n/xv7Srm3Hubsp69UgtAhFTnHrCzdK+n7MARIfh+b1GbXzjFE6Wx9+ZNcu5xvW8lZ9LXLf8n+d+t/3fZjaL3o0/z32zLMNob87lnfNXd+gd2hykrKKPeZa5gWurb8bKtfkWi/71tzhiC+TZvdb/1nUXuftfts1f6h+7R//7StGTm/cTbBSksM8X246kN8wPHss8/uoFxmsTEFkTzy7B/sNR6QgXgedeBvm/e5z31GD3/4wzs7yE//9E+PfviHf7jzyYPOb/mWbxn99m//dgedjiXQDaxbkO9DvNz3rUehb5cF4FtIt90W4rOsVeqtH4jPtn0OxLeWHbBMWae6B+AngXjgT9l03v5gFP7sz9QxnHvuuV3LRyxNpqlQ6adw9NFHj5761Kd2nvhWiV+mjoqTQnyN7LrED5Ela+Hpq+Ztk/nQy7m1ZoyD+Kj7Q0p8QuGZ36wjZQtv1vNC9jv/e8doPirYEMS3L9FpID7lkWdN9jkOBHOO7T5a8Bg6xnEgmOZ324yiuJn3eMgWFdjx3GuvXwsz2f+Qkg7ihzzRUSXbSsI4JbZfvkPXLpDdKpzTXLt++S/LtbM8NpHdAPn+fyjgOqSkty0uAc++WryZmt9vpRvXh6Itw7R0KEvrKZO8v9Pq4nkxVP7t59x3LWD3743N7rchiO//H4YqCkP36WYQ35bT0P+r0gpBPHW8hXjeeDcdcATuwBHUv+IVr+jm+eb5tvnk2T/8iaJ0s9zw2D/wgQ/sbqqDDjpo9F3f9V2dvYY67zv5sY99bPeHAMKt3aWF6hbi448HvfG9y21n2PzeucnmA/8B9+R44a0D+K3fQr7zkdNCkByIjxc+vvfkL33pSxvxjvMAsCze+Bbk0/m2PWf7Y1dSSVL+ssrU61//+tEpp5yy4Yfnk9fJNR2JVxnia2TX5Yb4ZWnhycupBb624ylI89k0cLIVxAemtrIT9AF
 6HMTH95tmaqqY38XPnnK1PDYd2bp+EyVapSkVjHEQ33be9HvrD0HtZiDYWgCGjjEdof3GsyuKpudZfOCBlUlV1T58pcxcu0BFQDOgp6KQ8sm1cg7j9pnrD/xyzEMtNm2ZDl07fZBybLEgBCYnvXZDEL/oa6dc/S4tArsRmaXfctLanlxf94OyDYwGbtv/t3PJvZHrMa4SEDhtI9uMuxezjTaUJUZwvKyvKbdcV9cvx++Y7MMy12bI6jIE8eOeFbNAfASN/n3a2mkC8InGlD4orUWo0opCPKg+6aSTOsWX2gsWZYq8mw/Am/reDQcsQTzbB0vN+973vg54+bZ//ud/vtvmjW50o64zqw6ubDVuFDCf5SBeJYC9ZMiz3lpkWk+6PxQvfSLQaOpvO6H6XQA8fnfLWlvMkK0GqPje+tlOwD9RasB7OrUC+FaBbzuzxhMP3pOzPP54IG9btpsWCJ2D2Yz8+ViWfud3fqdrsfDQUpl67nOf20WoMWVTMu/arTrEVyjK5Yb4ZbmvAj594PAcaDt+pRNfq9B5dg1BvBfuZhEuJoH4VBqS+h3e4jltOx4GVFvvbl7++T7nFAW5r/xK7XbHdVAbAsGccwsFQ8foOT8u/n4LR23UkUnvrYCJ65mm/YCt42o7frY5LSRD6mxSfOb9zn9D17Ut06Fr149hn0rfpNeub3lYhmuX1pO2c+xutaS1CXy2x9KPBNVew35H3FTqx7UitPdn+qyMg/i0VriH206syqbti5Dvoly3tru2n0K7biB+yPo1dL9tV4nvb69fdm3rwlb/j0orAvE6trpRvOio7FQrEA/K3Xy82VR6IOl7vngPHJAJNgPxOmL+5E/+ZLfNG9zgBqOb3OQmo9vc5jajH/3RH+06swpBCeK//du/vfN1U5xjpQCyQ6Ee+xCfEI+yeVCfTqgB+X40mlhnWltMfPcB6Nhm0qG2DW/ps+3bV8JKthAfQA+kbwbxsd/4vXOw3Ryna8Ka
 5IGscsS2dM4553QPYy0gwkw+85nP7NR4YSYp8qxQBfGV9gLER80bF9Fi2vBqUa8miW4zS9rsWIY6yE7622n3tZ3j71sE5rXvoRj901y/ANxWadbO2OOux9Dx7bVrN2vK/21IOc+1muZ4PAsmCfE57/j/Q5ac7ZbrvK5D/z4NxI8b5bhCTK4BxH/Zl31ZB/EAHKjzvAN3lhowz17DD5+OlYASwCdSDV+82icAZvHQgdV2+d5ZaHjjhZl80IMeNLrf/e7XQfx+++3XhagEySA+sOxhE/tLwD3RacZBPDW7/RyLim22kE5tby00scq0nVYD9tmPyoH1gbbPFHg2GMfcxoVv1fd+dJp45NuczrCJVKM1wTE7RhCvszCIVztWYdI64pqoaD3/+c/fAPjHPOYxa+GJL4gviJ9WjZ+Hh5e6VV7QraFgM/Da7v01a1SMaSOb1LVbjjROkZ42jWvdqHTtNGTlqrRmEB9PPAAH7dT2hDKkzCciDXinvFOHEzMebCZOPCgGoH5ve9R2Vhox4u9+97t3YSbvf//7d8uEmQTxwDWeeBAdUE+O7x3kZr6105hPtk6+y/rpnBplP9sN2MttJ9ZAfGwz6XiaYwnIe4CAeVYgzW+tGt9X5vvZ987ZbxNm0/E6fx2DDZylXA2oxUrT9klgbzIgl8g0P/uzPzt6xCMesRGdRmWsIL7SukO8RETYahCfrVKak0uJWhwIxqYwbSJuTBNfvK7dFUt1XAm1uZ1EhZ92kKWC+EprC/EGDGJvYZ8BjeA96nsAHlB6eIqcAuAz4BPoTMdWf0zfsdJ867d+axeNxh+WjeZ7vud7Rg94wAM6qDevIgCaW4gfgndqeEA981Hc+3aaKPPxyqczbHK223YkDeQndrzjsI4BcGJ3yXHZj+37LoPhOP4+xG+WKfB+5xokXrx9KD/l+Md//Mdd2YJ4VppcD5UqtiYVrV/4hV/oIv7oaxCIN113iK+RXWtk10qVpk2r9Nyo/
 1GlSpWmhni+ahYNgBhgpP6CeJFRQDwFnkIM3GUKfCLTqBEDYYBrmZFadeoQF15MeB1aqe/3ve99O6CXAShIBQZRzYF0wD1QDh740E0D6LHQtNPYbNp1bUdO59eM4JrQk0NhKinv7T5sL157FYEo/gF5HVSp8fHHD/nk25FbAXyOLees8qA1A8BrXv6DP/iDrqyVuVCTlEfXwPWgyD/hCU/orDRCeLZx4tcZ4mtk1xrZtVKlWdToVXhu1P+oUqVKU0M8C0YgniceIAJ4oSRNebHBI9jWmfLkk0/uOryydAB+g0Ilag3Yt96v/uqvjm54wxt2UVMo8tT4W93qVqNb3/rWo3333Xd0u9vdrlPjKfWaRIVOBMrAth2lFcRndFS2kxbawYRlpm22HTn++DYUpW22sG4/4NzydGLNflulv7XltC0FYN7xsNZQ4zNq6zgF3jqx0fhtBqVyHS655JKuRUMFKH54SrzWCi0hylZEIC0lPPGsNJT4vQTxNbJrjexaqdIsKvyqPTfqf1SpUqWpIR6k81wDeLYN00A8lf5JT3pSZ5UBl3JGbWX9EGWGWnfhhRd28c1FUDn22GO736kcnHDCCV22D3F3eeR1yKTYC5fo9yoCft9aXwLx6bgaL3wgPssD9uMgPvAdiO/Hmg/It973dGx1LLHcBPrbcJeOg4ISRT7QTnVvs++t5wGdioGwlnzwWjNEW9DCoTyVrbIG8eCdAs/epOz0XTBS60/91E+NfuzHfmzPQHyN7Foju1aqNG1axedG/Y8qVao0lZ0GUBs4CGgD+cSHB446tgJyfniqs98C0TRJgtN4ZS0DvTrABvQpy4CUlz52HDHOdXC9wx3u0HV4ZRcR49RygPuMZzyjs+IIqQikhbU0QinABfHgV6fUQH0f5uNnb33z8ci3KnysO7HctJ77Vr3v226i5rfRctrQk63qnig2vnNc6Szr9xR4AC9T4lWEVGiixGvZUJnS6qFsjNT65Cc/uV
 PhH/WoR3Wj4O4ViK+RXZf4YbNkI7tWqrSKz436H1WqVGlqiAeBQkwG4sE7iGetSVjJFuID8m2YRfMAlTfeYEXWBaKgPVAqx0svVCKfvPBuD3nIQzqwzeBHUcd5xMWfN0/d5wNXqbAuuHfMYN9nCvUP/dAPdZ1vATVPud+A7j7ExzrTqv0B93SSlfsx5xPRJgNCxRufTrTJLdRT6U0TStJ3GT3Wtpyf8pJF9uGJz7EL36nywr6kNQO8KwODafHDi8cv/v5egvga2XW54WNZRnadNq1jh+nqILl6z41V/x9VqlRpARAvMo0RVJ/2tKeNXvCCF3RWGhBvSoXnWWfhAJaBeDkKPIgHs0YbtY7OmAk/mU6apjJIBfGg+453vOPooIMO6mAUvFOtW4hPmEgw3UaWaQd2AuM+/8mf/EkX157aD5bN3/Wud+0A2DZ/8Rd/sdufSoDfULeBsd859nTKjXIfz7p12xFdlRkA910Uf+tHpU/M+RxXjrkdWRa8247yor4nq7Sw1Kj0qIwodxUpx3nMMcd06vvP/MzPdH74eOL3kp2mQlEuP8QvWyjKSdK6dZiuDpKr+9xY5f9RpUqVFgTxAXkKr8FP2pFZMyorLztAj/ouxjLoBqd87Gwzgff45uOZD8zLAJUaT4m/173u1YWaBNfAlkVHjiLeWmESHz5RY2J/adXwrBMVPFFqLNOB1PGxq4BtHXJFxwH1fsdfLra9iDxR+qnd7DvA3Pk49sSQz8itjgWY234gX9Yi0ar3vo8th4VG5eHiiy/ujkelJhF/ZPvR+qFjMYBXudIfAbQ//vGP7yokQkw6PhDv2unbsO5x4gviC+J3Iq1zh+nqIFkQX6lSpTWG+HRsBYLAkOIOcAEkkNepFYiLOAPKqTxUeBANaKnH6YAJkinhMjtN2/GVGh8rjXmKPzuNKDXiyNsPBVsH0NbL3irjLcRHpW9tLPmcaDYZSCmqfZT1KOZy28k1I7zaBiUcPDsnx6W
 CYwRayrjvH/vYx44OPPDAroLidyo+yg2c66jaZoq7DOyVOXgPuAN25RO7USo6otHYZjoUx0LzuMc9rpunxP/ET/xE54nfKyO2FsQXxO9EWucO09VBsiC+UqVKawzxrBhAHsQ/8YlP7KBRGENWGtFiwDXoZk8BmZRtLwb2D5/Brgx2KfUygOfnbpV40Eqxj+ebwmxEV2EmWXn48X3/pS99aUPhzuiriRIT7/qQ5aaNLpMY7rG6tACfDN7BeUJatgNLtTHls14bjtL2nb/IPabWA/z3vOc9u4qPTqpaGW52s5t1fQp43aOmp6Ovck2FJ2WWjr/Kh41J3wR9EVSujjjiiM5KIySnCgSA//Ef//HRQx/60LLTVCr42OZ9um4dpquDZEF8pUqV9gDE56HBSgI0xSA/6aSTOhhl6aCwA1EgKUJKIs2AduBpHoRaTzaf0VxNQSvFmdebAh2Itx/wDuQp3FR51hZKPLWfit6GekzH1FhZ2vCQbW5HVKXo+xyAT1SZfgbh1knYyHjwA/Dt6K4ZICrqfdZjpQHvzjEdeXnalYFzZ0fSCZUX/01velMH+de97nU7W5HBnA4++ODRD/7gD3YKvAqQCpVOrK4J5R28JyKN4eYB/CMe8Yguok9B/P9ONbJrjew6zX26bh2mq4Pkaj43aoTknU07eX3relRaKMSLE0/1ZZvRGZQSL1ODn/WsZ3WRX6jn4J3yzicP4nnFoyTLgXi2EAAvg9pAPLU9EP/N3/zNHcDLgB7QAvSEr4w1plXjsyxA31fQ+5FhbC/hHDeD+DbsZBR334F2nnY5IN+Cv6ll8blrcUgLhAqL7Nwp9spK+VHhU+lRUTKYExsTr3ssTFT9G9/4xp36DuK1WCgvPn4Qf4973GN0wAEHjB70oAcVxPdSjexaI7uuq1Jbiu7OXd9leG7UCMk7l3bq+tb1qLRQiGelSYhJEM0LLwvdyHet06mRV6339Kc/vYP3vve9hfdMKfYZedR6
 HkpgNl5woSztMyBv+wYwArai3lDkA+StNUaOyt7GiM/oreA90XPMtxAPuCnriR7Txn03D8pbVT/gr1NqC/HprBqPO9UkCrxzTflYJmcQJ8t1FKbQA3WZbQa4K3PLWXT47nWsNWDWU57ylA7iKfDUegq8aD53u9vdupFwH/CAB1wL4pWva3vkkUd2FQMPFUl0HWWyFyC+RnatDpcF8QXxe+G5UR2Xp1Phd/r6EvIwA36oVGlXlXidIzMK6F3ucpcOrvfZZ5/RDW5wg06l972BhkA8uwdwj5o8NIprOmgmg1k3OKClxqswGMwpKjwQvc1tbtMpzWDciKf9GO6B6/7orYnDnvj15vO57QSbgZsC7W2OCt+ODut3fhMLTSw3gX8wrwzFene+KjBppVAuzpsSb0ptp7rLLDNaO5SBzrBaPwzk9KIXvahr/WBfCswb4Ioaz0oj69wqg/qHP/zhnQWnhXgtAspAZCGRhoT/lMTj/8Zv/MYuspAE/kW2yctd5cJ1WQeIr5Fdq8NlQXxB/F54blTH5cnTblxf/Qm1ugsJjWEM1kisk4h++CEVikqV5grxQBAQ3v/+9x/d5CY36cBdvtGNbnQtiAfwFOUAfAZ1Au8gPh1ZE40mOZYSUxm4AniRafjiefJVHIScBMlqy/G0JzJNID7w3qrwzge0g/eEqrQskW4Ssz0jtrYdXQPw2XbU/4SyzPrWSwdayyj0UeFjlfGnVUbKJlF5lA3FPRnE+7PrewDcATzLUgvy1HmAT4kH8SpXOrNS4U0PPfTQrsLzwAc+cCY7jc62Hi55+Nvu937v93aePkkF6053utPGA4fVSvjNNEuqOC0rxNfIrssPl8vU4bIgviB+1Z4b1XF5tvtgp65v/3rEcSBQBxaQ9JnDR5J3PLFPIp4F7vWBqFRpKohPiMmAvAzYgTy4po5T4y1jpwm4pxMreHWzpsNrBneKfSbKe7LlRic1mNStb33rrkPr7W9/+w4agTz13
 43uz+APkI6pQ51XA+jpyBqA9ycKxCdWfNZtB5AKjLfx5q3bjrKqYpAKRFT42GyEjwTwjhe866BK0U6kGfDuj2rZWWed1YG5zrsgnlJunlpObddxWIQeFSW1eICv3wCfPFsTiKeci0ajM6s+Cvzw97vf/XbEE+/cKQeBdlGL2HokYwS4Xip4EuuTVgHHne9dZ+W0KIivkV2XGz6WqcPlOnaYrg6S6/3cqI7Ls90HO3V9h65HfPHEOUwixWbju7SKa8kXqlvSOk/8k7ADLsn7uFJB/KYPNRlAgzOK+C1vecsOsgPYX/3VX91BWlT4+ODBKmhnGQHwbQdW9pkMahSgj61GpBsRWh7ykId0arL9gEIgT31mJ/FyAd4U8ISYlKOau8ED87JzAu8GoUrn2HjjM+hTWxkA8QkxmW1l3UB9FHxAGj89BV7ZqW2rxLDQAHhZ5BlKfEZcBewAnkUmEC90Z0bFBetGrxU3Xz8BPnYRaZS1qag0AFk8eGUlg3hZud3nPvdZWMfW7AMAqKg4J0mT5fd93/d1rSqOxf3RPjRdH+taLvEpuk4VinLvQPwqdsBepQ7T1UFyvZ8b1dKys5X1eV+P/K+0nuedN+6aES8lYM+2K4luh5skEe6wikRYXObKt/N2fH3xoxUO5OoIPAeIp8iD9YR9pMAbiEnHVst0bg3EJ7xkQB7AAzKZZUYG8Swbspsvy/MZxFORhUmkKt/5znfu1H8Qn0g4bubEeY83Xk4Emdb+MgTxfTU+EJ/fBuJNQXui2li33YecAaLig6dwqTGr0ABY8J4cHxxwb+E9AG8QJ3H3WWpkZeHzySef3Cne4F0+6qijupjwwF2nVsp7AN7ngw46qPO6r1p0Gh2XVU60MEiugZaeu9/97hvrCbGp7CS+fuVvWhBfEL+ol3uN7FoQX/+jgvjtXI/AKkbJ+y/91iYBYkm/N4wiacnHJdJzn/vcDcXeOrHGLhLgHQOO0xJILHDehANTny33vf
 UK5GeEePAUGw0lHLBT4SnjAJ4vmncdxA954uXEggfpCalIcU8zES9YplRs86CVkgziASmI32+//Tb8+EAeVAfEE5d9aETWgHwgPnaaKPGJbBNPfOw5fSXeNqxvuwC+jVhjPgDvPKnwmrsC7omVT4FPxBk1ZpnqDNR53IXsBO5Ud3YZ5eB70Wn8CXnPRaWR2ZfYacSXjwIfNV7fBfne9773tSxRqxpiUtmzZGU9HX/1D5CU/a1udasu7KbtqSS123N9lb1rWRBfEL9T92mN7FoQX/+jgvh5XQ+MInnvCXyRY5w1iUDnPyzhiESjM8p7vPYE192AZWKAZ4n9AnbvZ9wibDah1tRny31vPetXh98ZlfiEmYwKr7NposbIPPFf8zVf04WdjPre+uIp0lHe+d0p7fG+A3lTKr2bFdx/8IMf7BRpyjKvt7CJgJ79goXn+te/fjdPiaXA845lVNV40+OFjx+egt42FSdOfEZ+9ZKyjUSnSWx408y3/vjWPmN9AO+4nZfzZp1xfLztbDMqNio4UeHZaCjvAB28A3fALrPQaB4D7aw0fHAyKw14V1niP9ehlQ+enUYznMpOIJ4KD+J1Rl0HiJ9mPfdTu56KlgpOlA0PKvev+yrrsSupaEo8/HKlgo9p778a2bUgvv5HBfHzvh7xynMgeJ/NM+GbHA/GSEVd/7ok3DJPBd4+ROfBZgJxbCYg+D4RCSvG/hwg3qBCQD7hJcVyB9XXu971OjsNaO8P7sQLH4insoMssAvgZcsTLx5gaeoB8WAUqLqIlHdAysJjn9/5nd/ZAS2YBvGxt0Qhz2iuID0dUROhpu3M2u+UGkuMaQvpmQf6iWCTePCystKS4Fzjd6e2B+Ljg6fKg3g1a5DPOkN9p8TLYB5QAnhNYWBeFhdehBoAz0YjnCQIVdGhxPPEq+j4rNxEEoonfq9B/FbraZ5T4dJSkvV0GlZ5ksC8SukjH/nI7rOK5eGHH75xHfu5UkF81qu
 RXQviozTWyK7rC/HTXt95XQ/WGswiCXaxk9cMHyVxBSSx6CZhnKRJWulioXH+WMw7VrngR30AnZsyMPXZct9bz/p+V9aaGSA+GbSDdxnUg8MAPtsNJT6dWjOwk3nWEhaa1vsu2z6YAvXWSSx1yj1oBaNPetKTOohPpBWdXdl5ZFYe29HsBMwD4gnzCLbbwaCSo763HVIBueamHJOXFjj3nSgsPvuuBXzLrAveY6Fpw2mmQiNT4qnz4J0Knwg0wjLGVgPghdZUG1YhUp687wBTBu+avsSAV6ExKqvyUDagnXVGGVHhKfNq0tT4GrF1tvWixru3PKzYl1QmXZ9knwvkC+JXTaktRXfnrm+N7Lre98Es13cnrkfra9dSv1upbaUG2knsrElYJokdpq38KGfL9BUMwCvzoWR5QN76fuf3ZauZAuJbAATqCpRnyTSfMw8qea0oyMImafqRKctsIYl7nsxKQoVul/nMD25AKT57EA+eKKPXve51Oyg1GmnUZZFdQDWvvMGKeOrB+e1ud7uu8y2gB+xg101Gqfc54Gt98B5bigoFKGe90LkyUA+iHUfg3p9G7dS6WhNy/CoiKiFubtaY2IoC660KD+K1OPC7G7gpCjxYtz+hJc2LxvPkJz+5a5U47LDDOnAE7GrELDQ+q/AYoMln1hHnYjmgL4ifD2ypiLm/MyiXDOQL4gviC+IL4ltQqZFd1/c+2I3rO+310EossRerKCwitceq318qPBhEEloaB1LTWZK5BJxrWxEYSr63HmcBAdbvC+KngHiA3kJg4sUDeR1Mgbapz8Ba+Mlv+7Zv62wvlPLv/u7v7lRzQH2HO9yhA3D5rne96+gHfuAHugtpPcoyqNZR03csO2AUwOrEyC4CvIEpWLUNxwKkKeCaWlh1AH1uZDCdCDKxuKTzK3AG2QkHqeIBnBPi8glPeEIX+QWgA/V488V+p8w7HscRn7/jd64Jpanj77777rsRUpIVSQuGYwDy
 LEjKEdiz1AjRqQzB+uMf//iu1UNWHioyfn/DG95wdMghh3T71clYtCBQD9hvfvObd1n5UOFve9vbdmU+brAn+1WRSE0foGoxSFIOzispVqIkD5nEsS2IL4gviC+IL4j/v6lGdl3v+2A3ru+s1wPgYhsJm0wy6OJuQj5R03lhKMzjXGMPGpd8bz28g0H8ftGRdVYK4ltPPLW9Vd+jwAfoQaqOp6LWyOZ1ggXroJcNhOINNME7KOZ1EoGG8m59U2BqhFBwb546DWzB6n3ve9/u96b2D2ip/m5etTTQno6oLDXJsdm04SOtB+I1S6kAtFaffva99QC9zNsWFd6fRdQdlYl00GWrSb8Amb0m4SbVUgG0jq06taqksM+wyjjP9ANQXnxvsnNnlXHubDLKSaVB+ShHrQzU+XxWdipNfSVeq4Jrq5WBzzsQ/7CHPayrIKSG6/roPJzPrqfKWD7rPa5vQj7nPsj20n8inw3atf/++298dk8ceOCB/+uhme913FWRyv2ogmVZ1lMB03rRKhHyuIewZa51tqf1hfc9600CJwXxBfEF8QXxNbLr3ob43RzZdTvXwzaIiBJRcxnAV6WEJcaxxSaz1TmmT4H1/c7v93Jlc1sQ347YGoiPKk9ZVlMC7WBcSEidTynSgBNsH3rooV0nTBAKSIEp4Hz0ox+9obBbH+CxhZhSkn1HjdaZ03IAD1qp/eCQ9UXNNZ73dD5NtJlAvc8B+USeAXOB+L5n3+fEsg+4J6JOYF62TMddUBg7DYgHfQkzSX03QBXVHbTzvifqjNYE/nexz4EqkE8G9oAb2KvsqAQpQ+UgU91ly0wTatL8bttpNJclZSTbJBYkFYgkDxV9B8Y9NBN2M/cjK5OWkaznflA2SSqDznnc9tyPKjXZXlo2sp57uF3/G77hGzZGnJWEsLzZzW7W3btAPrk88QXxs0B8jey63hBfI7uuN8Tv5siu80j64HkHS7a9yDJWOXEs7M/TKPGEY
 r+j4nNW9JmjIH6Tmzng3sI86JH7EC90H5Dnj5epuawxBuoBl/Fvg0vzwBScgTKKs/VUAgA6NZpNhaIbaAOxgEzFgD+eOsxmA7q9ZAwU1IaaTHQZwB6ITwdYy1uIz0BTgXdwbnli2Fsvy73UWjW+hXgAr2MvgA+8p+MqYAd/GXE1UWj434888sjuPE0B/BFHHNGVzWMe85jOzqOsVIRkMOm8VXIyuJPKjfLilc8IripG5Ymffj1Ndu1Q1u4dKkAGM2tzpYL4adarkV3XH+IrFOV6Q/yyhqLcLEWJxxPte25RSvy0nnjrCywSe7SywV4EPsJBxunJM3Y306JGnp24Y2s6kbYAD9qT1ZBkHm4gD7qBpEI3pZqDy3wGluwygU8ZrFKh2UF4u6mg1GXrUaGtC+pZLAKrtsP3DfxdSMDOp91aaRIesj9NNBoe91hiAvCyZaai6gDzRNNRi02kmozMyj/uJccPzzoD3sVVZbegvFPbEypSzmirwN0IrCwZvP+gXauCcsiIpVoptFjoL5CWDJUflRmZ+hxgVzEC9bJWDL8x1WKSa1gQX4M9FcQv9j6tkV0L4uv/VhC/DNeDhXY3YR7Q2hdRLHC+VXQaAU2sRwTFbRnkMyO78v1rIcR9IgACZa3j6VzrNxnBdqcAflEjz84M8QodtAP4dGw1D7yjxPNnA+2EQkw0FZmCTlWmJAN4sMkqA1qBqkqACgFgtYzNBKSy3oDbxEGP/Yb/O51OFVxGXR0X9908gAfj1PUW4mOlobabUtjBuXnbT7Qa8B4LDgWeF56NhjKbAZ1En3GjirYD3nn3ZRAP4GVKfCCe+u78YqFx7iouIB2gx04D3JVT+xnYp0KUsoklSStKKfEF8QXxy3Gf1siuBfH1fyuIX4brkZFaMZHoeO2ynUgq9+m/iKEwymZx4gPwhElCKVcDbtxqZFewHHDHdmkxx2SEWQkfEhy2ez6LHHl2KjtNItAECBNqEsQDeJ0cRUThUQ
 fVpiwvosgAbVNKPBjlh9cp0zyFPX5uFhGwn9FggTlVWqdGKjVFWgWAGg90gaoKgn1R79lYRKZRUwPygN05AHYgn5yINIFxlpjYaVpPvIvtxtFp1eeo84F9+xPRJYP+UODVBBNGUqdV3nd/jsSAN58Y8OCdtcagVc4RyLPSaHkA8Jq90ocgPnfeb+o8iI8aH6C3rvUyAJSbpyC+IL4gfvnu0xrZtSC+/m8F8ct0PbCTZEwUjLNTEJ848SBaMBCAvpkwgPOwFaFyuyO72i+hQRIIg/AqCfWd6HsGB51UgV/0yLNTD/bURqVplfhMxWrfb7/9OosLqAb1gB6Q866DbWAps8RQjAEnpRmYAlfwyRNvW2w4lHvQHw+4C6pyIO47cPW9DrQi4xggSidSzStuELCeEVXlqO+xwbS+9kA8cM/AVIH1VqEH9W486jxfFmhP2EhZ9BlKvOWip4B38ePVKkE8Jd7Iq8BdZqsRVhK8q6gAd5UX9iKfzTvPxH1XVmBeDVaZ8Monko11QXwsNcqrBnsqiC+IX777tEZ2LYgP1NTIrusL8Ysa2XU7yXYDtFwD7WBN2039EVsJrURSwiZhF8dR300POOCAzsFAKMWMOzmyK5t14F4fRa4LSRRB/Swldphx57GokWdngvj44eODD8BT4sWIT6dUUzAvuyC+08E1EWeSA6PxxIN5v2HNsR3ArnMsWw7l3ro6tloGZlUGfLYPv6V0O3YQHwW+BfhWfW9DRfYhvu3YCt753t1srDMAXmQVsA7aqe4JcQjkAbzlBq4SYQXIR5V3U0Z9P/roozt4ZxeSWYoSctO87Bx9tjydgsG581Ye+gz4LqElrZMOr6n0FMQXxBfEl1Jbiu5yXd8a2XW974NlGdl1OwnY8p9LQHteIB8FG9BiNRxGHM1I9wKD4CzlIJz0bo/smvLUHzLKPA5T/lHxQf2iR56dCOITlSYHmQ6slidqDYgH3dT3ADbIFFvcun6T9VllLKc
 oA3pKMfUYsPKEs5VQ860L+qnxwN08RT6eccozi45liWBj3xRrN7kwkgBeOCIZwLfRZIYgPh1Z5USg4Zl3cyW7sdhoeJwSQrJV340gy3elScrNB+IzPfnkkzslXk0PxLPPZERWEM8WA9rjg9dSkag+FHrWmUScUblhJQrUmzr3RLBJB9gasbUgvmCwIL4gfvmub43sunz3Qb+FbNK8qOu7m9cDWEv2J2jHdlLrJcdOovIlAhzRNdZnoitOXIaRXXPMEpcFELcfvvdFHd/EHVv7GcyLQsPywiqT0UPBt06tYJ7yDu4zQFQGA6KY+x17DRUeaAJOPvD438Fp7DfsMtYD8WBV8w47iosO4K0L5kEtkDcF2iD+qquu6m6CDNLUet6BfEJHZjmlneJu3jLrgHm1Q+Ce8JE+88Cz06QTK3gH7qLRCCcpB94p8SeddFIXpUZLQQvxRmQVWpN1BoCrsCTKDIgH8FR2LRWsM1HoM+hTlPr442Wf/Sax9wviC+ILBgviC+KX6/rWyK7Ldx/0+6psB+KXeWTX7SrpxnGROByU16zbGRfVReVE30Ygjy2niScv/jwLs+194Qtf6NhMso9J5hPVZrN5Qq3tO/ev+7qvW9jIsxNBfBsbvh25laIOrAG7g5LNB+ozmms/mg1lXma/AZhAk3IM4oEssI3PnaJvHbCqckBtFkoSALOhgHveeGq1dWPnUcAKiTc+UWRkgN4q7VHkY5cRiSYQn1jx5tVA2Wd0gjA999xzO4iPbUbTCu/U6aef3tlm1NLc5IF4gzuB9wzuJKykiggVHsBrWdAfgB2GAq9zrxYKEE9dZ5nJgE/xyrPQRHH3u8wH4mO/UYstiC+ILxgsiC+IX77nUI3sujyphW5ckGhzxLo2e++LkgIU12Fk1+0kdh9lIeGqaSPbjIuvbhAnIA/iZxnZFWepEHz+85+fGeK1OGR5fx5j2r5WmwQOWcTIs1N54hOhJh54EE/5Bu4+t9Ae2LfM
 d8Ce+g7u25CVoBusx9cts8lQ13WIpbyzkej4aVlGeY1CzXaSCCysNPHj24bCdRO0Pnd+9uTAesJIxi5jvUA8ZV7oSOp7MhUexLPRuNjgXabC6+EM4GUw70/OQpM48UJNmheZRmXENLHh+fnTsdU5sAhR46nvvgPwGQTKZ+euvNqRWpVPyiadX0UEKogviC8YXJ371HMoUy83LYse9loC/bad2g7wqJFdV/M5VCO7Lk9q7S9AHmi1Y8K0Y8PwaoP5dRvZdTvJ8wovSbhrkkTgHLLmbGdkV+v73U5WbnJ87gdMvKjjmxriqe2AHMBT0h184N00gE9pB+gsMcJLgn0AH+88S03A35TKToFmE+ENB6IUdoo0SKVC89hT3FlnLLM+0NXZAhCrDAB9HnodY0WpcYOryalVZzRV2Y0WaE+0mcwH4mOvyY3pd+eff343mBM/vBp5wJ2FRidWyjtwd2PKxx9/fAft6dAK2p1fjp+Cng6poFvozLQwAHi2GhUU37PdsBuBdwq7sgL8poleA+RtJwAfu1EqTaYF8QXxBfHLfZ96VoBgzx7ASxEE6jp8eWFTB/3elG3P89FnrX6at2tk13oOVUvL9GnSjqjubfe4e30vhqLcqgwlboNEtWGLmaVVZLsju6qMUfR3YuTURO1h1SYgT3t8C1HiASAwB+SgG4iD9lhtgDx4B+nWsw6wBt7xyPuN36sEmG899uwjlGV2GjAK0P3WPFj1PYDnk09MdJ8NluSGsRz0g3+VDS83CvtHP/rRrqCBOAiXA/JR5QPwiQffxon3XSLS+B2AZ6VRSQDxLDQAPqEkQXv87+LAeyGz0ohIA9Cp6ACdTaiNkW8epFsnn63nfKwT3zzoT5lQ8EE8C00qBtme7StDZVQQXxBfEL8a96nnkFY+So0XUI3sWs+h+l/urhq/VUhIFWwgT5EviN+8LCV2aMn/vfWAszsblHMozTKya2w340Z2nXXk1KxHhE1iq571+Ky/q
 574/oitwJ1tRUxPlhdQrsNqlHqADtop8FRx8K2mosOp3wB7v7edKPIpAJYb9hEqMl+834FZsJrRSKnQGaU0I79Soa0fyLddMAyu1Xjib4sSD+R9Nk0Yo9ZG0+/kat3YadrRWNsoNNT2F77whR20q1jwvCd+uxzLSwZiilpumamWBsudD3invqvUOMdE8Wk7tlL0Qb1WC7Du/JVVBnvyWT8B16ggviC+IH617lOd8j2TrFcju9ZzqP6Xy5P2+qBQ06aAKpDGLJIOp7hpHMRPO7JrADkju7I4H3LIITOPnJpjdoyJmZ+BofrHx1496fFZz/nsanSaVolvI8wAcaAOzk0N7JS48LJlrBzgE0haj7IOLsG971ly2vCTwN46vo+dRsHzzFOZAS6QBejAXqUA1ANiIAt2wbAOt5ZZT+dVMN762lsol6nt6dSaDq2tzSZhJTVn68jKw8VGc8YZZ4xe+cpXdjU/9pmAPNVdlJ0o5CoZgXgXMZ8BeOwv5p0nhd1nUz54lhplpix0gNWxFcRn6nvlqwKjjEB87EjKX0UoEF9wVBBfEL/8Snw88QZCocpPep+Ku6yZX26TwUooUioGhAEvJUo/NWm37SzL2EGynkMF8QXxu5Oo7xIBVZ8CaagiP8vIrljT8xIHzTJyagZzYpXGe5IAKEOgneNjX9Q3cpLj8731nM+uxolvPfHxv/sM5FlmQDPrhgIE4MAasIsF77PvqeMZtCmKOuC2DRCfUJSx16gQgNJEr8mopaCVrUQH0ac+9amd5x4YA2WQa105lQv7cTFipdEpNeEiZS8xkWfYZUB8O8hTC++m1mOjSTjJRKDJgE6JRMNS03rfAXkU9gzEFOsLABdVRvaZt9X5edGy1ujE6sI7N2WXjq3pCGybgF1oT+UdX7xl6R+gHKtja0F8Qfxq3qdC5XpOTXqfehElisK45MWlqdkI10Da9j232hfZTl/fZeqQtwrPoRrZdb0hfhVHdt1OijVZwn
 Mf//jHr/X9uJFdrYtt+iO7epa1I7v6Do996lOf2nTkVM8/+xG9MJ52VpytWghzfDjRtrc6Pst9b71dH7G1BcCEiqSgJ/Y79Vw0FQeuYKjPYDV2GN8BfJ0sLcsgRXxS1HMqMnW/jWxjP/bhOxCr5gRebZ/SrFMByPV9IrFYT4XBPqjQWgrsH2gDcfHi+xAfVT5RaajwUcLSETYx4dXw1NB47dOJlQKf0VhN+eDZaFQwRNTRGVVn1VhfVEYcp2PUucG8MgDpGbRKjk2I/z0tECw5sdPw1afVwboqTiBfBUF5RPFPx1bXLNex4KggviB+NZT4rKf/zTT3qbBqUrazVaLU24ekdVErouTFOu/yr0ra9OvVyK7rDfHrMLLrNMlxEEUjGHj+sNdJhMoc46wju4Jn6w2da+tNF5CEu8F1AvvTCBhAX2hxZbzZ8Zn6bLnvrddvAdhViAfboB1gt1FpEiOe8g4qwSTwpjhThQPXYDUACzAp0aZgm5qcKDYBep1hqerAto2L7kJTrG0b0MZWYrvWBfG89yoMQJr9RbOHQhVdhvKUjqogno0mMeGjwFtufeBOfWej0fwTC03CSMpqcOw0IN5ATjq0stRkUCdQrzxUXMC3Sg7QdrzgO+EzfVZW8cpbH5QDeZWjhI/02fexzrRlmwg1tkGJ1wriGsVSUy/PgviC+L1xn+qfM23yYsngLZqJPT+jnHnRrev1nedInTWya43supdHdt1O4oKIH50w2o7syoKiBXGrkV3zu3EpHVEPPPDATqiY1pueqDmxw2x2fKY+Wz6JF3/HID4AmDjxMqAP1Ae6rUN5B5NAmwLNIgJOQSbAjkoMXE2BKjD3HUWZlcY2s0/70OHVBbIeRdporRRuyjRANu97+7Fv2w4cg+KEXtN8I4N4FhsZrFPg3TxqcxmVFcBTp7zIVALUusC8zzqzZlRWSr/t88OD+Oc///ldNBre+AzsJOJOwFplg58f1INsx6hsLFf5iT0
 oYSZBuuVy269AhYayn1FZnbvKkM+Ue+WhIsPyVHaagviC+NVV4tv1NMtS1yZJ11xzTTfVkjhrijrluUnpysvVS3Odru88R+qskV1rZNca2XV8wmz9fjtDKYNIsauwokw6sutWYsN2Rk61L2KvyGGt8DFu5NntRMWZG8S3NwJFVw4UAu2AfBtthlLPr06Vz2BNwJWaLLOJ+AzIzcux3FgGaAGubSX2vG2DWCo0aw17Cg84QJaBr9+w2PCHm4JY6j41yUsIqLsAgfdYavjhvTw1fUR513HVTQTgQXuyZaLS8J22thrwTvky5c+ixDtHQK2iojKjmVrkGuq8+QzypDKipYHX3XxGb3V+ysvv08E155TwncotlRcVIetZ5gZVbq5Pew0L4gviC+L31n1KSJhnEixAHyPJM8sLlrKkqXgVr+9OjNRZI7vWyK41suv4VoxpEvsJsdQzxrPHc2ZeI7tOE69dYACMKLJOP40beXbe8elngvg2skkgnvKeUVlbiAfvPlsHdFOGKcsiqYi0Ej+3eRBumtFHqc7g00sBuFKeE08+yr+oN5R16wd8Kf4qCT77PQUexLPhaBWg4tsm1dwfjuoerztFPjHjTaO+U90D7GA9kWhaiDcF8Gw0RmU1sNNxxx3XKWVgnqXGeauUAHmtB+DdenzzfP2pjAB8WVlk5FbnmMGgqPjKI/53FSQKu3OkwseKQ9VnJVIG7UBcBfEF8QXx66HES55hOs8DB6kd4XVomtBohIUMwDKvxItrLA4vK62AnqU8rsB+SN1axuu7EyN11siuNbJrjew6/8Siwtqc5yB1ezdGdr300kuX8/82yUsrMeCddEJMgnUDO5lSexNVJpabKMCBSH53D/ijjjqqU6K9gEAqGwyVnPIM9qnIbCTg1nI2E1572852QT0VGrgCY/Crc8LRRx/dbYfNhA8c5Ao1CXgBrhYBKrw/lVjvasr88V424F2Ts8/+cJR4nwPzKgCmrDSgHsCr
 GYJ31hlKF/UdxLPRmPoM1oG7bN7oso5Tc5LKh2MH7M4V9DtvFRvTjMYqKyswr7UBzDsfZaGCwm6TOPPO1bIMpJVKV0F8QXxB/Orcp4FvwoL1DCg3tJ5nme9N83ncVNZSqUUSgHj+eIF5CYLueSUWHh1rjaKt5VNnLn2IqGlUqGW8vosaqbNCUa5mqpFdZ0sEBK6C2Py2m8SCVzmxXR1cd3JkV/y2shBPcU8GjnKinUSJ9zkRZkBj/PIB+QCniDEuIiAFrzqnyiAcgAPYZIo0pT2jt+rkGnuNzq/gPh05bY//3voqC2wmUeK1BoB+CjZ4Z50B8zqvUt1dHBAdkNZjmepOTdcxVaVD51XQ7rtEpwHvPO+Ud9ugcvm99fMdoPcZuLPPqLxYD8Q7XudoqnID9B0D9T4x4s37XmVHNB5WIrDOKnTTm960K1PlkDjzGTEXxAfgY3+qjq0F8QXxq3GfBr49o6ynJXCcYt+OIjjJeUhefEKqCXcG4A2KQtUSw1kULsA9r2RbXrYUSxHAch4ZyGpZIG8RI3UWxK+2Gl8ju46W4tmsIoWNAuei02w2cqrvpx3ZNfHtVxLi28GeQKKczqxyFPLAfavAt0Af6AegoJzfG6SC+dhKokrHfmNKhXaBdHjlBQfzCXNJbU8HWb9NSEYKPjUevPsezFuX9QW488FrknHRqdsqFkCYJQeIU9Jtj1rlcyLR6MDqwvO4gXzWGNMW4inwwB7Ayz5Hgbeeddh/VF7Aun2AeNm5q9BYnrLxPYi33LHqsKpCokIE4Hnf2WxUaMC7comtKdeiIL4gviB+te5TQoGIWVutBxCkrV4yrU3n7LPP7nydbYQELy0Kug76H/7wh0eXXXZZ9zziP6UyTutjHUpeiDkPYoiWV2WhRYDfdBWAoZ5XBfGLul9W/Xp4hsy7Y+1mI7t6lvVHTg3AZ2RXrgruhqGRXbUk4jK/XdaoPlNFpwnEU3qpwODRvBxwBJGx2WSE13R4z
 fIWKAPOVGawzj8OboFrBjYCtOBcwYJVgG6fQD7Ra9h1/D6DILlA1HgqPI84oGc1cYFVDijh9uUzEAbGfgOObT8dRb3EwDsrDf+7TqusMzKAB+aUMGp7/PAA33fp3Oo7nynzpiA+I6+CdJ1y04oQG5GpMpGt47u0YGRAJ5UTHX1VUJSx8k1kH2XcDqBVdpqC+IL41bpPvTQoRCItbOWdlzxHpnnxUci9VMe9nCwXgYZCBbI9f6iKl1xySWc3nHXwmPb6igyhLLSOZjh2Q5hrFSiIL4gviF+f60EYYLGbdxoa2RUnbhapRz9CLY5cG7OM7LrySjyApobrdEoZB9BAmP88YJ+RWKMGm89ncKkSAEL9ju0FmHqIU5zBO8jW8RMIs5l4QSUso86eoFbtSaUgLQGORYdOdhPWEvMg3jQjyeoYqyXAvu3XNPYd2xZ3XaXCsbC/uOmAugzIqe5APSEkQbnPJ510UrdMNu+4wTrAp8I7B+DPnkONB/Gy83K+vP0qL2BetswxqMA431RgZDcgBcs0o+G6Hspc2Ud9b8u8IL4gviB+9e5TVhfRsfg9t7pP2WMmVeKTKP1erpPCOHCneBE0vNi0aga+L7/88omanLe6vioObIsS4SOjOlLb6nlVI7vudYjfayO7bpW2GtkVs1LfiZucGRiuHdkVK1Ho+4p9f2TXeY2wujQQD9ZBPHCn/gJ5dpVWoW+jygQsZduwPmUHeANrNhEKNHjXfAHggXPCNQLfNnoLyDW1T/uK+mxb1GmQ6wIC97vd7W4bsdgdn2O2nt8CfNuyb9sXRcYUSKcDbgCbBcbnhJFMHHjQTn1302S53wbik9PJ1TyAp7KrKATiM7Vv+6LUq1DE7w7Y2ZC0Emg5yEBZzjE2o3Qo7lecCuIL4gviV/M+9VJhf9lKiQ8Ae854WU+aDHbi5T3NaIUtUBh/Q9IJVyd8SVCAwHf/+k1zfVUYqPSSZ6agAhKrzyJAo0Z2rZFdF3m/rO

<TRUNCATED>


[26/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1239] DML Language Reference write description parameter

Posted by de...@apache.org.
[SYSTEMML-1239] DML Language Reference write description parameter

Give example of write description parameter in DML Language Reference.
Describe author and created parameters.
Update metadata examples in DML Language Reference and Beginner's Guide
to DML and PyDML.

Closes #385.
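
For illustration, a minimal DML sketch of the behavior this change documents (the file name and matrix values are made up for the example; the author and created fields are filled in by SystemML itself when the metadata file is written):

	A = matrix("1 2 3 4", rows=2, cols=2)
	write(A, "mymatrix.csv", format="csv", description="my matrix")
	# mymatrix.csv.mtd now carries "description", plus the generated "author" and "created"
	B = read("mymatrix.csv")   # format and dimensions are picked up from mymatrix.csv.mtd
	print(sum(B))              # 10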


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/452a41a0
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/452a41a0
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/452a41a0

Branch: refs/heads/gh-pages
Commit: 452a41a0257956b3b0d20cf894f37f2d3e20a90f
Parents: 1695060
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Mon Feb 13 12:44:10 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Mon Feb 13 12:44:10 2017 -0800

----------------------------------------------------------------------
 beginners-guide-to-dml-and-pydml.md | 59 +++++++++++++++++---------------
 dml-language-reference.md           | 56 +++++++++++++++++++++++-------
 2 files changed, 74 insertions(+), 41 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/452a41a0/beginners-guide-to-dml-and-pydml.md
----------------------------------------------------------------------
diff --git a/beginners-guide-to-dml-and-pydml.md b/beginners-guide-to-dml-and-pydml.md
index 67f1c96..8479ef7 100644
--- a/beginners-guide-to-dml-and-pydml.md
+++ b/beginners-guide-to-dml-and-pydml.md
@@ -283,14 +283,15 @@ is 0-based.*
 </div>
 
 <div data-lang="m.txt.mtd" markdown="1">
-	{ 
-	    "data_type": "matrix"
-	    ,"value_type": "double"
-	    ,"rows": 4
-	    ,"cols": 3
-	    ,"nnz": 6
-	    ,"format": "text"
-	    ,"description": { "author": "SystemML" } 
+	{
+	    "data_type": "matrix",
+	    "value_type": "double",
+	    "rows": 4,
+	    "cols": 3,
+	    "nnz": 6,
+	    "format": "text",
+	    "author": "SystemML",
+	    "created": "2017-01-01 00:00:01 PST"
 	}
 </div>
 
@@ -313,16 +314,17 @@ is 0-based.*
 </div>
 
 <div data-lang="m.csv.mtd" markdown="1">
-	{ 
-	    "data_type": "matrix"
-	    ,"value_type": "double"
-	    ,"rows": 4
-	    ,"cols": 3
-	    ,"nnz": 6
-	    ,"format": "csv"
-	    ,"header": false
-	    ,"sep": ","
-	    ,"description": { "author": "SystemML" } 
+	{
+	    "data_type": "matrix",
+	    "value_type": "double",
+	    "rows": 4,
+	    "cols": 3,
+	    "nnz": 6,
+	    "format": "csv",
+	    "header": false,
+	    "sep": ",",
+	    "author": "SystemML",
+	    "created": "2017-01-01 00:00:01 PST"
 	}
 </div>
 
@@ -331,16 +333,17 @@ is 0-based.*
 </div>
 
 <div data-lang="m.binary.mtd" markdown="1">
-	{ 
-	    "data_type": "matrix"
-	    ,"value_type": "double"
-	    ,"rows": 4
-	    ,"cols": 3
-	    ,"rows_in_block": 1000
-	    ,"cols_in_block": 1000
-	    ,"nnz": 6
-	    ,"format": "binary"
-	    ,"description": { "author": "SystemML" } 
+	{
+	    "data_type": "matrix",
+	    "value_type": "double",
+	    "rows": 4,
+	    "cols": 3,
+	    "rows_in_block": 1000,
+	    "cols_in_block": 1000,
+	    "nnz": 6,
+	    "format": "binary",
+	    "author": "SystemML",
+	    "created": "2017-01-01 00:00:01 PST"
 	}
 </div>
 

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/452a41a0/dml-language-reference.md
----------------------------------------------------------------------
diff --git a/dml-language-reference.md b/dml-language-reference.md
index 05625fd..22ec0d9 100644
--- a/dml-language-reference.md
+++ b/dml-language-reference.md
@@ -933,7 +933,8 @@ Below, we have examples of this matrix in the CSV, Matrix Market, IJV, and Binar
 	    "format": "csv",
 	    "header": false,
 	    "sep": ",",
-	    "description": { "author": "SystemML" }
+	    "author": "SystemML",
+	    "created": "2017-01-01 00:00:01 PST"
 	}
 </div>
 
@@ -965,7 +966,8 @@ Below, we have examples of this matrix in the CSV, Matrix Market, IJV, and Binar
 	    "cols": 3,
 	    "nnz": 6,
 	    "format": "text",
-	    "description": { "author": "SystemML" }
+	    "author": "SystemML",
+	    "created": "2017-01-01 00:00:01 PST"
 	}
 </div>
 
@@ -983,7 +985,8 @@ Below, we have examples of this matrix in the CSV, Matrix Market, IJV, and Binar
 	    "cols_in_block": 1000,
 	    "nnz": 6,
 	    "format": "binary",
-	    "description": { "author": "SystemML" }
+	    "author": "SystemML",
+	    "created": "2017-01-01 00:00:01 PST"
 	}
 </div>
 
@@ -992,12 +995,13 @@ Below, we have examples of this matrix in the CSV, Matrix Market, IJV, and Binar
 As another example, here we see the content of the MTD file `scalar.mtd` associated with a scalar data file `scalar`
 that contains the scalar value 2.0.
 
-    {
-        "data_type": "scalar",
-        "value_type": "double",
-        "format": "text",
-        "description": { "author": "SystemML" }
-    }
+	{
+	    "data_type": "scalar",
+	    "value_type": "double",
+	    "format": "text",
+	    "author": "SystemML",
+	    "created": "2017-01-01 00:00:01 PST"
+	}
 
 
 Metadata is represented as an MTD file that contains a single JSON object with the attributes described below.
@@ -1015,6 +1019,8 @@ Parameter Name | Description | Optional | Permissible values | Data type valid f
 `nnz` | Number of non-zero values | Yes | any integer &gt; `0` | `matrix`
 `format` | Data file format | Yes. Default value is `text` | `csv`, `mm`, `text`, `binary` | `matrix`, `scalar`. Formats `csv` and `mm` are applicable only to matrices
 `description` | Description of the data | Yes | Any valid JSON string or object | `matrix`, `scalar`
+`author` | User that created the metadata file, defaults to `SystemML` | N/A | N/A | N/A
+`created` | Date/time when metadata file was written | N/A | N/A | N/A
 
 
 In addition, when reading or writing CSV files, the metadata may contain one or more of the following five attributes.
@@ -1126,7 +1132,8 @@ Example content of `out/file.ijv.mtd`:
         "cols": 8,
         "nnz": 4,
         "format": "text",
-        "description": { "author": "SystemML" }
+        "author": "SystemML",
+        "created": "2017-01-01 00:00:01 PST"
     }
 
 Write `V` to `out/file` in `binary` format:
@@ -1144,7 +1151,8 @@ Example content of `out/file.mtd`:
         "rows_in_block": 1000,
         "cols_in_block": 1000,
         "format": "binary",
-        "description": { "author": "SystemML" }
+        "author": "SystemML",
+        "created": "2017-01-01 00:00:01 PST"
     }
 
 Write `V` to `n.csv` in `csv` format with column headers, `";"` as delimiter, and zero values are not written.
@@ -1162,7 +1170,8 @@ Example content of `n.csv.mtd`:
         "format": "csv",
         "header": true,
         "sep": ";",
-        "description": { "author": "SystemML" }
+        "author": "SystemML",
+        "created": "2017-01-01 00:00:01 PST"
     }
 
 Write `x` integer value to file `out/scalar_i`
@@ -1175,7 +1184,8 @@ Example content of `out/scalar_i.mtd`:
         "data_type": "scalar",
         "value_type": "int",
         "format": "text",
-        "description": { "author": "SystemML" }
+        "author": "SystemML",
+        "created": "2017-01-01 00:00:01 PST"
     }
 
 Unlike `read`, the `write` function does not need a constant string expression, so the following example will work:
@@ -1186,6 +1196,26 @@ Unlike `read`, the `write` function does not need a constant string expression,
     file = "A" + i + ".mtx";
     write(A, dir + file, format="csv");
 
+The `description` parameter can be used to attach a description to the metadata:
+
+	A = matrix("1 2 3 4", rows=2, cols=2)
+	write(A, "mymatrix.csv", format="csv", description="my matrix")
+
+This will generate the following `mymatrix.csv.mtd` metadata file:
+
+	{
+	    "data_type": "matrix",
+	    "value_type": "double",
+	    "rows": 2,
+	    "cols": 2,
+	    "nnz": 4,
+	    "format": "csv",
+	    "header": false,
+	    "sep": ",",
+	    "description": "my matrix",
+	    "author": "SystemML",
+	    "created": "2017-01-01 00:00:01 PST"
+	}
 
 ### Data Pre-Processing Built-In Functions
 


[15/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1225] Links to release process

Posted by de...@apache.org.
[SYSTEMML-1225] Links to release process

Add links to release process from overview and header menu.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/fe26aab1
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/fe26aab1
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/fe26aab1

Branch: refs/heads/gh-pages
Commit: fe26aab14834f0d84d357d4c77f0d4f8a80e407e
Parents: 4b899f2
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Wed Feb 1 17:55:26 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Wed Feb 1 17:55:26 2017 -0800

----------------------------------------------------------------------
 _layouts/global.html | 1 +
 index.md             | 1 +
 2 files changed, 2 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/fe26aab1/_layouts/global.html
----------------------------------------------------------------------
diff --git a/_layouts/global.html b/_layouts/global.html
index 9e668a0..5aac166 100644
--- a/_layouts/global.html
+++ b/_layouts/global.html
@@ -70,6 +70,7 @@
                                 <li><a href="contributing-to-systemml.html">Contributing to SystemML</a></li>
                                 <li><a href="engine-dev-guide.html">Engine Developer Guide</a></li>
                                 <li><a href="troubleshooting-guide.html">Troubleshooting Guide</a></li>
+                                <li><a href="release-process.html">Release Process</a></li>
                             </ul>
                         </li>
                         <li class="dropdown">

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/fe26aab1/index.md
----------------------------------------------------------------------
diff --git a/index.md b/index.md
index add9a26..c84e7b7 100644
--- a/index.md
+++ b/index.md
@@ -84,3 +84,4 @@ command-line interface.
 * [Contributing to SystemML](contributing-to-systemml) - Describes ways to contribute to SystemML.
 * [Engine Developer Guide](engine-dev-guide) - Guide for internal SystemML engine development.
 * [Troubleshooting Guide](troubleshooting-guide) - Troubleshoot various issues related to SystemML.
+* [Release Process](release-process) - Description of the SystemML release process.


[21/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1230] Add MLContext info functionality to docs

Posted by de...@apache.org.
[SYSTEMML-1230] Add MLContext info functionality to docs

Add project info functionality to MLContext Programming Guide.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/cb6f8456
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/cb6f8456
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/cb6f8456

Branch: refs/heads/gh-pages
Commit: cb6f8456feb5dd8b664b2551f7db483075b85cf3
Parents: 7283ddc
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Fri Feb 3 15:02:13 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Fri Feb 3 15:02:13 2017 -0800

----------------------------------------------------------------------
 spark-mlcontext-programming-guide.md | 47 +++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/cb6f8456/spark-mlcontext-programming-guide.md
----------------------------------------------------------------------
diff --git a/spark-mlcontext-programming-guide.md b/spark-mlcontext-programming-guide.md
index 8c0a79f..45c0091 100644
--- a/spark-mlcontext-programming-guide.md
+++ b/spark-mlcontext-programming-guide.md
@@ -1636,6 +1636,53 @@ scala> for (i <- 1 to 5) {
 
 </div>
 
+
+## Project Information
+
+SystemML project information such as version and build time can be obtained through the
+MLContext API. The project version can be obtained by `ml.version`. The build time can
+be obtained by `ml.buildTime`. The contents of the project manifest can be displayed
+using `ml.info`. Individual properties can be obtained using the `ml.info.property`
+method, as shown below.
+
+<div class="codetabs">
+
+<div data-lang="Scala" markdown="1">
+{% highlight scala %}
+print(ml.version)
+print(ml.buildTime)
+print(ml.info)
+print(ml.info.property("Main-Class"))
+{% endhighlight %}
+</div>
+
+<div data-lang="Spark Shell" markdown="1">
+{% highlight scala %}
+scala> print(ml.version)
+0.13.0-incubating-SNAPSHOT
+scala> print(ml.buildTime)
+2017-02-03 22:32:43 UTC
+scala> print(ml.info)
+Archiver-Version: Plexus Archiver
+Artifact-Id: systemml
+Build-Jdk: 1.8.0_60
+Build-Time: 2017-02-03 22:32:43 UTC
+Built-By: deroneriksson
+Created-By: Apache Maven 3.3.9
+Group-Id: org.apache.systemml
+Main-Class: org.apache.sysml.api.DMLScript
+Manifest-Version: 1.0
+Version: 0.13.0-incubating-SNAPSHOT
+
+scala> print(ml.info.property("Main-Class"))
+org.apache.sysml.api.DMLScript
+
+{% endhighlight %}
+</div>
+
+</div>
+
+
 ---
 
 # Jupyter (PySpark) Notebook Example - Poisson Nonnegative Matrix Factorization


[32/50] [abbrv] incubator-systemml git commit: [SYSTEMML-871] Remove optional python flag from docs examples

Posted by de...@apache.org.
[SYSTEMML-871] Remove optional python flag from docs examples

Update Beginner's Guide and Engine Dev Guide for python flag.
Fix or remove broken links in DML Language Ref.

Closes #406.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/fd96a3ea
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/fd96a3ea
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/fd96a3ea

Branch: refs/heads/gh-pages
Commit: fd96a3ea9579fa6c1d5d042a95d090b40c9f2eb3
Parents: 5c4e27c
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Tue Feb 28 11:48:06 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Tue Feb 28 11:48:06 2017 -0800

----------------------------------------------------------------------
 beginners-guide-to-dml-and-pydml.md | 18 +++++++++---------
 dml-language-reference.md           |  4 +---
 engine-dev-guide.md                 |  2 +-
 3 files changed, 11 insertions(+), 13 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/fd96a3ea/beginners-guide-to-dml-and-pydml.md
----------------------------------------------------------------------
diff --git a/beginners-guide-to-dml-and-pydml.md b/beginners-guide-to-dml-and-pydml.md
index 8479ef7..e82909d 100644
--- a/beginners-guide-to-dml-and-pydml.md
+++ b/beginners-guide-to-dml-and-pydml.md
@@ -50,13 +50,13 @@ DML and PyDML scripts can be invoked in a variety of ways. Suppose that we have
 
 	print('hello ' + $1)
 
-One way to begin working with SystemML is to [download a standalone distribution of SystemML](http://systemml.apache.org/download.html)
+One way to begin working with SystemML is to [download a binary distribution of SystemML](http://systemml.apache.org/download.html)
 and use the `runStandaloneSystemML.sh` and `runStandaloneSystemML.bat` scripts to run SystemML in standalone
-mode. The name of the DML or PyDML script
-is passed as the first argument to these scripts, along with a variety of arguments.
+mode. The name of the DML or PyDML script is passed as the first argument to these scripts,
+along with a variety of arguments. Note that PyDML invocation can be forced with the addition of a `-python` flag.
 
 	./runStandaloneSystemML.sh hello.dml -args world
-	./runStandaloneSystemML.sh hello.pydml -python -args world
+	./runStandaloneSystemML.sh hello.pydml -args world
 
 
 # Data Types
@@ -778,7 +778,7 @@ for (i in 0:numRowsToPrint-1):
 
 <div data-lang="PyDML Named Arguments and Results" markdown="1">
 	Example #1 Arguments:
-	-f ex.pydml -python -nvargs M=m.csv rowsToPrint=1 colsToPrint=3
+	-f ex.pydml -nvargs M=m.csv rowsToPrint=1 colsToPrint=3
 	
 	Example #1 Results:
 	[0,0]:1.0
@@ -786,7 +786,7 @@ for (i in 0:numRowsToPrint-1):
 	[0,2]:3.0
 	
 	Example #2 Arguments:
-	-f ex.pydml -python -nvargs M=m.csv
+	-f ex.pydml -nvargs M=m.csv
 	
 	Example #2 Results:
 	[0,0]:1.0
@@ -860,7 +860,7 @@ for (i in 0:numRowsToPrint-1):
 
 <div data-lang="PyDML Positional Arguments and Results" markdown="1">
 	Example #1 Arguments:
-	-f ex.pydml -python -args m.csv 1 3
+	-f ex.pydml -args m.csv 1 3
 	
 	Example #1 Results:
 	[0,0]:1.0
@@ -868,7 +868,7 @@ for (i in 0:numRowsToPrint-1):
 	[0,2]:3.0
 	
 	Example #2 Arguments:
-	-f ex.pydml -python -args m.csv
+	-f ex.pydml -args m.csv
 	
 	Example #2 Results:
 	[0,0]:1.0
@@ -885,5 +885,5 @@ for (i in 0:numRowsToPrint-1):
 
 The [Language Reference](dml-language-reference.html) contains highly detailed information regarding DML.
 
-In addition, many excellent examples of DML and PyDML can be found in the [`scripts`](https://github.com/apache/incubator-systemml/tree/master/scripts) directory.
+In addition, many excellent examples can be found in the [`scripts`](https://github.com/apache/incubator-systemml/tree/master/scripts) directory.
 

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/fd96a3ea/dml-language-reference.md
----------------------------------------------------------------------
diff --git a/dml-language-reference.md b/dml-language-reference.md
index fca2b9b..31f7d23 100644
--- a/dml-language-reference.md
+++ b/dml-language-reference.md
@@ -61,8 +61,6 @@ limitations under the License.
     * [Transforming Frames](dml-language-reference.html#transforming-frames)
   * [Modules](dml-language-reference.html#modules)
   * [Reserved Keywords](dml-language-reference.html#reserved-keywords)
-  * [Invocation of SystemML](dml-language-reference.html#invocation-of-systemml)
-  * [MLContext API](dml-language-reference.html#mlcontext-api)
 
 
 ## Introduction
@@ -334,7 +332,7 @@ var is an integer scalar variable. lower, upper, and increment are integer expre
 
 [lower]:[upper] defines a sequence of numbers with increment 1: {lower, lower + 1, lower + 2, …, upper – 1, upper}.
 
-Similarly, seq([lower],[upper],[increment]) defines a sequence of numbers: {lower, lower + increment, lower + 2(increment), … }. For each element in the sequence, var is assigned the value, and statements in the for loop body are executed.
+Similarly, `seq([lower],[upper],[increment])` defines a sequence of numbers: {lower, lower + increment, lower + 2(increment), … }. For each element in the sequence, var is assigned the value, and statements in the for loop body are executed.
 
 The for loop body may contain any sequence of statements. The statements in the for statement body must be surrounded by braces, even if the body only has a single statement.
 

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/fd96a3ea/engine-dev-guide.md
----------------------------------------------------------------------
diff --git a/engine-dev-guide.md b/engine-dev-guide.md
index 8dff7f7..557f864 100644
--- a/engine-dev-guide.md
+++ b/engine-dev-guide.md
@@ -63,7 +63,7 @@ The `DMLScript` class serves as the main entrypoint to SystemML. Executing
 `DMLScript` with no arguments displays usage information. A script file can be specified using the `-f` argument.
 
 In Eclipse, a Debug Configuration can be created with `DMLScript` as the Main class and any arguments specified as
-Program arguments. A PyDML script requires the addition of a `-python` switch.
+Program arguments.
 
 Suppose that we have a `hello.dml` script containing the following:
 


[12/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1195] Improve parfor parameter documentation

Posted by de...@apache.org.
[SYSTEMML-1195] Improve parfor parameter documentation

Add missing parfor log and profile parameters to DML Language Reference.
Add missing parameter values for mode, datapartitioner, and resultmerge
parameters.

Closes #358.
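
As a quick sketch of the parameters being documented (the specific values below are illustrative choices, not defaults; see the grammar and default-value table added in this change), a parfor loop can set them explicitly:

	ms = matrix(0, rows=2, cols=3*10)
	# disable the dependency check, run 4 local workers, raise the log level, enable profiling
	parfor (v in 1:10, check=0, par=4, mode=LOCAL, log=DEBUG, profile=1) {
	    mv = matrix(v, rows=2, cols=3)
	    ms[,(v-1)*3+1:v*3] = mv
	}
	print(sum(ms))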


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/f802be0d
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/f802be0d
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/f802be0d

Branch: refs/heads/gh-pages
Commit: f802be0df69b4fce4a36be16ec7c0a1fafd44573
Parents: 41cb5d7
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Thu Jan 26 00:51:34 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Thu Jan 26 00:51:34 2017 -0800

----------------------------------------------------------------------
 dml-language-reference.md | 118 ++++++++++++++++++++++++-----------------
 1 file changed, 70 insertions(+), 48 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/f802be0d/dml-language-reference.md
----------------------------------------------------------------------
diff --git a/dml-language-reference.md b/dml-language-reference.md
index c828e70..f3fba3b 100644
--- a/dml-language-reference.md
+++ b/dml-language-reference.md
@@ -357,51 +357,73 @@ The syntax and semantics of a `parfor` (parallel `for`) statement are equivalent
 	}
 
 	<parfor_paramslist> ::= <,<parfor_parameter>>*
-	<parfor_parameter> ::= check = <dependency_analysis>
-	||= par = <degree_of_parallelism>
-	||= mode = <execution_mode>
-	||= taskpartitioner = <task_partitioning_algorithm>
-	||= tasksize = <task_size>
-	||= datapartitioner = <data_partitioning_mode>
-	||= resultmerge = <result_merge_mode>
-	||= opt = <optimization_mode>
-
-	<dependency_analysis>         is one of the following tokens: 0 1
-	<degree_of_parallelism>       is an arbitrary integer number
-	<execution_mode>              is one of the following tokens: LOCAL REMOTE_MR
-	<task_partitioning_algorithm> is one of the following tokens: FIXED NAIVE STATIC FACTORING FACTORING_CMIN FACTORING_CMAX
-	<task_size>                   is an arbitrary integer number
-	<data_partitioning_mode>      is one of the following tokens: NONE LOCAL REMOTE_MR
-	<result_merge_mode>           is one of the following tokens: LOCAL_MEM LOCAL_FILE LOCAL_AUTOMATIC REMOTE_MR
-	<optimization_mode>           is one of the following tokens: NONE CONSTRAINED RULEBASED HEURISTIC GREEDY FULL_DP
-
-If any of these parameters is not specified, the following respective defaults are used: `check = 1`, `par = [number of virtual processors on master node]`, `mode = LOCAL`, `taskpartitioner = FIXED`, `tasksize = 1`, `datapartitioner = NONE`, `resultmerge = LOCAL_AUTOMATIC`, `opt = RULEBASED`.
+	<parfor_parameter> ::
+	   = check = <dependency_analysis>
+	|| = par = <degree_of_parallelism>
+	|| = mode = <execution_mode>
+	|| = taskpartitioner = <task_partitioning_algorithm>
+	|| = tasksize = <task_size>
+	|| = datapartitioner = <data_partitioning_mode>
+	|| = resultmerge = <result_merge_mode>
+	|| = opt = <optimization_mode>
+	|| = log = <log_level>
+	|| = profile = <monitor>
+
+	<dependency_analysis>         0 1
+	<degree_of_parallelism>       arbitrary integer number
+	<execution_mode>              LOCAL REMOTE_MR REMOTE_MR_DP REMOTE_SPARK REMOTE_SPARK_DP
+	<task_partitioning_algorithm> FIXED NAIVE STATIC FACTORING FACTORING_CMIN FACTORING_CMAX
+	<task_size>                   arbitrary integer number
+	<data_partitioning_mode>      NONE LOCAL REMOTE_MR REMOTE_SPARK
+	<result_merge_mode>           LOCAL_MEM LOCAL_FILE LOCAL_AUTOMATIC REMOTE_MR REMOTE_SPARK
+	<optimization_mode>           NONE RULEBASED CONSTRAINED HEURISTIC GREEDY FULL_DP
+	<log_level>                   ALL TRACE DEBUG INFO WARN ERROR FATAL OFF
+	<monitor>                     0 1
+
+
+If any of these parameters is not specified, the following respective defaults are used:
+
+**Table 2**: Parfor default parameter values
+
+Parameter Name  | Default Value
+--------------- | -------------
+check           | 1
+par             | [number of virtual processors on master node]
+mode            | LOCAL
+taskpartitioner | FIXED
+tasksize        | 1
+datapartitioner | NONE
+resultmerge     | LOCAL_AUTOMATIC
+opt             | RULEBASED
+log             | INFO
+profile         | 0
+
 
 Of particular note is the `check` parameter. SystemML's `parfor` statement by default (`check = 1`) performs dependency analysis in an
 attempt to guarantee result correctness for parallel execution. For example, the following `parfor` statement is **incorrect** because
-the iterations do not act independently, so they are not parallizable. The iterations incorrectly try to increment the same `sum` variable.
+the iterations do not act independently, so they are not parallelizable. The iterations incorrectly try to increment the same `sum` variable.
 
 	sum = 0
 	parfor(i in 1:3) {
-	    sum = sum + i; # not parallizable - generates error
+	    sum = sum + i; # not parallelizable - generates error
 	}
 	print(sum)
 
 SystemML's `parfor` dependency analysis can occasionally result in false positives, as in the following example. This example creates a 2x30
-matrix. It then utilizes a `parfor` loop to write 10 2x3 matrices into the 2x30 matrix. This `parfor` statement is parallizable and correct,
+matrix. It then utilizes a `parfor` loop to write 10 2x3 matrices into the 2x30 matrix. This `parfor` statement is parallelizable and correct,
 but the dependency analysis generates a false positive dependency error for the variable `ms`.
 
 	ms = matrix(0, rows=2, cols=3*10)
-	parfor (v in 1:10) { # parallizable - false positive
+	parfor (v in 1:10) { # parallelizable - false positive
 	    mv = matrix(v, rows=2, cols=3)
 	    ms[,(v-1)*3+1:v*3] = mv
 	}
 
-If a false positive arises but you are certain that the `parfor` is parallizable, the `parfor` dependency check can be disabled via
+If a false positive arises but you are certain that the `parfor` is parallelizable, the `parfor` dependency check can be disabled via
 the `check = 0` option.
 
 	ms = matrix(0, rows=2, cols=3*10)
-	parfor (v in 1:10, check=0) { # parallizable
+	parfor (v in 1:10, check=0) { # parallelizable
 	    mv = matrix(v, rows=2, cols=3)
 	    ms[,(v-1)*3+1:v*3] = mv
 	}
@@ -437,7 +459,7 @@ The syntax for the UDF function declaration for functions defined in external pa
     implemented in ([userParam=value]*)
 
 
-**Table 2**: Parameters for UDF Function Definition Statements
+**Table 3**: Parameters for UDF Function Definition Statements
 
 Parameter Name | Description | Optional | Permissible Values
 -------------- | ----------- | -------- | ------------------
@@ -613,7 +635,7 @@ The builtin function `sum` operates on a matrix (say A of dimensionality (m x n)
 
 ### Matrix Construction, Manipulation, and Aggregation Built-In Functions
 
-**Table 3**: Matrix Construction, Manipulation, and Aggregation Built-In Functions
+**Table 4**: Matrix Construction, Manipulation, and Aggregation Built-In Functions
 
 Function | Description | Parameters | Example
 -------- | ----------- | ---------- | -------
@@ -637,7 +659,7 @@ sum() | Sum of all cells in matrix | Input: matrix <br/> Output: scalar | sum(X)
 
 ### Matrix and/or Scalar Comparison Built-In Functions
 
-**Table 4**: Matrix and/or Scalar Comparison Built-In Functions
+**Table 5**: Matrix and/or Scalar Comparison Built-In Functions
 
 Function | Description | Parameters | Example
 -------- | ----------- | ---------- | -------
@@ -648,7 +670,7 @@ ppred() | "parallel predicate".<br/> The relational operator specified in the th
 
 ### Casting Built-In Functions
 
-**Table 5**: Casting Built-In Functions
+**Table 6**: Casting Built-In Functions
 
 Function | Description | Parameters | Example
 -------- | ----------- | ---------- | -------
@@ -658,7 +680,7 @@ as.double(), <br/> as.integer(), <br/> as.logical() | A variable is cast as the
 
 ### Statistical Built-In Functions
 
-**Table 6**: Statistical Built-In Functions
+**Table 7**: Statistical Built-In Functions
 
 Function | Description | Parameters | Example
 -------- | ----------- | ---------- | -------
@@ -667,9 +689,9 @@ var() <br/> sd() | Return the variance/stdDev value of all cells in matrix | Inp
 moment() | Returns the kth central moment of values in a column matrix V, where k = 2, 3, or 4. It can be used to compute statistical measures like Variance, Kurtosis, and Skewness. This function also takes an optional weights parameter W. | Input: (X &lt;(n x 1) matrix&gt;, [W &lt;(n x 1) matrix&gt;),] k &lt;scalar&gt;) <br/> Output: &lt;scalar&gt; | A = rand(rows=100000,cols=1, pdf="normal") <br/> print("Variance from our (standard normal) random generator is approximately " + moment(A,2))
 colSums() <br/> colMeans() <br/> colVars() <br/> colSds() <br/> colMaxs() <br/> colMins() | Column-wise computations -- for each column, compute the sum/mean/variance/stdDev/max/min of cell values | Input: matrix <br/> Output: (1 x n) matrix | colSums(X) <br/> colMeans(X) <br/> colVars(X) <br/> colSds(X) <br/> colMaxs(X) <br/>colMins(X)
 cov() | Returns the covariance between two 1-dimensional column matrices X and Y. The function takes an optional weights parameter W. All column matrices X, Y, and W (when specified) must have the exact same dimension. | Input: (X &lt;(n x 1) matrix&gt;, Y &lt;(n x 1) matrix&gt; [, W &lt;(n x 1) matrix&gt;)]) <br/> Output: &lt;scalar&gt; | cov(X,Y) <br/> cov(X,Y,W)
-table() | Returns the contingency table of two vectors A and B. The resulting table F consists of max(A) rows and max(B) columns. <br/> More precisely, F[i,j] = \\|{ k \\| A[k] = i and B[k] = j, 1 \u2264 k \u2264 n }\\|, where A and B are two n-dimensional vectors. <br/> This function supports multiple other variants, which can be found below, at the end of this Table 6. | Input: (&lt;(n x 1) matrix&gt;, &lt;(n x 1) matrix&gt;), [&lt;(n x 1) matrix&gt;]) <br/> Output: &lt;matrix&gt; | F = table(A, B) <br/> F = table(A, B, C) <br/> And, several other forms (see below Table 6.)
-cdf()<br/> pnorm()<br/> pexp()<br/> pchisq()<br/> pf()<br/> pt()<br/> icdf()<br/> qnorm()<br/> qexp()<br/> qchisq()<br/> qf()<br/> qt() | p=cdf(target=q, ...) returns the cumulative probability P[X &lt;= q]. <br/> q=icdf(target=p, ...) returns the inverse cumulative probability i.e., it returns q such that the given target p = P[X&lt;=q]. <br/> For more details, please see the section "Probability Distribution Functions" below Table 6. | Input: (target=&lt;scalar&gt;, dist="...", ...) <br/> Output: &lt;scalar&gt; | p = cdf(target=q, dist="normal", mean=1.5, sd=2); is same as p=pnorm(target=q, mean=1.5, sd=2); <br/> q=icdf(target=p, dist="normal") is same as q=qnorm(target=p, mean=0,sd=1) <br/> More examples can be found in the section "Probability Distribution Functions" below Table 6.
-aggregate() | Splits/groups the values from X according to the corresponding values from G, and then applies the function fn on each group. <br/> The result F is a column matrix, in which each row contains the value computed from a distinct group in G. More specifically, F[k,1] = fn( {X[i,1] \\| 1&lt;=i&lt;=n and G[i,1] = k} ), where n = nrow(X) = nrow(G). <br/> Note that the distinct values in G are used as row indexes in the result matrix F. Therefore, nrow(F) = max(G). It is thus recommended that the values in G are consecutive and start from 1. <br/> This function supports multiple other variants, which can be found below, at the end of this Table 6. | Input:<br/> (target = X &lt;(n x 1) matrix, or matrix&gt;,<br/> &nbsp;&nbsp;&nbsp;groups = G &lt;(n x 1) matrix&gt;,<br/> &nbsp;&nbsp;&nbsp;fn= "..." <br/> &nbsp;&nbsp;&nbsp;[,weights= W&lt;(n x 1) matrix&gt;] <br/> &nbsp;&nbsp;&nbsp;[,ngroups=N] )<br/>Output: F &lt;matrix&gt; <br/> Note: X is a (n x 1) matrix unless ngroups is sp
 ecified with no weights, in which case X is a regular (n x m) matrix.<br/> The parameter fn takes one of the following functions: "count", "sum", "mean", "variance", "centralmoment". In the case of central moment, one must also provide the order of the moment that need to be computed (see example). | F = aggregate(target=X, groups=G, fn= "..." [,weights = W]) <br/> F = aggregate(target=X, groups=G1, fn= "sum"); <br/> F = aggregate(target=Y, groups=G2, fn= "mean", weights=W); <br/> F = aggregate(target=Z, groups=G3, fn= "centralmoment", order= "2"); <br/> And, several other forms (see below Table 6.)
+table() | Returns the contingency table of two vectors A and B. The resulting table F consists of max(A) rows and max(B) columns. <br/> More precisely, F[i,j] = \\|{ k \\| A[k] = i and B[k] = j, 1 \u2264 k \u2264 n }\\|, where A and B are two n-dimensional vectors. <br/> This function supports multiple other variants, which can be found below, at the end of this Table 7. | Input: (&lt;(n x 1) matrix&gt;, &lt;(n x 1) matrix&gt;), [&lt;(n x 1) matrix&gt;]) <br/> Output: &lt;matrix&gt; | F = table(A, B) <br/> F = table(A, B, C) <br/> And, several other forms (see below Table 7.)
+cdf()<br/> pnorm()<br/> pexp()<br/> pchisq()<br/> pf()<br/> pt()<br/> icdf()<br/> qnorm()<br/> qexp()<br/> qchisq()<br/> qf()<br/> qt() | p=cdf(target=q, ...) returns the cumulative probability P[X &lt;= q]. <br/> q=icdf(target=p, ...) returns the inverse cumulative probability i.e., it returns q such that the given target p = P[X&lt;=q]. <br/> For more details, please see the section "Probability Distribution Functions" below Table 7. | Input: (target=&lt;scalar&gt;, dist="...", ...) <br/> Output: &lt;scalar&gt; | p = cdf(target=q, dist="normal", mean=1.5, sd=2); is same as p=pnorm(target=q, mean=1.5, sd=2); <br/> q=icdf(target=p, dist="normal") is same as q=qnorm(target=p, mean=0,sd=1) <br/> More examples can be found in the section "Probability Distribution Functions" below Table 7.
+aggregate() | Splits/groups the values from X according to the corresponding values from G, and then applies the function fn on each group. <br/> The result F is a column matrix, in which each row contains the value computed from a distinct group in G. More specifically, F[k,1] = fn( {X[i,1] \\| 1&lt;=i&lt;=n and G[i,1] = k} ), where n = nrow(X) = nrow(G). <br/> Note that the distinct values in G are used as row indexes in the result matrix F. Therefore, nrow(F) = max(G). It is thus recommended that the values in G are consecutive and start from 1. <br/> This function supports multiple other variants, which can be found below, at the end of this Table 7. | Input:<br/> (target = X &lt;(n x 1) matrix, or matrix&gt;,<br/> &nbsp;&nbsp;&nbsp;groups = G &lt;(n x 1) matrix&gt;,<br/> &nbsp;&nbsp;&nbsp;fn= "..." <br/> &nbsp;&nbsp;&nbsp;[,weights= W&lt;(n x 1) matrix&gt;] <br/> &nbsp;&nbsp;&nbsp;[,ngroups=N] )<br/>Output: F &lt;matrix&gt; <br/> Note: X is a (n x 1) matrix unless ngroups is sp
 ecified with no weights, in which case X is a regular (n x m) matrix.<br/> The parameter fn takes one of the following functions: "count", "sum", "mean", "variance", "centralmoment". In the case of central moment, one must also provide the order of the moment that need to be computed (see example). | F = aggregate(target=X, groups=G, fn= "..." [,weights = W]) <br/> F = aggregate(target=X, groups=G1, fn= "sum"); <br/> F = aggregate(target=Y, groups=G2, fn= "mean", weights=W); <br/> F = aggregate(target=Z, groups=G3, fn= "centralmoment", order= "2"); <br/> And, several other forms (see below Table 7.)
 interQuartileMean() | Returns the mean of all x in X such that x&gt;quantile(X, 0.25) and x&lt;=quantile(X, 0.75). X, W are column matrices (vectors) of the same size. W contains the weights for data in X. | Input: (X &lt;(n x 1) matrix&gt; [, W &lt;(n x 1) matrix&gt;)]) <br/> Output: &lt;scalar&gt; | interQuartileMean(X) <br/> interQuartileMean(X, W)
 quantile () | The p-quantile for a random variable X is the value x such that Pr[X&lt;x] &lt;= p and Pr[X&lt;= x] &gt;= p <br/> let n=nrow(X), i=ceiling(p*n), quantile() will return X[i]. p is a scalar (0&lt;p&lt;1) that specifies the quantile to be computed. Optionally, a weight vector may be provided for X. | Input: (X &lt;(n x 1) matrix&gt;, [W &lt;(n x 1) matrix&gt;),] p &lt;scalar&gt;) <br/> Output: &lt;scalar&gt; | quantile(X, p) <br/> quantile(X, W, p)
 quantile () | Returns a column matrix with list of all quantiles requested in P. | Input: (X &lt;(n x 1) matrix&gt;, [W &lt;(n x 1) matrix&gt;),] P &lt;(q x 1) matrix&gt;) <br/> Output: matrix | quantile(X, P) <br/> quantile(X, W, P)
@@ -688,7 +710,7 @@ outer(vector1, vector2, "op") | Applies element wise binary operation "op" (for
 The built-in function table() supports different types of input parameters. These variations are described below:
 
   * Basic form: `F=table(A,B)`
-    As described above in Table 6.
+    As described above in Table 7.
   * Weighted form: `F=table(A,B,W)`
     Users can provide an optional third parameter C with the same dimensions as of A and B. In this case, the output F[i,j] = \u2211kC[k], where A[k] = i and B[k] = j (1 \u2264 k \u2264 n).
   * Scalar form
@@ -706,11 +728,11 @@ The built-in function table() supports different types of input parameters. Thes
 The built-in function aggregate() supports different types of input parameters. These variations are described below:
 
   * Basic form: `F=aggregate(target=X, groups=G, fn="sum")`
-    As described above in Table 6.
+    As described above in Table 7.
   * Weighted form: `F=aggregate(target=X, groups=G, weights=W, fn="sum")`
     Users can provide an optional parameter W with the same dimensions as of A and B. In this case, fn computes the weighted statistics over values from X, which are grouped by values from G.
   * Specified Output Size
-As noted in Table 6, the number of rows in the output matrix F is equal to the maximum value in the grouping matrix G. Therefore, the dimensions of F are known only after its execution is complete. When needed, users can precisely control the size of the output matrix via an additional argument, `ngroups`, as shown below: <br/>
+As noted in Table 7, the number of rows in the output matrix F is equal to the maximum value in the grouping matrix G. Therefore, the dimensions of F are known only after its execution is complete. When needed, users can precisely control the size of the output matrix via an additional argument, `ngroups`, as shown below: <br/>
     `F = aggregate(target=X, groups=G, fn="sum", ngroups=10);` <br/>
 The output F will have exactly 10 rows and 1 column. F may be a truncated or padded (with zeros) version of the output produced by `aggregate(target=X, groups=G, fn="sum")` \u2013 depending on the values of `ngroups` and `max(G)`. For example, if `max(G) < ngroups` then the last (`ngroups-max(G)`) rows will have zeros.
 
@@ -797,7 +819,7 @@ is same as
 
 ### Mathematical and Trigonometric Built-In Functions
 
-**Table 7**: Mathematical and Trigonometric Built-In Functions
+**Table 8**: Mathematical and Trigonometric Built-In Functions
 
 Function | Description | Parameters | Example
 -------- | ----------- | ---------- | -------
@@ -808,7 +830,7 @@ sign() | Returns a matrix representing the signs of the input matrix elements, w
 
 ### Linear Algebra Built-In Functions
 
-**Table 8**: Linear Algebra Built-In Functions
+**Table 9**: Linear Algebra Built-In Functions
 
 Function | Description | Parameters | Example
 -------- | ----------- | ---------- | -------
@@ -862,10 +884,10 @@ can span multiple part files.
 
 The binary format can only be read and written by SystemML.
 
-Let's look at a matrix and examples of its data represented in the supported formats with corresponding metadata. In Table 9, we have
+Let's look at a matrix and examples of its data represented in the supported formats with corresponding metadata. In the table below, we have
 a matrix consisting of 4 rows and 3 columns.
 
-**Table 9**: Matrix
+**Table 10**: Matrix
 
 <table>
 	<tr>
@@ -981,7 +1003,7 @@ that contains the scalar value 2.0.
 Metadata is represented as an MTD file that contains a single JSON object with the attributes described below.
 
 
-**Table 10**: MTD attributes
+**Table 11**: MTD attributes
 
 Parameter Name | Description | Optional | Permissible values | Data type valid for
 -------------- | ----------- | -------- | ------------------ | -------------------
@@ -999,7 +1021,7 @@ In addition, when reading or writing CSV files, the metadata may contain one or
 Note that this metadata can be specified as parameters to the `read` and `write` function calls.
 
 
-**Table 11**: Additional MTD attributes when reading/writing CSV files
+**Table 12**: Additional MTD attributes when reading/writing CSV files
 
 Parameter Name | Description | Optional | Permissible values | Data type valid for
 -------------- | ----------- | -------- | ------------------ | -------------------
@@ -1073,7 +1095,7 @@ Additionally, `readMM()` and `read.csv()` are supported and can be used instead
 #### Write Built-In Function
 
 The `write` method is used to persist `scalar` and `matrix` data to files in the local file system or HDFS. The syntax of `write` is shown below.
-The parameters are described in Table 12. Note that the set of supported parameters for `write` is NOT the same as for `read`.
+The parameters are described in Table 13. Note that the set of supported parameters for `write` is NOT the same as for `read`.
 SystemML writes an MTD file for the written data.
 
     write(identifier, "outputfile", [additional parameters])
@@ -1081,13 +1103,13 @@ SystemML writes an MTD file for the written data.
 The user can use constant string concatenation in the `"outputfile"` parameter to give the full path of the file, where `+` is used as the concatenation operator.
 
 
-**Table 12**: Parameters for `write()` method
+**Table 13**: Parameters for `write()` method
 
 Parameter Name | Description | Optional | Permissible Values
 -------------- | ----------- | -------- | ------------------
 `identifier` | Variable whose data is to be written to a file. Data can be `matrix` or `scalar`. | No | Any variable name
 `"outputfile"` | The path to the data file in the file system | No | Any valid filename
-`[additional parameters]` | See Tables 10 and 11 | |
+`[additional parameters]` | See Tables 11 and 12 | |
 
 ##### **Examples**
 
@@ -1180,7 +1202,7 @@ The transformations are specified to operate on individual columns. The set of a
 
 The following table indicates which transformations can be used simultaneously on a single column.
 
-**Table 13**: Data transformations that can be used simultaneously.
+**Table 14**: Data transformations that can be used simultaneously.
 
 <div style="float:left">
 <table>
@@ -1304,7 +1326,7 @@ The `transform()` function returns the actual transformed data in the form of a
 
 As an example of the `transform()` function, consider the following [`data.csv`](files/dml-language-reference/data.csv) file that represents a sample of homes data.
 
-**Table 14**: The [`data.csv`](files/dml-language-reference/data.csv) homes data set
+**Table 15**: The [`data.csv`](files/dml-language-reference/data.csv) homes data set
 
 zipcode | district | sqft | numbedrooms | numbathrooms | floors | view  | saleprice | askingprice
 --------|----------|------|-------------|--------------|--------|-------|-----------|------------
@@ -1448,7 +1470,7 @@ Note that the metadata generated during the training phase (located at `/user/ml
 
 ### Other Built-In Functions
 
-**Table 15**: Other Built-In Functions
+**Table 16**: Other Built-In Functions
 
 Function | Description | Parameters | Example
 -------- | ----------- | ---------- | -------


[33/50] [abbrv] incubator-systemml git commit: Upgraded to use jcuda8 (from the maven repo)

Posted by de...@apache.org.
Upgraded to use jcuda8 (from the maven repo)

Closes #291


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/be4eaaf2
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/be4eaaf2
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/be4eaaf2

Branch: refs/heads/gh-pages
Commit: be4eaaf2a9b27d0a611cedb8b1d53e9a0a6a9296
Parents: fd96a3e
Author: Nakul Jindal <na...@gmail.com>
Authored: Fri Mar 3 18:11:45 2017 -0800
Committer: Nakul Jindal <na...@gmail.com>
Committed: Fri Mar 3 18:11:46 2017 -0800

----------------------------------------------------------------------
 devdocs/gpu-backend.md | 61 +++++++++++++++++++--------------------------
 1 file changed, 26 insertions(+), 35 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/be4eaaf2/devdocs/gpu-backend.md
----------------------------------------------------------------------
diff --git a/devdocs/gpu-backend.md b/devdocs/gpu-backend.md
index c6f66d6..40311c7 100644
--- a/devdocs/gpu-backend.md
+++ b/devdocs/gpu-backend.md
@@ -19,52 +19,43 @@ limitations under the License.
 
 # Initial prototype for GPU backend
 
-A GPU backend implements two important abstract classes:
+The GPU backend implements two important abstract classes:
 1. `org.apache.sysml.runtime.controlprogram.context.GPUContext`
 2. `org.apache.sysml.runtime.controlprogram.context.GPUObject`
 
-The GPUContext is responsible for GPU memory management and initialization/destruction of Cuda handles.
+The `GPUContext` is responsible for GPU memory management and initialization/destruction of Cuda handles.
+Currently, an active instance of the `GPUContext` class is made available globally and is used to store handles
+of the allocated blocks on the GPU. A count is kept per block for the number of instructions that need it.
+When the count is 0, the block may be evicted on a call to `GPUObject.evict()`.
 
-A GPUObject (like RDDObject and BroadcastObject) is stored in CacheableData object. It gets call-backs from SystemML's bufferpool on following methods
+A `GPUObject` (like RDDObject and BroadcastObject) is stored in CacheableData object. It gets call-backs from SystemML's bufferpool on following methods
 1. void acquireDeviceRead()
-2. void acquireDenseDeviceModify(int numElemsToAllocate)
-3. void acquireHostRead()
-4. void acquireHostModify()
-5. void release(boolean isGPUCopyModified)
+2. void acquireDeviceModifyDense()
+3. void acquireDeviceModifySparse
+4. void acquireHostRead()
+5. void acquireHostModify()
+6. void releaseInput()
+7. void releaseOutput()
 
-## JCudaContext:
-The current prototype supports Nvidia's CUDA libraries using JCuda wrapper. The implementation for the above classes can be found in:
-1. `org.apache.sysml.runtime.controlprogram.context.JCudaContext`
-2. `org.apache.sysml.runtime.controlprogram.context.JCudaObject`
+Sparse matrices on GPU are represented in `CSR` format. In the SystemML runtime, they are represented in `MCSR` or modified `CSR` format.
+A conversion cost is incurred when sparse matrices are sent back and forth between host and device memory.
 
-### Setup instructions for JCudaContext:
+Concrete classes `JCudaContext` and `JCudaObject` (which extend `GPUContext` & `GPUObject` respectively) contain references to `org.jcuda.*`.
 
-1. Follow the instructions from `https://developer.nvidia.com/cuda-downloads` and install CUDA 7.5.
-2. Follow the instructions from `https://developer.nvidia.com/cudnn` and install CuDNN v4.
-3. Download install JCuda binaries version 0.7.5b and JCudnn version 0.7.5. Easiest option would be to use mavenized jcuda: 
-```python
-git clone https://github.com/MysterionRise/mavenized-jcuda.git
-mvn -Djcuda.version=0.7.5b -Djcudnn.version=0.7.5 clean package
-CURR_DIR=`pwd`
-JCUDA_PATH=$CURR_DIR"/target/lib/"
-JAR_PATH="."
-for j in `ls $JCUDA_PATH/*.jar`
-do
-        JAR_PATH=$JAR_PATH":"$j
-done
-export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JCUDA_PATH
-```
+The `LibMatrixCUDA` class contains methods to invoke CUDA libraries (where available) and invoke custom kernels. 
+Runtime classes (that extend `GPUInstruction`) redirect calls to functions in this class.
+Some functions in `LibMatrixCUDA` need finer control over GPU memory management primitives. These are provided by `JCudaObject`.
+
+### Setup instructions:
 
-Note for Windows users:
-* CuDNN v4 is available to download: `http://developer.download.nvidia.com/compute/redist/cudnn/v4/cudnn-7.0-win-x64-v4.0-prod.zip`
-* If above steps doesn't work for JCuda, copy the DLLs into C:\lib (or /lib) directory.
+1. Follow the instructions from `https://developer.nvidia.com/cuda-downloads` and install CUDA 8.0.
+2. Follow the instructions from `https://developer.nvidia.com/cudnn` and install CuDNN v5.1.
 
-To use SystemML's GPU backend, 
+To use SystemML's GPU backend when using the jar or uber-jar
 1. Add JCuda's jar into the classpath.
-2. Include CUDA, CuDNN and JCuda's libraries in LD_LIBRARY_PATH (or using -Djava.library.path).
-3. Use `-gpu` flag.
+2. Use `-gpu` flag.
 
 For example: to use GPU backend in standalone mode:
-```python
-java -classpath $JAR_PATH:systemml-0.10.0-incubating-SNAPSHOT-standalone.jar org.apache.sysml.api.DMLScript -f MyDML.dml -gpu -exec singlenode ... 
+```bash
+java -classpath $JAR_PATH:systemml-0.14.0-incubating-SNAPSHOT-standalone.jar org.apache.sysml.api.DMLScript -f MyDML.dml -gpu -exec singlenode ... 
 ```


[27/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1193] Update perftest runNaiveBayes.sh and doc for required probabilities parameter in naive-bayes-predict.dml

Posted by de...@apache.org.
[SYSTEMML-1193] Update perftest runNaiveBayes.sh and doc for required probabilities parameter in naive-bayes-predict.dml

Closes #353.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/0f92f401
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/0f92f401
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/0f92f401

Branch: refs/heads/gh-pages
Commit: 0f92f40182f32e7cf533e5cea28de7b6759666f3
Parents: 452a41a
Author: Glenn Weidner <gw...@us.ibm.com>
Authored: Mon Feb 13 14:56:36 2017 -0800
Committer: Glenn Weidner <gw...@us.ibm.com>
Committed: Mon Feb 13 14:56:36 2017 -0800

----------------------------------------------------------------------
 algorithms-classification.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/0f92f401/algorithms-classification.md
----------------------------------------------------------------------
diff --git a/algorithms-classification.md b/algorithms-classification.md
index 8d19d04..0ee43bf 100644
--- a/algorithms-classification.md
+++ b/algorithms-classification.md
@@ -1236,8 +1236,7 @@ val prediction = model.transform(X_test_df)
 SystemML Language Reference for details.
 
 **probabilities**: Location (on HDFS) to store class membership
-    probabilities for a held-out test set. Note that this is an
-    optional argument.
+    probabilities for a held-out test set.
 
 **accuracy**: Location (on HDFS) to store the training accuracy during
     learning and testing accuracy from a held-out test set


[25/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1241] Fix diag description in DML Language Reference

Posted by de...@apache.org.
[SYSTEMML-1241] Fix diag description in DML Language Reference

Fix incorrect description of diag() in DML Language Reference.
Make diag error message more descriptive.

Closes #387.
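
A small DML sketch of the corrected description (the values are illustrative): diag() builds an (n x n) diagonal matrix from an (n x 1) column vector, and extracts the (n x 1) diagonal from an (n x n) square matrix.

	v = matrix("1 2 3", rows=3, cols=1)
	D = diag(v)     # 3 x 3 matrix with 1, 2, 3 on the diagonal
	d = diag(D)     # 3 x 1 column vector holding the diagonal of D
	print(sum(D))   # 6, since only the diagonal entries are non-zero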


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/16950600
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/16950600
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/16950600

Branch: refs/heads/gh-pages
Commit: 16950600dcf067ca729ab3378a0de7db1d29a472
Parents: 51da13e
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Fri Feb 10 10:57:41 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Fri Feb 10 10:57:41 2017 -0800

----------------------------------------------------------------------
 dml-language-reference.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/16950600/dml-language-reference.md
----------------------------------------------------------------------
diff --git a/dml-language-reference.md b/dml-language-reference.md
index f3fba3b..05625fd 100644
--- a/dml-language-reference.md
+++ b/dml-language-reference.md
@@ -835,7 +835,7 @@ sign() | Returns a matrix representing the signs of the input matrix elements, w
 Function | Description | Parameters | Example
 -------- | ----------- | ---------- | -------
 cholesky() | Computes the Cholesky decomposition of symmetric input matrix A | Input: (A &lt;matrix&gt;) <br/> Output: &lt;matrix&gt; | <span style="white-space: nowrap;">A = matrix("4 12 -16 12 37 -43</span> -16 -43 98", rows=3, cols=3) <br/> B = cholesky(A)<br/> Matrix B: [[2, 0, 0], [6, 1, 0], [-8, 5, 3]]
-diag() | Create diagonal matrix from (n x 1) or (1 x n) matrix, or take diagonal from square matrix | Input: (n x 1) or (1 x n) matrix, or (n x n) matrix <br/> Output: (n x n) matrix, or (n x 1) matrix | diag(X)
+diag() | Create diagonal matrix from (n x 1) matrix, or take diagonal from square matrix | Input: (n x 1) matrix, or (n x n) matrix <br/> Output: (n x n) matrix, or (n x 1) matrix | D = diag(matrix(1.0, rows=3, cols=1))<br/> E = diag(matrix(1.0, rows=3, cols=3))
 eigen() | Computes Eigen decomposition of input matrix A. The Eigen decomposition consists of two matrices V and w such that A = V %\*% diag(w) %\*% t(V). The columns of V are the eigenvectors of the original matrix A. And, the eigen values are given by w. <br/> It is important to note that this function can operate only on small-to-medium sized input matrix that can fit in the main memory. For larger matrices, an out-of-memory exception is raised. | Input : (A &lt;matrix&gt;) <br/> Output : [w &lt;(m x 1) matrix&gt;, V &lt;matrix&gt;] <br/> A is a square symmetric matrix with dimensions (m x m). This function returns two matrices w and V, where w is (m x 1) and V is of size (m x m). | [w, V] = eigen(A)
 lu() | Computes Pivoted LU decomposition of input matrix A. The LU decomposition consists of three matrices P, L, and U such that P %\*% A = L %\*% U, where P is a permutation matrix that is used to rearrange the rows in A before the decomposition can be computed. L is a lower-triangular matrix whereas U is an upper-triangular matrix. <br/> It is important to note that this function can operate only on small-to-medium sized input matrix that can fit in the main memory. For larger matrices, an out-of-memory exception is raised. | Input : (A &lt;matrix&gt;) <br/> Output : [&lt;matrix&gt;, &lt;matrix&gt;, &lt;matrix&gt;] <br/> A is a square matrix with dimensions m x m. This function returns three matrices P, L, and U, all of which are of size m x m. | [P, L, U] = lu(A)
 qr() | Computes QR decomposition of input matrix A using Householder reflectors. The QR decomposition of A consists of two matrices Q and R such that A = Q%\*%R where Q is an orthogonal matrix (i.e., Q%\*%t(Q) = t(Q)%\*%Q = I, identity matrix) and R is an upper triangular matrix. For efficiency purposes, this function returns the matrix of Householder reflector vectors H instead of Q (which is a large m x m potentially dense matrix). The Q matrix can be explicitly computed from H, if needed. In most applications of QR, one is interested in calculating Q %\*% B or t(Q) %\*% B – and, both can be computed directly using H instead of explicitly constructing the large Q matrix. <br/> It is important to note that this function can operate only on small-to-medium sized input matrix that can fit in the main memory. For larger matrices, an out-of-memory exception is raised. | Input : (A &lt;matrix&gt;) <br/> Output : [&lt;matrix&gt;, &lt;matrix&gt;] <br/> A is a (m x n) matrix, which can either be a square matrix (m=n) or a rectangular matrix (m != n). This function returns two matrices H and R of size (m x n) i.e., same size as of the input matrix A. | [H, R] = qr(A)
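
The corrected `diag()` entry above can be exercised from Python through the MLContext API shown later in these pages. A minimal sketch, assuming a running pyspark shell (so `sc` exists), an installed `systemml` package, and that `sml.dml()` also accepts an inline DML string:

```python
# Minimal sketch of the corrected diag() semantics (assumptions noted above).
import systemml as sml

ml = sml.MLContext(sc)
script = sml.dml("""
D = diag(matrix(1.0, rows=3, cols=1))  # (3 x 1) vector -> (3 x 3) diagonal matrix
d = diag(D)                            # (3 x 3) square matrix -> (3 x 1) diagonal
""").output("d")
print(ml.execute(script).get("d").toNumPy())  # expected: a 3 x 1 vector of ones
```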


[09/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1187] Updated the documentation for removeEmpty with select and bugfix for relu_backward

Posted by de...@apache.org.
[SYSTEMML-1187] Updated the documentation for removeEmpty with select and
bugfix for relu_backward

Also, added a multi-input cbind external function.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/5b21588d
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/5b21588d
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/5b21588d

Branch: refs/heads/gh-pages
Commit: 5b21588d9281bbce13a6a9b432bafd71dcf26792
Parents: cc6f3c7
Author: Niketan Pansare <np...@us.ibm.com>
Authored: Sun Jan 22 19:12:13 2017 -0800
Committer: Niketan Pansare <np...@us.ibm.com>
Committed: Sun Jan 22 19:12:13 2017 -0800

----------------------------------------------------------------------
 dml-language-reference.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/5b21588d/dml-language-reference.md
----------------------------------------------------------------------
diff --git a/dml-language-reference.md b/dml-language-reference.md
index 80fc8ca..c828e70 100644
--- a/dml-language-reference.md
+++ b/dml-language-reference.md
@@ -628,7 +628,7 @@ nrow(), <br/> ncol(), <br/> length() | Return the number of rows, number of colu
 prod() | Return the product of all cells in matrix | Input: matrix <br/> Output: scalar | prod(X)
 rand() | Generates a random matrix | Input: (rows=&lt;value&gt;, cols=&lt;value&gt;, min=&lt;value&gt;, max=&lt;value&gt;, sparsity=&lt;value&gt;, pdf=&lt;string&gt;, seed=&lt;value&gt;) <br/> rows/cols: Number of rows/cols (expression) <br/> min/max: Min/max value for cells (either constant value, or variable that evaluates to constant value) <br/> sparsity: fraction of non-zero cells (constant value) <br/> pdf: "uniform" (min, max) distribution, or "normal" (0,1) distribution; or "poisson" (lambda=1) distribution. string; default value is "uniform". Note that, for the Poisson distribution, users can provide the mean/lambda parameter as follows: <br/> rand(rows=1000,cols=1000, pdf="poisson", lambda=2.5). <br/> The default value for lambda is 1. <br/> seed: Every invocation of rand() internally generates a random seed with which the cell values are generated. One can optionally provide a seed when repeatability is desired.  <br/> Output: matrix | X = rand(rows=10, cols=20, min=0, max=1, pdf="uniform", sparsity=0.2) <br/> The example generates a 10 x 20 matrix, with cell values uniformly chosen at random between 0 and 1, and approximately 20% of cells will have non-zero values.
 rbind() | Row-wise matrix concatenation. Concatenates the second matrix as additional rows to the first matrix | Input: (X &lt;matrix&gt;, Y &lt;matrix&gt;) <br/>Output: &lt;matrix&gt; <br/> X and Y are matrices, where the number of columns in X and the number of columns in Y are the same. | A = matrix(1, rows=2,cols=3) <br/> B = matrix(2, rows=2,cols=3) <br/> C = rbind(A,B) <br/> print("Dimensions of C: " + nrow(C) + " X " + ncol(C)) <br/> Output: <br/> Dimensions of C: 4 X 3
-removeEmpty() | Removes all empty rows or columns from the input matrix target X according to the specified margin. | Input : (target= X &lt;matrix&gt;, margin="...") <br/> Output : &lt;matrix&gt; <br/> Valid values for margin are "rows" or "cols". | A = removeEmpty(target=X, margin="rows")
+removeEmpty() | Removes all empty rows or columns from the input matrix target X according to the specified margin. Also allows a filter F to be applied before removing the empty rows/cols. | Input : (target= X &lt;matrix&gt;, margin="...", select=F) <br/> Output : &lt;matrix&gt; <br/> Valid values for margin are "rows" or "cols". | A = removeEmpty(target=X, margin="rows", select=F)
 replace() | Creates a copy of input matrix X, where all values that are equal to the scalar pattern s1 are replaced with the scalar replacement s2. | Input : (target= X &lt;matrix&gt;, pattern=&lt;scalar&gt;, replacement=&lt;scalar&gt;) <br/> Output : &lt;matrix&gt; <br/> If s1 is NaN, then all NaN values of X are treated as equal and hence replaced with s2. Positive and negative infinity are treated as different values. | A = replace(target=X, pattern=s1, replacement=s2)
 rev() | Reverses the rows in a matrix | Input : (&lt;matrix&gt;) <br/> Output : &lt;matrix&gt; | <span style="white-space: nowrap;">A = matrix("1 2 3 4", rows=2, cols=2)</span> <br/> <span style="white-space: nowrap;">B = matrix("1 2 3 4", rows=4, cols=1)</span> <br/> <span style="white-space: nowrap;">C = matrix("1 2 3 4", rows=1, cols=4)</span> <br/> revA = rev(A) <br/> revB = rev(B) <br/> revC = rev(C) <br/> Matrix revA: [[3, 4], [1, 2]]<br/> Matrix revB: [[4], [3], [2], [1]]<br/> Matrix revC: [[1, 2, 3, 4]]<br/>
 seq() | Creates a single column vector with values starting from &lt;from&gt;, to &lt;to&gt;, in increments of &lt;increment&gt; | Input: (&lt;from&gt;, &lt;to&gt;, &lt;increment&gt;) <br/> Output: &lt;matrix&gt; | S = seq (10, 200, 10)
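
The new `select` parameter can be tried the same way. A minimal sketch, assuming a pyspark shell (so `sc` exists), an installed `systemml` package, and that `sml.dml()` accepts an inline DML string; the data values are illustrative only:

```python
# Minimal sketch of removeEmpty() with a select filter (assumptions noted above).
import systemml as sml

ml = sml.MLContext(sc)
script = sml.dml("""
X = matrix("1 2 0 0 5 6 7 8", rows=4, cols=2)  # row 2 is empty
F = matrix("1 1 1 0", rows=4, cols=1)          # filter applied before pruning empty rows
A = removeEmpty(target=X, margin="rows", select=F)
""").output("A")
print(ml.execute(script).get("A").toNumPy())
```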


[10/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1190] Allow Scala UDF to be passed to SystemML via external UDF mechanism

Posted by de...@apache.org.
[SYSTEMML-1190] Allow Scala UDF to be passed to SystemML via external UDF mechanism

The registration mechanism is inspired by Spark SQLContext's UDF. The
key construct is ml.udf.register("fn to be used in DML", scala UDF).

The restrictions for Scala UDF are as follows:
- Only types specified by the DML language are supported for parameters and return types (i.e., Int, Double, Boolean, String, double[][]).
- At minimum, the function should have 1 argument and 1 return value.
- At most, the function can have 10 arguments and 10 return values.

Closes #349.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/45fab153
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/45fab153
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/45fab153

Branch: refs/heads/gh-pages
Commit: 45fab15340234a74b799b1c488e93f8037a59307
Parents: 5b21588
Author: Niketan Pansare <np...@us.ibm.com>
Authored: Mon Jan 23 13:27:35 2017 -0800
Committer: Niketan Pansare <np...@us.ibm.com>
Committed: Mon Jan 23 13:31:07 2017 -0800

----------------------------------------------------------------------
 spark-mlcontext-programming-guide.md | 39 +++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/45fab153/spark-mlcontext-programming-guide.md
----------------------------------------------------------------------
diff --git a/spark-mlcontext-programming-guide.md b/spark-mlcontext-programming-guide.md
index dcaa125..759d392 100644
--- a/spark-mlcontext-programming-guide.md
+++ b/spark-mlcontext-programming-guide.md
@@ -1636,6 +1636,45 @@ scala> for (i <- 1 to 5) {
 
 </div>
 
+## Passing Scala UDF to SystemML
+
+SystemML allows users to pass a Scala UDF (with input/output types supported by SystemML)
+to a DML script via MLContext. The restrictions for the supported Scala UDFs are as follows:
+
+1. Only types specified by the DML language are supported for parameters and return types (i.e., Int, Double, Boolean, String, double[][]).
+2. At minimum, the function should have 1 argument and 1 return value.
+3. At most, the function can have 10 arguments and 10 return values.
+
+{% highlight scala %}
+import org.apache.sysml.api.mlcontext._
+import org.apache.sysml.api.mlcontext.ScriptFactory._
+val ml = new MLContext(sc)
+
+// Demonstrates how to pass a simple scala UDF to SystemML
+def addOne(x:Double):Double = x + 1
+ml.udf.register("addOne", addOne _)
+val script1 = dml("v = addOne(2.0); print(v)")
+ml.execute(script1)
+
+// Demonstrates operation on local matrices (double[][])
+def addOneToDiagonal(x:Array[Array[Double]]):Array[Array[Double]] = {  for(i <- 0 to x.length-1) x(i)(i) = x(i)(i) + 1; x }
+ml.udf.register("addOneToDiagonal", addOneToDiagonal _)
+val script2 = dml("m1 = matrix(0, rows=3, cols=3); m2 = addOneToDiagonal(m1); print(toString(m2));")
+ml.execute(script2)
+
+// Demonstrates multi-return function
+def multiReturnFn(x:Double):(Double, Int) = (x + 1, (x * 2).toInt)
+ml.udf.register("multiReturnFn", multiReturnFn _)
+val script3 = dml("[v1, v2] = multiReturnFn(2.0); print(v1)")
+ml.execute(script3)
+
+// Demonstrates multi-argument multi-return function
+def multiArgReturnFn(x:Double, y:Int):(Double, Int) = (x + 1, (x * y).toInt)
+ml.udf.register("multiArgReturnFn", multiArgReturnFn _)
+val script4 = dml("[v1, v2] = multiArgReturnFn(2.0, 1); print(v2)")
+ml.execute(script4)
+{% endhighlight %}
+
 ---
 
 # Jupyter (PySpark) Notebook Example - Poisson Nonnegative Matrix Factorization


[18/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1212] Link to main website in header of project docs

Posted by de...@apache.org.
[SYSTEMML-1212] Link to main website in header of project docs

Link logo and title to main website.

Closes #367.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/fc9914db
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/fc9914db
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/fc9914db

Branch: refs/heads/gh-pages
Commit: fc9914db92915ceec2c0bdf2228a98fb36edd948
Parents: 61f25f2
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Thu Feb 2 17:02:09 2017 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Thu Feb 2 17:02:09 2017 -0800

----------------------------------------------------------------------
 _layouts/global.html | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/fc9914db/_layouts/global.html
----------------------------------------------------------------------
diff --git a/_layouts/global.html b/_layouts/global.html
index 5aac166..6c87e0c 100644
--- a/_layouts/global.html
+++ b/_layouts/global.html
@@ -25,10 +25,10 @@
             <div class="container">
                 <div class="navbar-header">
                     <div class="navbar-brand brand projectlogo">
-                        <img class="logo" src="img/systemml-logo.png" alt="Apache SystemML (incubating)" title="Apache SystemML (incubating)"/>
+                        <a href="http://systemml.apache.org/"><img class="logo" src="img/systemml-logo.png" alt="Apache SystemML (incubating)" title="Apache SystemML (incubating)"/></a>
                     </div>
                     <div class="navbar-brand brand projecttitle">
-                        <a href="index.html">Apache SystemML<sup id="trademark">\u2122</sup> (incubating)</a><br/>
+                        <a href="http://systemml.apache.org/">Apache SystemML<sup id="trademark">\u2122</sup> (incubating)</a><br/>
                         <span class="version">{{site.SYSTEMML_VERSION}}</span>
                     </div>
                     <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target=".navbar-collapse">


[06/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1170] Clean Up Python Documentation For Next Release

Posted by de...@apache.org.
[SYSTEMML-1170] Clean Up Python Documentation For Next Release

Cleanup of Python documentation.

Closes #335.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/94cf7c15
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/94cf7c15
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/94cf7c15

Branch: refs/heads/gh-pages
Commit: 94cf7c15b161a729f50ffec84e761b343e3ab2f9
Parents: 8268255
Author: Mike Dusenberry <mw...@us.ibm.com>
Authored: Mon Jan 9 14:02:08 2017 -0800
Committer: Mike Dusenberry <mw...@us.ibm.com>
Committed: Mon Jan 9 14:02:08 2017 -0800

----------------------------------------------------------------------
 README.md                            |   3 +-
 beginners-guide-python.md            | 128 ++++++++++++++++++------------
 index.md                             |  13 +--
 spark-mlcontext-programming-guide.md |  66 +++++++--------
 4 files changed, 111 insertions(+), 99 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/94cf7c15/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index 6906c8d..5a4b175 100644
--- a/README.md
+++ b/README.md
@@ -27,6 +27,7 @@ Jekyll (and optionally Pygments) can be installed on the Mac OS in the following
     $ brew install ruby
     $ gem install jekyll
     $ gem install jekyll-redirect-from
+    $ gem install bundler
     $ brew install python
     $ pip install Pygments
     $ gem install pygments.rb
@@ -38,4 +39,4 @@ documentation. From there, you can have Jekyll convert the markdown files to HTM
 Jekyll will serve up the generated documentation by default at http://127.0.0.1:4000. Modifications
 to *.md files will be converted to HTML and can be viewed in a web browser.
 
-    $ jekyll serve -w
\ No newline at end of file
+    $ jekyll serve -w

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/94cf7c15/beginners-guide-python.md
----------------------------------------------------------------------
diff --git a/beginners-guide-python.md b/beginners-guide-python.md
index c919f3f..8bd957a 100644
--- a/beginners-guide-python.md
+++ b/beginners-guide-python.md
@@ -54,7 +54,8 @@ If you already have an Apache Spark installation, you can skip this step.
 /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
 brew tap caskroom/cask
 brew install Caskroom/cask/java
-brew install apache-spark
+brew tap homebrew/versions
+brew install apache-spark16
 ```
 </div>
 <div data-lang="Linux" markdown="1">
@@ -70,37 +71,60 @@ brew install apache-spark16
 
 ### Install SystemML
 
-We are working towards uploading the python package on pypi. Until then, please use following commands: 
+We are working towards uploading the Python package to PyPI. Until then, please use the following
+commands:
 
+<div class="codetabs">
+<div data-lang="Python 2" markdown="1">
 ```bash
 git clone https://github.com/apache/incubator-systemml.git
 cd incubator-systemml
 mvn clean package -P distribution
 pip install target/systemml-0.12.0-incubating-SNAPSHOT-python.tgz
 ```
-
-The above commands will install Python package and place the corresponding Java binaries (along with algorithms) into the installed location.
-To find the location of the downloaded Java binaries, use the following command:
-
+</div>
+<div data-lang="Python 3" markdown="1">
 ```bash
-python -c 'import imp; import os; print os.path.join(imp.find_module("systemml")[1], "systemml-java")'
+git clone https://github.com/apache/incubator-systemml.git
+cd incubator-systemml
+mvn clean package -P distribution
+pip3 install target/systemml-0.12.0-incubating-SNAPSHOT-python.tgz
 ```
+</div>
+</div>
 
-Note: the user is free to either use the prepackaged Java binaries 
-or download them from [SystemML website](http://systemml.apache.org/download.html) 
-or build them from the [source](https://github.com/apache/incubator-systemml).
-
+### Uninstall SystemML
 To uninstall SystemML, please use the following command:
 
+<div class="codetabs">
+<div data-lang="Python 2" markdown="1">
 ```bash
-pip uninstall systemml-incubating
+pip uninstall systemml
 ```
+</div>
+<div data-lang="Python 3" markdown="1">
+```bash
+pip3 uninstall systemml
+```
+</div>
+</div>
 
 ### Start Pyspark shell
 
+<div class="codetabs">
+<div data-lang="Python 2" markdown="1">
 ```bash
-pyspark --master local[*]
+pyspark
 ```
+</div>
+<div data-lang="Python 3" markdown="1">
+```bash
+PYSPARK_PYTHON=python3 pyspark
+```
+</div>
+</div>
+
+---
 
 ## Matrix operations
 
@@ -118,20 +142,20 @@ m4.sum(axis=1).toNumPy()
 
 Output:
 
-```bash
+```python
 array([[-60.],
        [-60.],
        [-60.]])
 ```
 
 Let us now write a simple script to train [linear regression](https://apache.github.io/incubator-systemml/algorithms-regression.html#linear-regression) 
-model: $ \beta = solve(X^T X, X^T y) $. For simplicity, we will use direct-solve method and ignore regularization parameter as well as intercept. 
+model: $ \beta = solve(X^T X, X^T y) $. For simplicity, we will use the direct-solve method and ignore
+the regularization parameter as well as the intercept.
 
 ```python
 import numpy as np
 from sklearn import datasets
 import systemml as sml
-from pyspark.sql import SQLContext
 # Load the diabetes dataset
 diabetes = datasets.load_diabetes()
 # Use only one feature
@@ -158,7 +182,10 @@ Output:
 Residual sum of squares: 25282.12
 ```
 
-We can improve the residual error by adding an intercept and regularization parameter. To do so, we will use `mllearn` API described in the next section.
+We can improve the residual error by adding an intercept and a regularization parameter. To do so, we
+will use the `mllearn` API described in the next section.
+
+---
 
 ## Invoke SystemML's algorithms
 
@@ -206,7 +233,7 @@ algorithm on digits datasets.
 
 ```python
 # Scikit-learn way
-from sklearn import datasets, neighbors
+from sklearn import datasets
 from systemml.mllearn import LogisticRegression
 from pyspark.sql import SQLContext
 sqlCtx = SQLContext(sc)
@@ -233,7 +260,7 @@ LogisticRegression score: 0.922222
 To train the above algorithm on a larger dataset, we can load the dataset into a DataFrame and pass it to the `fit` method:
 
 ```python
-from sklearn import datasets, neighbors
+from sklearn import datasets
 from systemml.mllearn import LogisticRegression
 from pyspark.sql import SQLContext
 import pandas as pd
@@ -245,7 +272,7 @@ X_digits = digits.data
 y_digits = digits.target
 n_samples = len(X_digits)
 # Split the data into training/testing sets and convert to PySpark DataFrame
-df_train = sml.convertToLabeledDF(sqlContext, X_digits[:int(.9 * n_samples)], y_digits[:int(.9 * n_samples)])
+df_train = sml.convertToLabeledDF(sqlCtx, X_digits[:int(.9 * n_samples)], y_digits[:int(.9 * n_samples)])
 X_test = sqlCtx.createDataFrame(pd.DataFrame(X_digits[int(.9 * n_samples):]))
 logistic = LogisticRegression(sqlCtx)
 logistic.fit(df_train)
@@ -274,18 +301,18 @@ from pyspark.ml.feature import HashingTF, Tokenizer
 from pyspark.sql import SQLContext
 sqlCtx = SQLContext(sc)
 training = sqlCtx.createDataFrame([
-    (0L, "a b c d e spark", 1.0),
-    (1L, "b d", 2.0),
-    (2L, "spark f g h", 1.0),
-    (3L, "hadoop mapreduce", 2.0),
-    (4L, "b spark who", 1.0),
-    (5L, "g d a y", 2.0),
-    (6L, "spark fly", 1.0),
-    (7L, "was mapreduce", 2.0),
-    (8L, "e spark program", 1.0),
-    (9L, "a e c l", 2.0),
-    (10L, "spark compile", 1.0),
-    (11L, "hadoop software", 2.0)
+    (0, "a b c d e spark", 1.0),
+    (1, "b d", 2.0),
+    (2, "spark f g h", 1.0),
+    (3, "hadoop mapreduce", 2.0),
+    (4, "b spark who", 1.0),
+    (5, "g d a y", 2.0),
+    (6, "spark fly", 1.0),
+    (7, "was mapreduce", 2.0),
+    (8, "e spark program", 1.0),
+    (9, "a e c l", 2.0),
+    (10, "spark compile", 1.0),
+    (11, "hadoop software", 2.0)
 ], ["id", "text", "label"])
 tokenizer = Tokenizer(inputCol="text", outputCol="words")
 hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=20)
@@ -293,10 +320,10 @@ lr = LogisticRegression(sqlCtx)
 pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
 model = pipeline.fit(training)
 test = sqlCtx.createDataFrame([
-    (12L, "spark i j k"),
-    (13L, "l m n"),
-    (14L, "mapreduce spark"),
-    (15L, "apache hadoop")], ["id", "text"])
+    (12, "spark i j k"),
+    (13, "l m n"),
+    (14, "mapreduce spark"),
+    (15, "apache hadoop")], ["id", "text"])
 prediction = model.transform(test)
 prediction.show()
 ```
@@ -304,27 +331,28 @@ prediction.show()
 Output:
 
 ```bash
-+--+---------------+--------------------+--------------------+--------------------+---+----------+
-|id|           text|               words|            features|         probability| ID|prediction|
-+--+---------------+--------------------+--------------------+--------------------+---+----------+
-|12|    spark i j k|ArrayBuffer(spark...|(20,[5,6,7],[2.0,...|[0.99999999999975...|1.0|       1.0|
-|13|          l m n|ArrayBuffer(l, m, n)|(20,[8,9,10],[1.0...|[1.37552128844736...|2.0|       2.0|
-|14|mapreduce spark|ArrayBuffer(mapre...|(20,[5,10],[1.0,1...|[0.99860290938153...|3.0|       1.0|
-|15|  apache hadoop|ArrayBuffer(apach...|(20,[9,14],[1.0,1...|[5.41688748236143...|4.0|       2.0|
-+--+---------------+--------------------+--------------------+--------------------+---+----------+
++-------+---+---------------+------------------+--------------------+--------------------+----------+
+|__INDEX| id|           text|             words|            features|         probability|prediction|
++-------+---+---------------+------------------+--------------------+--------------------+----------+
+|    1.0| 12|    spark i j k|  [spark, i, j, k]|(20,[5,6,7],[2.0,...|[0.99999999999975...|       1.0|
+|    2.0| 13|          l m n|         [l, m, n]|(20,[8,9,10],[1.0...|[1.37552128844736...|       2.0|
+|    3.0| 14|mapreduce spark|[mapreduce, spark]|(20,[5,10],[1.0,1...|[0.99860290938153...|       1.0|
+|    4.0| 15|  apache hadoop|  [apache, hadoop]|(20,[9,14],[1.0,1...|[5.41688748236143...|       2.0|
++-------+---+---------------+------------------+--------------------+--------------------+----------+
 ```
 
+---
+
 ## Invoking DML/PyDML scripts using MLContext
 
 The below example demonstrates how to invoke the algorithm [scripts/algorithms/MultiLogReg.dml](https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/MultiLogReg.dml)
 using Python [MLContext API](https://apache.github.io/incubator-systemml/spark-mlcontext-programming-guide).
 
 ```python
-from sklearn import datasets, neighbors
-from pyspark.sql import DataFrame, SQLContext
+from sklearn import datasets
+from pyspark.sql import SQLContext
 import systemml as sml
 import pandas as pd
-import os, imp
 sqlCtx = SQLContext(sc)
 digits = datasets.load_digits()
 X_digits = digits.data
@@ -334,8 +362,8 @@ n_samples = len(X_digits)
 X_df = sqlCtx.createDataFrame(pd.DataFrame(X_digits[:int(.9 * n_samples)]))
 y_df = sqlCtx.createDataFrame(pd.DataFrame(y_digits[:int(.9 * n_samples)]))
 ml = sml.MLContext(sc)
-# Get the path of MultiLogReg.dml
-scriptPath = os.path.join(imp.find_module("systemml")[1], 'systemml-java', 'scripts', 'algorithms', 'MultiLogReg.dml')
-script = sml.dml(scriptPath).input(X=X_df, Y_vec=y_df).output("B_out")
+# Run the MultiLogReg.dml script at the given URL
+scriptUrl = "https://raw.githubusercontent.com/apache/incubator-systemml/master/scripts/algorithms/MultiLogReg.dml"
+script = sml.dml(scriptUrl).input(X=X_df, Y_vec=y_df).output("B_out")
 beta = ml.execute(script).get('B_out').toNumPy()
 ```

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/94cf7c15/index.md
----------------------------------------------------------------------
diff --git a/index.md b/index.md
index 6b91654..fe8361a 100644
--- a/index.md
+++ b/index.md
@@ -42,13 +42,11 @@ To download SystemML, visit the [downloads](http://systemml.apache.org/download)
 
 ## Running SystemML
 
+* **[Beginner's Guide For Python Users](beginners-guide-python)** - Beginner's Guide for Python users.
 * **[Spark MLContext](spark-mlcontext-programming-guide)** - Spark MLContext is a programmatic API
 for running SystemML from Spark via Scala, Python, or Java.
-  * See the [Spark MLContext Programming Guide](spark-mlcontext-programming-guide) with the
-  following examples:
-    * [**Spark Shell (Scala)**](spark-mlcontext-programming-guide#spark-shell-example---new-api)
-    * [**Zeppelin Notebook (Scala)**](spark-mlcontext-programming-guide#zeppelin-notebook-example---linear-regression-algorithm---old-api)
-    * [**Jupyter Notebook (PySpark)**](spark-mlcontext-programming-guide#jupyter-pyspark-notebook-example---poisson-nonnegative-matrix-factorization---old-api)
+  * [**Spark Shell Example (Scala)**](spark-mlcontext-programming-guide#spark-shell-example)
+  * [**Jupyter Notebook Example (PySpark)**](spark-mlcontext-programming-guide#jupyter-pyspark-notebook-example---poisson-nonnegative-matrix-factorization)
 * **[Spark Batch](spark-batch-mode)** - Algorithms are automatically optimized to run across Spark clusters.
   * See [Invoking SystemML in Spark Batch Mode](spark-batch-mode) for detailed information.
 * **[Hadoop Batch](hadoop-batch-mode)** - Algorithms are automatically optimized when distributed across Hadoop clusters.
@@ -62,16 +60,13 @@ machine in R-like and Python-like declarative languages.
 
 ## Language Guides
 
+* [Python API Reference](python-reference) - API Reference Guide for Python users.
 * [DML Language Reference](dml-language-reference) -
 DML is a high-level R-like declarative language for machine learning.
 * **PyDML Language Reference** **(Coming Soon)** -
 PyDML is a high-level Python-like declarative language for machine learning.
 * [Beginner's Guide to DML and PyDML](beginners-guide-to-dml-and-pydml) -
 An introduction to the basics of DML and PyDML.
-* [Beginner's Guide for Python users](beginners-guide-python) -
-Beginner's Guide for Python users.
-* [Reference Guide for Python users](python-reference) -
-Reference Guide for Python users.
 
 ## ML Algorithms
 

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/94cf7c15/spark-mlcontext-programming-guide.md
----------------------------------------------------------------------
diff --git a/spark-mlcontext-programming-guide.md b/spark-mlcontext-programming-guide.md
index fbc8f5b..dcaa125 100644
--- a/spark-mlcontext-programming-guide.md
+++ b/spark-mlcontext-programming-guide.md
@@ -35,14 +35,10 @@ such as Scala, Java, and Python. As a result, it offers a convenient way to inte
 Shell and from Notebooks such as Jupyter and Zeppelin.
 
 **NOTE: A new MLContext API has been redesigned for future SystemML releases. The old API is available
-in all versions of SystemML but will be deprecated and removed, so please migrate to the new API.**
+in previous versions of SystemML but is deprecated and will be removed soon, so please migrate to the new API.**
 
 
-# Spark Shell Example - NEW API
-
-**NOTE: The new MLContext API will be available in future SystemML releases. It can be used
-by building the project using Maven ('mvn clean package', or 'mvn clean package -P distribution').
-For SystemML version 0.10.0 and earlier, please see the documentation regarding the old API.**
+# Spark Shell Example
 
 ## Start Spark Shell with SystemML
 
@@ -1644,25 +1640,8 @@ scala> for (i <- 1 to 5) {
 
 # Jupyter (PySpark) Notebook Example - Poisson Nonnegative Matrix Factorization
 
-Similar to the Scala API, SystemML also provides a Python MLContext API.  In addition to the
-regular `SystemML.jar` file, you'll need to install the Python API as follows:
-
-  * Latest release:
-    * Python 2:
-
-      ```
-      pip install systemml
-      # Bleeding edge: pip install git+git://github.com/apache/incubator-systemml.git#subdirectory=src/main/python
-      ```
-
-    * Python 3:
-
-      ```
-      pip3 install systemml
-      # Bleeding edge: pip3 install git+git://github.com/apache/incubator-systemml.git#subdirectory=src/main/python
-      ```
-  * Don't forget to download the `SystemML.jar` file, which can be found in the latest release, or
-  in a nightly build.
+Similar to the Scala API, SystemML also provides a Python MLContext API.  Before usage, you'll need
+**[to install it first](beginners-guide-python#download--setup)**.
 
 Here, we'll explore the use of SystemML via PySpark in a [Jupyter notebook](http://jupyter.org/).
 This Jupyter notebook example can be nicely viewed in a rendered state
@@ -1671,17 +1650,18 @@ and can be [downloaded here](https://raw.githubusercontent.com/apache/incubator-
 
 From the directory with the downloaded notebook, start Jupyter with PySpark:
 
-  * Python 2:
-
-    ```
-    PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master local[*] --driver-class-path SystemML.jar --jars SystemML.jar
-    ```
-
-  * Python 3:
-
-    ```
-    PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master local[*] --driver-class-path SystemML.jar --jars SystemML.jar
-    ```
+<div class="codetabs">
+<div data-lang="Python 2" markdown="1">
+```bash
+PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
+```
+</div>
+<div data-lang="Python 3" markdown="1">
+```bash
+PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
+```
+</div>
+</div>
 
 This will open Jupyter in a browser:
 
@@ -1797,6 +1777,9 @@ plt.title('PNMF Training Loss')
 
 # Spark Shell Example - OLD API
 
+### **NOTE: This API is old and has been deprecated.**
+**Please use the [new MLContext API](spark-mlcontext-programming-guide#spark-shell-example) instead.**
+
 ## Start Spark Shell with SystemML
 
 To use SystemML with the Spark Shell, the SystemML jar can be referenced using the Spark Shell's `--jars` option.
@@ -2216,11 +2199,13 @@ val (min, max, mean) = minMaxMean(sysMlMatrix, numRows, numCols, ml)
 
 </div>
 
-
-* * *
+---
 
 # Zeppelin Notebook Example - Linear Regression Algorithm - OLD API
 
+### **NOTE: This API is old and has been deprecated.**
+**Please use the [new MLContext API](spark-mlcontext-programming-guide#spark-shell-example) instead.**
+
 Next, we'll consider an example of a SystemML linear regression algorithm run from Spark through an Apache Zeppelin notebook.
 Instructions to clone and build Zeppelin can be found at the [GitHub Apache Zeppelin](https://github.com/apache/incubator-zeppelin)
 site. This example also will look at the Spark ML linear regression algorithm.
@@ -2701,10 +2686,13 @@ Training time per iter: 0.2334166666666667 seconds
 {% endhighlight %}
 
 
-* * *
+---
 
 # Jupyter (PySpark) Notebook Example - Poisson Nonnegative Matrix Factorization - OLD API
 
+### **NOTE: This API is old and has been deprecated.**
+**Please use the [new MLContext API](spark-mlcontext-programming-guide#jupyter-pyspark-notebook-example---poisson-nonnegative-matrix-factorization) instead.**
+
 Here, we'll explore the use of SystemML via PySpark in a [Jupyter notebook](http://jupyter.org/).
 This Jupyter notebook example can be nicely viewed in a rendered state
 [on GitHub](https://github.com/apache/incubator-systemml/blob/master/samples/jupyter-notebooks/SystemML-PySpark-Recommendation-Demo.ipynb),