Posted to commits@systemml.apache.org by du...@apache.org on 2015/12/02 02:05:04 UTC

[26/47] incubator-systemml git commit: [SYSML-301] Update Algorithm Ref MathJax to render on GitHub

[SYSML-301] Update Algorithm Ref MathJax to render on GitHub

Fix MD syntax so the math renders properly on GitHub Pages
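
Background for the change: kramdown, the default Markdown engine for GitHub Pages,
only recognizes "$$...$$" as a math delimiter and passes single-"$" spans through as
literal text, so inline expressions written with single dollars never reach MathJax.
A minimal before/after sketch, using one of the expressions touched in the diff below:

    Not rendered (single-dollar span is left as plain text by kramdown):
        ... estimated as $\mu_i = \beta_0 + x_i\beta_{1:m}$.

    Rendered (double-dollar span is recognized by kramdown and handed to MathJax):
        ... estimated as $$\mu_i = \beta_0 + x_i\beta_{1:m}$$.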


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/b966a815
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/b966a815
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/b966a815

Branch: refs/heads/gh-pages
Commit: b966a815c45754f3f637edd8e1f32856ef1b270c
Parents: 8fd0f74
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Fri Sep 11 15:46:37 2015 -0700
Committer: Luciano Resende <lr...@apache.org>
Committed: Fri Sep 11 15:46:37 2015 -0700

----------------------------------------------------------------------
 _layouts/global.html                 |  3 +-
 algorithms-classification.md         | 16 +++++-----
 algorithms-descriptive-statistics.md | 52 +++++++++++++++----------------
 algorithms-regression.md             | 44 +++++++++++++-------------
 index.md                             |  4 +--
 5 files changed, 60 insertions(+), 59 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/b966a815/_layouts/global.html
----------------------------------------------------------------------
diff --git a/_layouts/global.html b/_layouts/global.html
index 2a75531..7b2deea 100644
--- a/_layouts/global.html
+++ b/_layouts/global.html
@@ -25,7 +25,7 @@
         <div class="navbar navbar-fixed-top" id="topbar">
             <div class="navbar-inner">
                 <div class="container">
-                    <div class="brand" style="padding: 15px 0px; font-size: 20px; font-style: italic; font-weight: bold;"><a href="index.html">SystemML - {{site.SYSTEMML_VERSION}}</a>
+                    <div class="brand" style="padding: 15px 0px; font-size: 20px; font-style: italic; font-weight: bold;"><a href="index.html">SystemML {{site.SYSTEMML_VERSION}}</a>
                     </div>
                     <ul class="nav">
                         <li><a href="index.html">Home</a></li>
@@ -36,6 +36,7 @@
                             
                                 <li><a href="http://www.github.com/SparkTC/systemml">SystemML GitHub README</a></li>
                                 <li><a href="quick-start-guide.html">Quick Start Guide</a></li>
+                                <!-- <li><a href="programming-guide.html">Programming Guide</a></li> -->
                                 <li><a href="algorithms-reference.html">Algorithms Reference</a></li>
                                 <li><a href="dml-language-reference.html">DML Language Reference</a></li>
                                 <li class="divider"></li>

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/b966a815/algorithms-classification.md
----------------------------------------------------------------------
diff --git a/algorithms-classification.md b/algorithms-classification.md
index da46ded..8a16cd0 100644
--- a/algorithms-classification.md
+++ b/algorithms-classification.md
@@ -33,7 +33,7 @@ Just as linear regression estimates the mean value $\mu_i$ of a
 numerical response variable, logistic regression does the same for
 category label probabilities. In linear regression, the mean of $y_i$ is
 estimated as a linear combination of the features:
-$\mu_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} = \beta_0 + x_i\beta_{1:m}$.
+$$\mu_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} = \beta_0 + x_i\beta_{1:m}$$.
 In logistic regression, the label probability has to lie between 0
 and 1, so a link function is applied to connect it to
 $\beta_0 + x_i\beta_{1:m}$. If there are just two possible category
@@ -46,10 +46,10 @@ Prob[y_i\,{=}\,0\mid x_i; \beta] \,=\,
 \frac{1}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}}$$
 
 Here category label 0
-serves as the *baseline*, and function $\exp(\beta_0 + x_i\beta_{1:m})$
+serves as the *baseline*, and function $$\exp(\beta_0 + x_i\beta_{1:m})$$
 shows how likely we expect to see “$y_i = 1$” in comparison to the
 baseline. Like in a loaded coin, the predicted odds of seeing 1 versus 0
-are $\exp(\beta_0 + x_i\beta_{1:m})$ to 1, with each feature $x_{i,j}$
+are $$\exp(\beta_0 + x_i\beta_{1:m})$$ to 1, with each feature $$x_{i,j}$$
 multiplying its own factor $\exp(\beta_j x_{i,j})$ to the odds. Given a
 large collection of pairs $(x_i, y_i)$, $i=1\ldots n$, logistic
 regression seeks to find the $\beta_j$’s that maximize the product of
@@ -63,11 +63,11 @@ $k \geq 3$ possible categories. Again we identify one category as the
 baseline, for example the $k$-th category. Instead of a coin, here we
 have a loaded multisided die, one side per category. Each non-baseline
 category $l = 1\ldots k\,{-}\,1$ has its own vector
-$(\beta_{0,l}, \beta_{1,l}, \ldots, \beta_{m,l})$ of regression
+$$(\beta_{0,l}, \beta_{1,l}, \ldots, \beta_{m,l})$$ of regression
 parameters with the intercept, making up a matrix $B$ of size
 $(m\,{+}\,1)\times(k\,{-}\,1)$. The predicted odds of seeing
 non-baseline category $l$ versus the baseline $k$ are
-$\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)$
+$$\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)$$
 to 1, and the predicted probabilities are: 
 
 $$
@@ -101,7 +101,7 @@ $$
 
 The optional regularization term is added to
 mitigate overfitting and degeneracy in the data; to reduce bias, the
-intercepts $\beta_{0,l}$ are not regularized. Once the $\beta_{j,l}$’s
+intercepts $$\beta_{0,l}$$ are not regularized. Once the $\beta_{j,l}$’s
 are accurately estimated, we can make predictions about the category
 label $y$ for a new feature vector $x$ using
 Eqs. (1) and (2).
@@ -137,7 +137,7 @@ represent the (same) baseline category and are converted to label
 $\max(\texttt{Y})\,{+}\,1$.
 
 **B**: Location to store the matrix of estimated regression parameters (the
-$\beta_{j, l}$’s), with the intercept parameters $\beta_{0, l}$ at
+$$\beta_{j, l}$$’s), with the intercept parameters $\beta_{0, l}$ at
 position B\[$m\,{+}\,1$, $l$\] if available.
 The size of B is $(m\,{+}\,1)\times (k\,{-}\,1)$ with the
 intercepts or $m \times (k\,{-}\,1)$ without the intercepts, one column
@@ -221,7 +221,7 @@ Newton method for logistic regression described in [[Lin2008]](algorithms-biblio
 For convenience, let us make some changes in notation:
 
   * Convert the input vector of observed category labels into an indicator
-matrix $Y$ of size $n \times k$ such that $Y_{i, l} = 1$ if the $i$-th
+matrix $Y$ of size $n \times k$ such that $$Y_{i, l} = 1$$ if the $i$-th
 category label is $l$ and $Y_{i, l} = 0$ otherwise.
   * Append an extra column of all ones, i.e. $(1, 1, \ldots, 1)^T$, as the
 $m\,{+}\,1$-st column to the feature matrix $X$ to represent the

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/b966a815/algorithms-descriptive-statistics.md
----------------------------------------------------------------------
diff --git a/algorithms-descriptive-statistics.md b/algorithms-descriptive-statistics.md
index dd276af..6c56344 100644
--- a/algorithms-descriptive-statistics.md
+++ b/algorithms-descriptive-statistics.md
@@ -184,7 +184,7 @@ order, preserving duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$.
 **Figure 1**: The computation of quartiles, median, and interquartile mean from the
 empirical distribution function of the 10-point
 sample {2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8}.  Each vertical step in
-the graph has height $1{/}n = 0.1$.  Values $q_{25\%}$, $q_{50\%}$, and $q_{75\%}$ denote
+the graph has height $1{/}n = 0.1$.  Values $$q_{25\%}$$, $$q_{50\%}$$, and $$q_{75\%}$$ denote
 the $1^{\textrm{st}}$, $2^{\textrm{nd}}$, and $3^{\textrm{rd}}$ quartiles correspondingly;
 value $\mu$ denotes the median.  Values $\phi_1$ and $\phi_2$ show the partial contribution
 of border points (quartiles) $v_3=3.7$ and $v_8=6.4$ into the interquartile mean.
@@ -214,7 +214,7 @@ median, we sort the sample in the increasing order, preserving
 duplicates: $v^s_1 \leq v^s_2 \leq \ldots \leq v^s_n$. If $n$ is odd,
 the median equals $v^s_i$ where $i = (n\,{+}\,1)\,{/}\,2$, same as the
 $50^{\textrm{th}}$ percentile of the sample. If $n$ is even, there are
-two “middle” values $v^s_{n/2}$ and $v^s_{n/2\,+\,1}$, so we compute the
+two “middle” values $$v^s_{n/2}$$ and $$v^s_{n/2\,+\,1}$$, so we compute the
 median as the mean of these two values. (For even $n$ we compute the
 $50^{\textrm{th}}$ percentile as $v^s_{n/2}$, not as the median.)
 Example: the median of sample {2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1,
@@ -269,7 +269,7 @@ quantitative (scale) data feature.
 around their mean, expressed in units that are the square of those of
 the feature itself. Computed as the sum of squared differences between
 the values in the sample and their mean, divided by one less than the
-number of values: $\sum_{i=1}^n (v_i - \bar{v})^2\,/\,(n\,{-}\,1)$ where
+number of values: $$\sum_{i=1}^n (v_i - \bar{v})^2\,/\,(n\,{-}\,1)$$ where
 $\bar{v}=\left(\sum_{i=1}^n v_i\right)/n$. Example: the variance of
 sample {2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8}
 equals 3.24. Note that at least two values ($n\geq 2$) are required to
@@ -357,8 +357,8 @@ Skewness is computed as the $3^{\textrm{rd}}$ central moment divided by
 the cube of the standard deviation. We estimate the
 $3^{\textrm{rd}}$ central moment as the sum of cubed differences between
 the values in the feature column and their sample mean, divided by the
-number of values: $\sum_{i=1}^n (v_i - \bar{v})^3 / n$ where
-$\bar{v}=\left(\sum_{i=1}^n v_i\right)/n$. The standard deviation is
+number of values: $$\sum_{i=1}^n (v_i - \bar{v})^3 / n$$ where
+$$\bar{v}=\left(\sum_{i=1}^n v_i\right)/n$$. The standard deviation is
 computed as described above in *standard deviation*. To avoid division
 by 0, at least two different sample values are required. Example: for
 sample {2.2, 3.2, 3.7, 4.4, 5.3, 5.7, 6.1, 6.4, 7.2, 7.8} with the
@@ -411,7 +411,7 @@ divided by the $4^{\textrm{th}}$ power of the standard deviation,
 minus 3. We estimate the $4^{\textrm{th}}$ central moment as the sum of
 the $4^{\textrm{th}}$ powers of differences between the values in the
 feature column and their sample mean, divided by the number of values:
-$\sum_{i=1}^n (v_i - \bar{v})^4 / n$ where
+$$\sum_{i=1}^n (v_i - \bar{v})^4 / n$$ where
 $\bar{v}=\left(\sum_{i=1}^n v_i\right)/n$. The standard deviation is
 computed as described above, see *standard deviation*.
 
@@ -634,7 +634,7 @@ Below we list all bivariate statistics computed by script
 `bivar-stats.dml`. The statistics are collected into
 several groups by the type of their input features. We refer to the two
 input features as $v_1$ and $v_2$ unless specified otherwise; the value
-pairs are $(v_{1,i}, v_{2,i})$ for $i=1,\ldots,n$, where $n$ is the
+pairs are $$(v_{1,i}, v_{2,i})$$ for $i=1,\ldots,n$, where $n$ is the
 number of rows in `X`, i.e. the sample size.
 
 
@@ -653,7 +653,7 @@ $$r
 $$
 
 Commonly denoted by $r$, correlation ranges between $-1$ and $+1$,
-reaching ${\pm}1$ when all value pairs $(v_{1,i}, v_{2,i})$ lie on the
+reaching ${\pm}1$ when all value pairs $$(v_{1,i}, v_{2,i})$$ lie on the
 same line. Correlation near 0 means that a line is not a good way to
 represent the dependence between the two features; however, this does
 not imply independence. The sign indicates direction of the linear
@@ -665,9 +665,9 @@ not change if we transform $v_1$ and $v_2$ to $a + b v_1$ and
 $c + d v_2$ where $a, b, c, d$ are constants and $b, d > 0$.
 
 Suppose that we use simple linear regression to represent one feature
-given the other, say represent $v_{2,i} \approx \alpha + \beta v_{1,i}$
+given the other, say represent $$v_{2,i} \approx \alpha + \beta v_{1,i}$$
 by selecting $\alpha$ and $\beta$ to minimize the least-squares error
-$\sum_{i=1}^n (v_{2,i} - \alpha - \beta v_{1,i})^2$. Then the best error
+$$\sum_{i=1}^n (v_{2,i} - \alpha - \beta v_{1,i})^2$$. Then the best error
 equals
 
 $$\min_{\alpha, \beta} \,\,\sum_{i=1}^n \big(v_{2,i} - \alpha - \beta v_{1,i}\big)^2 \,\,=\,\,
@@ -694,7 +694,7 @@ But we do not know these (hypothesized) probabilities; we only know the
 sample frequency counts. Let $n_{a,b}$ be the frequency count of pair
 $(a, b)$, let $n_a$ and $n_b$ be the frequency counts of $a$ alone and
 of $b$ alone. Under independence, difference
-$n_{a,b}{/}n - (n_a{/}n)(n_b{/}n)$ is unlikely to be exactly 0 due to
+$$n_{a,b}{/}n - (n_a{/}n)(n_b{/}n)$$ is unlikely to be exactly 0 due to
 sample randomness, yet it is unlikely to be too far from 0. For some
 pairs $(a,b)$ it may deviate from 0 farther than for other pairs.
 Pearson’s $\chi^2$ is an aggregate measure that combines
@@ -703,7 +703,7 @@ squares of these differences across all value pairs:
 $$\chi^2 \,\,=\,\, \sum_{a,\,b} \Big(\frac{n_a n_b}{n}\Big)^{-1} \Big(n_{a,b} - \frac{n_a n_b}{n}\Big)^2
 \,=\,\, \sum_{a,\,b} \frac{(O_{a,b} - E_{a,b})^2}{E_{a,b}}$$
 
-where $O_{a,b} = n_{a,b}$ are the *observed* frequencies and
+where $$O_{a,b} = n_{a,b}$$ are the *observed* frequencies and
 $E_{a,b} = (n_a n_b){/}n$ are the *expected* frequencies for all
 pairs $(a,b)$. Under independence (plus other standard assumptions) the
 sample $\chi^2$ closely follows a well-known distribution, making it a
@@ -802,10 +802,10 @@ $$\eta^2 \,=\, 1 - \frac{\sum_{i=1}^{n} \big(y_i - \hat{y}[x_i]\big)^2}{\sum_{i=
 \hat{y}[x] = \frac{1}{\mathop{\mathrm{freq}}(x)}\sum_{i=1}^n  
 \,\left\{\!\!\begin{array}{rl} y_i & \textrm{if $x_i = x$}\\ 0 & \textrm{otherwise}\end{array}\right.\!\!\!$$
 
-and $\bar{y} = (1{/}n)\sum_{i=1}^n y_i$ is the mean. Value $\hat{y}[x]$
+and $$\bar{y} = (1{/}n)\sum_{i=1}^n y_i$$ is the mean. Value $\hat{y}[x]$
 is the average of $y_i$ among all records where $x_i = x$; it can also
 be viewed as the “predictor” of $y$ given $x$. Then
-$\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2$ is the residual error
+$$\sum_{i=1}^{n} (y_i - \hat{y}[x_i])^2$$ is the residual error
 sum-of-squares and $\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total
 sum-of-squares for $y$. Hence, $\eta^2$ measures the accuracy of
 predicting $y$ with $x$, just like the “R-squared” statistic measures
@@ -887,10 +887,10 @@ coefficient is geared towards features having small value domains
 and large counts for the values. Given the two input vectors, we form a
 contingency table $T$ of pairwise frequency counts, as well as a vector
 of frequency counts for each feature: $f_1$ and $f_2$. Here in
-$T_{i,j}$, $f_{1,i}$, $f_{2,j}$ indices $i$ and $j$ refer to the
+$$T_{i,j}$$, $$f_{1,i}$$, $$f_{2,j}$$ indices $i$ and $j$ refer to the
 order-preserving integer encoding of the feature values. We use prefix
 sums over $f_1$ and $f_2$ to compute the values’ average ranks:
-$r_{1,i} = \sum_{j=1}^{i-1} f_{1,j} + (f_{1,i}\,{+}\,1){/}2$, and
+$$r_{1,i} = \sum_{j=1}^{i-1} f_{1,j} + (f_{1,i}\,{+}\,1){/}2$$, and
 analogously for $r_2$. Finally, we compute rank variances for $r_1, r_2$
 weighted by counts $f_1, f_2$ and their covariance weighted by $T$,
 before applying the standard formula for Pearson’s correlation
@@ -899,7 +899,7 @@ coefficient:
 $$\rho \,\,=\,\, \frac{Cov_T(r_1, r_2)}{\sqrt{Var_{f_1}(r_1)Var_{f_2}(r_2)}}
 \,\,=\,\, \frac{\sum_{i,j} T_{i,j} (r_{1,i} - \bar{r}_1) (r_{2,j} - \bar{r}_2)}{\sqrt{\sum_i f_{1,i} (r_{1,i} - \bar{r}_1)^{2\mathstrut} \cdot \sum_j f_{2,j} (r_{2,j} - \bar{r}_2)^{2\mathstrut}}}$$
 
-where $\bar{r_1} = \sum_i r_{1,i} f_{1,i}{/}n$, analogously
+where $$\bar{r_1} = \sum_i r_{1,i} f_{1,i}{/}n$$, analogously
 for $\bar{r}_2$. The value of $\rho$ lies between $-1$ and $+1$, with
 sign indicating the prevalent direction of the association: $\rho > 0$
 ($\rho < 0$) means that one feature tends to increase (decrease) when
@@ -1226,9 +1226,9 @@ $$y_{i,j} \,=\, \alpha_i + \beta x_{i,j} + {\varepsilon}_{i,j}\,, \quad\textrm{w
 Here $i = 1\ldots k$ is a stratum number and
 $j = 1\ldots n_i$ is a record number in stratum $i$; by $n_i$ we denote
 the number of records available in stratum $i$. The noise
-term $\varepsilon_{i,j}$ is assumed to have the same variance in all
-strata. When $n_i\,{>}\,0$, we can estimate the means of $x_{i, j}$ and
-$y_{i, j}$ in stratum $i$ as
+term $$\varepsilon_{i,j}$$ is assumed to have the same variance in all
+strata. When $n_i\,{>}\,0$, we can estimate the means of $$x_{i, j}$$ and
+$$y_{i, j}$$ in stratum $i$ as
 
 $$\bar{x}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,x_{i, j}\Big) / n_i\,;\quad
 \bar{y}_i \,= \Big(\sum\nolimits_{j=1}^{n_i} \,y_{i, j}\Big) / n_i$$
@@ -1259,8 +1259,8 @@ estimates for $Var(X)$ and
 $Var(Y)$ tend to be smaller
 than the non-stratified ones (with the global mean instead of
 $\bar{x_i}$ and $\bar{y_i}$) since $\bar{x_i}$ and $\bar{y_i}$ fit
-closer to $x_{i,j}$ and $y_{i,j}$ than the global means. The stratified
-variance estimates the uncertainty in $x_{i,j}$ and $y_{i,j}$ given
+closer to $$x_{i,j}$$ and $$y_{i,j}$$ than the global means. The stratified
+variance estimates the uncertainty in $$x_{i,j}$$ and $$y_{i,j}$$ given
 their stratum $i$.
 
 Minimizing over $\beta$ the error sum-of-squares 
@@ -1274,13 +1274,13 @@ $$\mathrm{RSS} \,\,=\, \,
 \,\,=\,\,  V_y \,\big(1 \,-\, V_{x,y}^2 / (V_x V_y)\big)$$
 
 The quantity
-$\hat{R}^2 = V_{x,y}^2 / (V_x V_y)$, called *$R$-squared*, estimates the
-fraction of stratified variance in $y_{i,j}$ explained by covariate
-$x_{i, j}$ in the linear regression model. We
+$$\hat{R}^2 = V_{x,y}^2 / (V_x V_y)$$, called *$R$-squared*, estimates the
+fraction of stratified variance in $$y_{i,j}$$ explained by covariate
+$$x_{i, j}$$ in the linear regression model. We
 define *stratified correlation* as the square root of $\hat{R}^2$ taken
 with the sign of $V_{x,y}$. We also use RSS to estimate the residual
 standard deviation $\sigma$ in the linear regression model that models the
-prediction error of $y_{i,j}$ given $x_{i,j}$ and the stratum:
+prediction error of $$y_{i,j}$$ given $$x_{i,j}$$ and the stratum:
 
 $$\hat{\beta}\, =\, \frac{V_{x,y}}{V_x}; \,\,\,\, \hat{R} \,=\, \frac{V_{x,y}}{\sqrt{V_x V_y}};
 \,\,\,\, \hat{R}^2 \,=\, \frac{V_{x,y}^2}{V_x V_y};

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/b966a815/algorithms-regression.md
----------------------------------------------------------------------
diff --git a/algorithms-regression.md b/algorithms-regression.md
index be39a27..0302a18 100644
--- a/algorithms-regression.md
+++ b/algorithms-regression.md
@@ -618,7 +618,7 @@ binomial distributions. Here $\mu$ is the Bernoulli mean.
 | Name                  | Link Function |
 | --------------------- | ------------- |
 | Logit   | $\displaystyle \eta = 1 / \big(1 + e^{-\mu}\big)^{\mathstrut}$
-| Probit  | $\displaystyle \mu  = \frac{1}{\sqrt{2\pi}}\int\nolimits_{-\infty_{\mathstrut}}^{\,\eta\mathstrut} e^{-\frac{t^2}{2}} dt$
+| Probit  | $$\displaystyle \mu  = \frac{1}{\sqrt{2\pi}}\int\nolimits_{-\infty_{\mathstrut}}^{\,\eta\mathstrut} e^{-\frac{t^2}{2}} dt$$
 | Cloglog | $\displaystyle \eta = \log \big(- \log(1 - \mu)\big)^{\mathstrut}$
 | Cauchit | $\displaystyle \eta = \tan\pi(\mu - 1/2)$
 
@@ -688,7 +688,7 @@ matrix $Y$ having 1 or 2 columns. If a power distribution family is
 selected (`dfam=1`), matrix $Y$ must have 1 column that
 provides $y_i$ for each $x_i$ in the corresponding row of matrix $X$.
 When dfam=2 and $Y$ has 1 column, we assume the Bernoulli
-distribution for $y_i\in\{y_{\mathrm{neg}}, 1\}$ with $y_{\mathrm{neg}}$
+distribution for $$y_i\in\{y_{\mathrm{neg}}, 1\}$$ with $y_{\mathrm{neg}}$
 from the input parameter `yneg`. When `dfam=2` and
 $Y$ has 2 columns, we assume the binomial distribution; for each row $i$
 in $X$, cells $Y[i, 1]$ and $Y[i, 2]$ provide the positive and the
@@ -872,7 +872,7 @@ fractional, but the actual $y_i$ is always integer.
 
 If $y_i$ is categorical, i.e. a vector of label counts for record $i$,
 then $\mu_i$ is a vector of non-negative real numbers, one number
-$\mu_{i,l}$ per each label $l$. In this case we divide the $\mu_{i,l}$
+$$\mu_{i,l}$$ per each label $l$. In this case we divide the $$\mu_{i,l}$$
 by their sum $\sum_l \mu_{i,l}$ to obtain predicted label
 probabilities . The output matrix $M$ is the
 $n \times (k\,{+}\,1)$-matrix of these probabilities, where $n$ is the
@@ -1185,7 +1185,7 @@ extra goodness-of-fit measure. To compute these statistics, we use:
 which $y_{i,j}$ is the number of times label $j$ was observed in
 record $i$
   * the model-estimated probability matrix $P$ of the same dimensions that
-satisfies $\sum_{j=1}^{k+1} p_{i,j} = 1$ for all $i=1,\ldots,n$ and
+satisfies $$\sum_{j=1}^{k+1} p_{i,j} = 1$$ for all $i=1,\ldots,n$ and
 where $p_{i,j}$ is the model probability of observing label $j$ in
 record $i$
   * the $n\,{\times}\,1$-vector $N$ where $N_i$ is the aggregated count of
@@ -1259,7 +1259,7 @@ The number of
 degrees of freedom \#d.f. for the $\chi^2$ distribution is $n - m$ for
 numerical data and $(n - m)k$ for categorical data, where
 $k = \mathop{\texttt{ncol}}(Y) - 1$. Given the dispersion parameter
-`disp` the $X^2$ statistic is scaled by division: $X^2_{\texttt{disp}} = X^2 / \texttt{disp}$. If the
+`disp` the $X^2$ statistic is scaled by division: $$X^2_{\texttt{disp}} = X^2 / \texttt{disp}$$. If the
 dispersion is accurate, $X^2 / \texttt{disp}$ should be close to \#d.f.
 In fact, $X^2 / \textrm{\#d.f.}$ over the *training* data is the
 dispersion estimator used in our `GLM.dml` script,
@@ -1271,7 +1271,7 @@ the training data and the test data.
 NOTE: For categorical data, both Pearson’s $X^2$ and the deviance $G^2$
 are unreliable (i.e. do not approach the $\chi^2$ distribution) unless
 the predicted means of multi-label counts
-$\mu_{i,j} = N_i \hspace{0.5pt} p_{i,j}$ are fairly large: all
+$$\mu_{i,j} = N_i \hspace{0.5pt} p_{i,j}$$ are fairly large: all
 ${\geq}\,1$ and 80% are at least $5$ [[Cochran1954]](algorithms-bibliography.html). They should not
 be used for “one label per record” categoricals.
 
@@ -1288,7 +1288,7 @@ $$
 
 The “saturated” model sets the mean
 $\mu_i^{\mathrm{sat}}$ to equal $y_i$ for every record (for categorical
-data, $p_{i,j}^{sat} = y_{i,j} / N_i$), which represents the
+data, $$p_{i,j}^{sat} = y_{i,j} / N_i$$), which represents the
 “perfect fit.” For records with $y_{i,j} \in \{0, N_i\}$ or otherwise at
 a boundary, by continuity we set $0 \log 0 = 0$. The GLM likelihood
 functions defined in (5) become simplified in
@@ -1310,31 +1310,31 @@ Pearson’s $X^2$, see above.
 The rest of the statistics are computed separately for each column
 of $Y$. As explained above, $Y$ has two or more columns in bi- and
 multinomial case, either at input or after conversion. Moreover, each
-$y_{i,j}$ in record $i$ with $N_i \geq 2$ is counted as $N_i$ separate
-observations $y_{i,j,l}$ of 0 or 1 (where $l=1,\ldots,N_i$) with
-$y_{i,j}$ ones and $N_i-y_{i,j}$ zeros. For power distributions,
+$$y_{i,j}$$ in record $i$ with $N_i \geq 2$ is counted as $N_i$ separate
+observations $$y_{i,j,l}$$ of 0 or 1 (where $l=1,\ldots,N_i$) with
+$$y_{i,j}$$ ones and $$N_i-y_{i,j}$$ zeros. For power distributions,
 including linear regression, $Y$ has only one column and all $N_i = 1$,
 so the statistics are computed for all $Y$ with each record counted
-once. Below we denote $N = \sum_{i=1}^n N_i \,\geq n$. Here is the total
+once. Below we denote $$N = \sum_{i=1}^n N_i \,\geq n$$. Here is the total
 average and the residual average (residual bias) of $y_{i,j,l}$ for each
 $Y$-column:
 
 $$\texttt{AVG_TOT_Y}_j   \,=\, \frac{1}{N} \sum_{i=1}^n  y_{i,j}; \quad
 \texttt{AVG_RES_Y}_j   \,=\, \frac{1}{N} \sum_{i=1}^n \, (y_{i,j} - \mu_{i,j})$$
 
-Dividing by $N$ (rather than $n$) gives the averages for $y_{i,j,l}$
-(rather than $y_{i,j}$). The total variance, and the standard deviation,
-for individual observations $y_{i,j,l}$ is estimated from the total
-variance for response values $y_{i,j}$ using independence assumption:
-$Var \,y_{i,j} = Var \sum_{l=1}^{N_i} y_{i,j,l} = \sum_{l=1}^{N_i} Var y_{i,j,l}$.
+Dividing by $N$ (rather than $n$) gives the averages for $$y_{i,j,l}$$
+(rather than $$y_{i,j}$$). The total variance, and the standard deviation,
+for individual observations $$y_{i,j,l}$$ is estimated from the total
+variance for response values $$y_{i,j}$$ using independence assumption:
+$$Var \,y_{i,j} = Var \sum_{l=1}^{N_i} y_{i,j,l} = \sum_{l=1}^{N_i} Var y_{i,j,l}$$.
 This allows us to estimate the sum of squares for $y_{i,j,l}$ via the
-sum of squares for $y_{i,j}$: 
+sum of squares for $$y_{i,j}$$: 
 
 $$\texttt{STDEV_TOT_Y}_j \,=\, 
 \Bigg[\frac{1}{N-1} \sum_{i=1}^n  \Big( y_{i,j} -  \frac{N_i}{N} \sum_{i'=1}^n  y_{i'\!,j}\Big)^2\Bigg]^{1/2}$$
 
 Analogously, we estimate the standard deviation of the residual
-$y_{i,j,l} - \mu_{i,j,l}$: 
+$$y_{i,j,l} - \mu_{i,j,l}$$: 
 
 $$\texttt{STDEV_RES_Y}_j \,=\, 
 \Bigg[\frac{1}{N-m'} \,\sum_{i=1}^n  \Big( y_{i,j} - \mu_{i,j} -  \frac{N_i}{N} \sum_{i'=1}^n  (y_{i'\!,j} - \mu_{i'\!,j})\Big)^2\Bigg]^{1/2}$$
@@ -1363,8 +1363,8 @@ $m$ with the intercept or $m+1$ without the intercept.
 
 | Statistic             | Formula |
 | --------------------- | ------------- |
-| $\texttt{PLAIN_R2}_j$ | $ \displaystyle 1 - \frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n  y_{i',j} \Big)^{2}} $
-| $\texttt{ADJUSTED_R2}_j$ | $ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m}}  \, \frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n  y_{i',j} \Big)^{2}} $
+| $\texttt{PLAIN_R2}_j$ | $$ \displaystyle 1 - \frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n  y_{i',j} \Big)^{2}} $$
+| $\texttt{ADJUSTED_R2}_j$ | $$ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m}}  \, \frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n  y_{i',j} \Big)^{2}} $$
  
 
 * * *
@@ -1374,8 +1374,8 @@ $m$ with the intercept or $m+1$ without the intercept.
 
 | Statistic             | Formula |
 | --------------------- | ------------- |
-| $\texttt{PLAIN_R2_NOBIAS}_j$ | $ \displaystyle 1 - \frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n  (y_{i',j} \,{-}\, \mu_{i',j}) \Big)^{2}}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n y_{i',j} \Big)^{2}} $
-| $\texttt{ADJUSTED_R2_NOBIAS}_j$ | $ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m'}} \, \frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n  (y_{i',j} \,{-}\, \mu_{i',j}) \Big)^{2}}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n y_{i',j} \Big)^{2}} $
+| $\texttt{PLAIN_R2_NOBIAS}_j$ | $$ \displaystyle 1 - \frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n  (y_{i',j} \,{-}\, \mu_{i',j}) \Big)^{2}}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n y_{i',j} \Big)^{2}} $$
+| $\texttt{ADJUSTED_R2_NOBIAS}_j$ | $$ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m'}} \, \frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n  (y_{i',j} \,{-}\, \mu_{i',j}) \Big)^{2}}{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}} \sum\limits_{i'=1}^n y_{i',j} \Big)^{2}} $$
 
 
 * * *

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/b966a815/index.md
----------------------------------------------------------------------
diff --git a/index.md b/index.md
index 49f69d8..70efd1d 100644
--- a/index.md
+++ b/index.md
@@ -5,18 +5,18 @@ title: SystemML Overview
 description: SystemML documentation homepage
 ---
 
-SystemML is a flexible, scalable machine learning (ML) library written in Java.
+SystemML is a flexible, scalable machine learning (ML) language written in Java.
 SystemML's distinguishing characteristics are: (1) algorithm customizability,
 (2) multiple execution modes, including Standalone, Hadoop Batch, and Spark Batch,
 and (3) automatic optimization.
 
-
 ## SystemML Documentation
 
 For more information about SystemML, please consult the following references:
 
 * [SystemML GitHub README](http://www.github.com/SparkTC/systemml)
 * [Quick Start Guide](quick-start-guide.html)
+<!-- * [Programming Guide](programming-guide.html) -->
 * [Algorithms Reference](algorithms-reference.html)
 * [DML (Declarative Machine Learning) Language Reference](dml-language-reference.html)
 * PYDML (Python-Like Declarative Machine Learning) Language Reference - **Coming Soon**