Posted to commits@systemml.apache.org by de...@apache.org on 2017/04/07 18:58:47 UTC

[43/50] [abbrv] incubator-systemml git commit: [SYSTEMML-1393] Exclude alg ref and lang ref dirs from doc site

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/GLM.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/GLM.tex b/Algorithms Reference/GLM.tex
deleted file mode 100644
index 8555a5b..0000000
--- a/Algorithms Reference/GLM.tex	
+++ /dev/null
@@ -1,431 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Generalized Linear Models (GLM)}
-\label{sec:GLM}
-
-\noindent{\bf Description}
-\smallskip
-
-Generalized Linear Models~\cite{Gill2000:GLM,McCullagh1989:GLM,Nelder1972:GLM}
-extend the methodology of linear and logistic regression to a variety of
-distributions commonly assumed as noise effects in the response variable.
-As before, we are given a collection
-of records $(x_1, y_1)$, \ldots, $(x_n, y_n)$ where $x_i$ is a numerical vector of
-explanatory (feature) variables of size~\mbox{$\dim x_i = m$}, and $y_i$ is the
-response (dependent) variable observed for this vector.  GLMs assume that some
-linear combination of the features in~$x_i$ determines the \emph{mean}~$\mu_i$
-of~$y_i$, while the observed $y_i$ is a random outcome of a noise distribution
-$\Prob[y\mid \mu_i]\,$\footnote{$\Prob[y\mid \mu_i]$ is given by a density function
-if $y$ is continuous.}
-with that mean~$\mu_i$:
-\begin{equation*}
-x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j} 
-\,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim \Prob[y\mid \mu_i]
-\end{equation*}
-
-In linear regression the response mean $\mu_i$ \emph{equals} some linear combination
-over~$x_i$, denoted above by~$\eta_i$.
-In logistic regression with $y\in\{0, 1\}$ (Bernoulli) the mean of~$y$ is the same
-as $\Prob[y=1]$ and equals $1/(1+e^{-\eta_i})$, the logistic function of~$\eta_i$.
-In GLM, $\mu_i$ and $\eta_i$ can be related via any given smooth monotone function
-called the \emph{link function}: $\eta_i = g(\mu_i)$.  The unknown linear combination
-parameters $\beta_j$ are assumed to be the same for all records.
-
-The goal of the regression is to estimate the parameters~$\beta_j$ from the observed
-data.  Once the~$\beta_j$'s are accurately estimated, we can make predictions
-about~$y$ for a new feature vector~$x$.  To do so, compute $\eta$ from~$x$ and use
-the inverted link function $\mu = g^{-1}(\eta)$ to compute the mean $\mu$ of~$y$;
-then use the distribution $\Prob[y\mid \mu]$ to make predictions about~$y$.
-Both $g(\mu)$ and $\Prob[y\mid \mu]$ are user-provided.  Our GLM script supports
-a standard set of distributions and link functions; see below for details.
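-
-As a concrete illustration (a minimal NumPy sketch, not part of SystemML; all
-names are illustrative), prediction for a Poisson GLM with the log link
-$\eta = \log\mu$ amounts to:
-\begin{verbatim}
-import numpy as np
-
-def predict_poisson_log(X, beta, beta0=0.0):
-    """Predicted means for a Poisson GLM with log link."""
-    eta = beta0 + X @ beta   # linear combination eta_i
-    return np.exp(eta)       # inverse link: mu_i = g^{-1}(eta_i) = exp(eta_i)
-
-X = np.array([[1.0, 2.0], [0.5, 1.5]])
-print(predict_poisson_log(X, np.array([0.3, -0.1]), beta0=0.2))
-\end{verbatim}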
-
-\smallskip
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}GLM.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} Y=}path/file
-{\tt{} B=}path/file
-{\tt{} fmt=}format
-{\tt{} O=}path/file
-{\tt{} Log=}path/file
-{\tt{} dfam=}int
-{\tt{} vpow=}double
-{\tt{} link=}int
-{\tt{} lpow=}double
-{\tt{} yneg=}double
-{\tt{} icpt=}int
-{\tt{} reg=}double
-{\tt{} tol=}double
-{\tt{} disp=}double
-{\tt{} moi=}int
-{\tt{} mii=}int
-
-}
-
-\smallskip
-\noindent{\bf Arguments}
-\begin{Description}
-\item[{\tt X}:]
-Location (on HDFS) to read the matrix of feature vectors; each row constitutes
-an example.
-\item[{\tt Y}:]
-Location to read the response matrix, which may have 1 or 2 columns
-\item[{\tt B}:]
-Location to store the estimated regression parameters (the $\beta_j$'s), with the
-intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available
-\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-see read/write functions in SystemML Language Reference for details.
-\item[{\tt O}:] (default:\mbox{ }{\tt " "})
-Location to write certain summary statistics described in Table~\ref{table:GLM:stats};
-by default they are printed to the standard output.
-\item[{\tt Log}:] (default:\mbox{ }{\tt " "})
-Location to store iteration-specific variables for monitoring and debugging purposes,
-see Table~\ref{table:GLM:log} for details.
-\item[{\tt dfam}:] (default:\mbox{ }{\tt 1})
-Distribution family code to specify $\Prob[y\mid \mu]$, see Table~\ref{table:commonGLMs}:\\
-{\tt 1} = power distributions with $\Var(y) = a\mu^q$;
-{\tt 2} = binomial or Bernoulli
-\item[{\tt vpow}:] (default:\mbox{ }{\tt 0.0})
-When {\tt dfam=1}, this provides the~$q$ in $\Var(y) = a\mu^q$, the power
-dependence of the variance of~$y$ on its mean.  In particular, use:\\
-{\tt 0.0} = Gaussian,
-{\tt 1.0} = Poisson,
-{\tt 2.0} = Gamma,
-{\tt 3.0} = inverse Gaussian
-\item[{\tt link}:] (default:\mbox{ }{\tt 0})
-Link function code to determine the link function~$\eta = g(\mu)$:\\
-{\tt 0} = canonical link (depends on the distribution family), see Table~\ref{table:commonGLMs};\\
-{\tt 1} = power functions,
-{\tt 2} = logit,
-{\tt 3} = probit,
-{\tt 4} = cloglog,
-{\tt 5} = cauchit
-\item[{\tt lpow}:] (default:\mbox{ }{\tt 1.0})
-When {\tt link=1}, this provides the~$s$ in $\eta = \mu^s$, the power link
-function; {\tt lpow=0.0} gives the log link $\eta = \log\mu$.  Common power links:\\
-{\tt -2.0} = $1/\mu^2$,
-{\tt -1.0} = reciprocal,
-{\tt 0.0} = log,
-{\tt 0.5} = sqrt,
-{\tt 1.0} = identity
-\item[{\tt yneg}:] (default:\mbox{ }{\tt 0.0})
-When {\tt dfam=2} and the response matrix $Y$ has 1~column,
-this specifies the $y$-value used for the Bernoulli ``No'' label.
-All other $y$-values are treated as the ``Yes'' label.
-For example, {\tt yneg=-1.0} may be used when $y\in\{-1, 1\}$;
-either {\tt yneg=1.0} or {\tt yneg=2.0} may be used when $y\in\{1, 2\}$.
-\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
-Intercept and shifting/rescaling of the features in~$X$:\\
-{\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\
-{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
-{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
-\item[{\tt reg}:] (default:\mbox{ }{\tt 0.0})
-L2-regularization parameter (lambda)
-\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
-Tolerance (epsilon) used in the convergence criterion: we terminate the outer iterations
-when the deviance changes by less than this factor; see below for details
-\item[{\tt disp}:] (default:\mbox{ }{\tt 0.0})
-Dispersion parameter, or {\tt 0.0} to estimate it from data
-\item[{\tt moi}:] (default:\mbox{ }{\tt 200})
-Maximum number of outer (Fisher scoring) iterations
-\item[{\tt mii}:] (default:\mbox{ }{\tt 0})
-Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum
-limit provided
-\end{Description}
-
-
-\begin{table}[t]\small\centerline{%
-\begin{tabular}{|ll|}
-\hline
-Name & Meaning \\
-\hline
-{\tt TERMINATION\_CODE}  & A positive integer indicating success/failure as follows: \\
-                         & $1 = {}$Converged successfully;
-                           $2 = {}$Maximum \# of iterations reached; \\
-                         & $3 = {}$Input ({\tt X}, {\tt Y}) out of range;
-                           $4 = {}$Distribution/link not supported \\
-{\tt BETA\_MIN}          & Smallest beta value (regression coefficient), excluding the intercept \\
-{\tt BETA\_MIN\_INDEX}   & Column index for the smallest beta value \\
-{\tt BETA\_MAX}          & Largest beta value (regression coefficient), excluding the intercept \\
-{\tt BETA\_MAX\_INDEX}   & Column index for the largest beta value \\
-{\tt INTERCEPT}          & Intercept value, or NaN if there is no intercept (if {\tt icpt=0}) \\
-{\tt DISPERSION}         & Dispersion used to scale deviance, provided in {\tt disp} input argument \\
-                         & or estimated (same as {\tt DISPERSION\_EST}) if {\tt disp} argument is${} \leq 0$ \\
-{\tt DISPERSION\_EST}    & Dispersion estimated from the dataset \\
-{\tt DEVIANCE\_UNSCALED} & Deviance from the saturated model, assuming dispersion${} = 1.0$ \\
-{\tt DEVIANCE\_SCALED}   & Deviance from the saturated model, scaled by {\tt DISPERSION} value \\
-\hline
-\end{tabular}}
-\caption{Besides~$\beta$, the GLM regression script computes a few summary statistics listed above.
-They are provided in CSV format, one comma-separated name-value pair per line.}
-\label{table:GLM:stats}
-\end{table}
-
-
-
-
-
-
-\begin{table}[t]\small\centerline{%
-\begin{tabular}{|ll|}
-\hline
-Name & Meaning \\
-\hline
-{\tt NUM\_CG\_ITERS}     & Number of inner (Conj.\ Gradient) iterations in this outer iteration \\
-{\tt IS\_TRUST\_REACHED} & $1 = {}$trust region boundary was reached, $0 = {}$otherwise \\
-{\tt POINT\_STEP\_NORM}  & L2-norm of iteration step from old point ($\beta$-vector) to new point \\
-{\tt OBJECTIVE}          & The loss function we minimize (negative partial log-likelihood) \\
-{\tt OBJ\_DROP\_REAL}    & Reduction in the objective during this iteration, actual value \\
-{\tt OBJ\_DROP\_PRED}    & Reduction in the objective predicted by a quadratic approximation \\
-{\tt OBJ\_DROP\_RATIO}   & Actual-to-predicted reduction ratio, used to update the trust region \\
-{\tt GRADIENT\_NORM}     & L2-norm of the loss function gradient (omitted if point is rejected) \\
-{\tt LINEAR\_TERM\_MIN}  & The minimum value of $X \pxp \beta$, used to check for overflows \\
-{\tt LINEAR\_TERM\_MAX}  & The maximum value of $X \pxp \beta$, used to check for overflows \\
-{\tt IS\_POINT\_UPDATED} & $1 = {}$new point accepted; $0 = {}$new point rejected, old point restored \\
-{\tt TRUST\_DELTA}       & Updated trust region size, the ``delta'' \\
-\hline
-\end{tabular}}
-\caption{
-The {\tt Log} file for GLM regression contains the above \mbox{per-}iteration
-variables in CSV format, each line containing the triple (Name, Iteration\#, Value), with Iteration\#
-being~0 for initial values.}
-\label{table:GLM:log}
-\end{table}
-
-\begin{table}[t]\hfil
-\begin{tabular}{|ccccccc|}
-\hline
-\multicolumn{4}{|c}{INPUT PARAMETERS}              & Distribution  & Link      & Cano- \\
-{\tt dfam} & {\tt vpow} & {\tt link} & {\tt\ lpow} & family        & function  & nical?\\
-\hline
-{\tt 1}    & {\tt 0.0}  & {\tt 1}    & {\tt -1.0}  & Gaussian      & inverse   &       \\
-{\tt 1}    & {\tt 0.0}  & {\tt 1}    & {\tt\ 0.0}  & Gaussian      & log       &       \\
-{\tt 1}    & {\tt 0.0}  & {\tt 1}    & {\tt\ 1.0}  & Gaussian      & identity  & Yes   \\
-{\tt 1}    & {\tt 1.0}  & {\tt 1}    & {\tt\ 0.0}  & Poisson       & log       & Yes   \\
-{\tt 1}    & {\tt 1.0}  & {\tt 1}    & {\tt\ 0.5}  & Poisson       & sq.root   &       \\
-{\tt 1}    & {\tt 1.0}  & {\tt 1}    & {\tt\ 1.0}  & Poisson       & identity  &       \\
-{\tt 1}    & {\tt 2.0}  & {\tt 1}    & {\tt -1.0}  & Gamma         & inverse   & Yes   \\
-{\tt 1}    & {\tt 2.0}  & {\tt 1}    & {\tt\ 0.0}  & Gamma         & log       &       \\
-{\tt 1}    & {\tt 2.0}  & {\tt 1}    & {\tt\ 1.0}  & Gamma         & identity  &       \\
-{\tt 1}    & {\tt 3.0}  & {\tt 1}    & {\tt -2.0}  & Inverse Gauss & $1/\mu^2$ & Yes   \\
-{\tt 1}    & {\tt 3.0}  & {\tt 1}    & {\tt -1.0}  & Inverse Gauss & inverse   &       \\
-{\tt 1}    & {\tt 3.0}  & {\tt 1}    & {\tt\ 0.0}  & Inverse Gauss & log       &       \\
-{\tt 1}    & {\tt 3.0}  & {\tt 1}    & {\tt\ 1.0}  & Inverse Gauss & identity  &       \\
-\hline
-{\tt 2}    & {\tt  *}   & {\tt 1}    & {\tt\ 0.0}  & Binomial      & log       &       \\
-{\tt 2}    & {\tt  *}   & {\tt 1}    & {\tt\ 0.5}  & Binomial      & sq.root   &       \\
-{\tt 2}    & {\tt  *}   & {\tt 2}    & {\tt\  *}   & Binomial      & logit     & Yes   \\
-{\tt 2}    & {\tt  *}   & {\tt 3}    & {\tt\  *}   & Binomial      & probit    &       \\
-{\tt 2}    & {\tt  *}   & {\tt 4}    & {\tt\  *}   & Binomial      & cloglog   &       \\
-{\tt 2}    & {\tt  *}   & {\tt 5}    & {\tt\  *}   & Binomial      & cauchit   &       \\
-\hline
-\end{tabular}\hfil
-\caption{Common GLM distribution families and link functions.
-(Here ``{\tt *}'' stands for ``any value.'')}
-\label{table:commonGLMs}
-\end{table}
-
-\noindent{\bf Details}
-\smallskip
-
-In GLM, the noise distribution $\Prob[y\mid \mu]$ of the response variable~$y$
-given its mean~$\mu$ is restricted to have the \emph{exponential family} form
-\begin{equation}
-Y \sim\, \Prob[y\mid \mu] \,=\, \exp\left(\frac{y\theta - b(\theta)}{a}
-+ c(y, a)\right),\,\,\textrm{where}\,\,\,\mu = \E(Y) = b'(\theta).
-\label{eqn:GLM}
-\end{equation}
-Changing the mean in such a distribution simply multiplies all \mbox{$\Prob[y\mid \mu]$}
-by~$e^{\,y\hspace{0.2pt}\theta/a}$ and rescales them so that they again integrate to~1.
-Parameter $\theta$ is called \emph{canonical}, and the function $\theta = b'^{\,-1}(\mu)$
-that relates it to the mean is called the~\emph{canonical link}; constant~$a$ is called
-\emph{dispersion} and rescales the variance of~$y$.  Many common distributions can be put
-into this form, see Table~\ref{table:commonGLMs}.  The canonical parameter~$\theta$
-is often chosen to coincide with~$\eta$, the linear combination of the regression features;
-other choices for~$\eta$ are possible too.
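-
-For a concrete instance of form~(\ref{eqn:GLM}), take the Poisson distribution
-with mean~$\mu$: setting $a = 1$, $\theta = \log\mu$, $b(\theta) = e^{\theta}$,
-and $c(y, a) = -\log (y!)$ gives
-\begin{equation*}
-\Prob[y\mid \mu] \,=\, \frac{\mu^y e^{-\mu}}{y!}
-\,=\, \exp\big(y\log\mu \,-\, e^{\log\mu} \,-\, \log (y!)\big),
-\end{equation*}
-and indeed $\E(Y) = b'(\theta) = e^{\theta} = \mu$, so the canonical link is
-$\theta = \log\mu$.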
-
-Rather than specifying the canonical link, GLM distributions are commonly defined
-by their variance $\Var(y)$ as a function of the mean~$\mu$.  It can be shown
-from Eq.~(\ref{eqn:GLM}) that $\Var(y) = a\,b''(\theta) = a\,b''(b'^{\,-1}(\mu))$.
-For example, for the Bernoulli distribution $\Var(y) = \mu(1-\mu)$, for the Poisson
-distribution \mbox{$\Var(y) = \mu$}, and for the Gaussian distribution
-$\Var(y) = a\cdot 1 = \sigma^2$.
-It turns out that for many common distributions $\Var(y) = a\mu^q$, a power function.
-We support all distributions where $\Var(y) = a\mu^q$, as well as the Bernoulli and
-the binomial distributions.
-
-For distributions with $\Var(y) = a\mu^q$ the canonical link is also a power function,
-namely $\theta = \mu^{1-q}/(1-q)$, except for the Poisson ($q = 1$) whose canonical link is
-$\theta = \log\mu$.  We support all power link functions in the form $\eta = \mu^s$,
-dropping any constant factor, with $\eta = \log\mu$ for $s=0$.  The binomial distribution
-has its own family of link functions, which includes logit (the canonical link),
-probit, cloglog, and cauchit (see Table~\ref{table:binomial_links}); we support these
-only for the binomial and Bernoulli distributions.  Links and distributions are specified
-via four input parameters: {\tt dfam}, {\tt vpow}, {\tt link}, and {\tt lpow} (see
-Table~\ref{table:commonGLMs}).
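-
-For instance, the Gamma distribution has $q = 2$, hence its canonical link is
-$\theta = \mu^{1-2}/(1-2) = -1/\mu$; dropping the constant factor $-1$ yields the
-power link $\eta = \mu^{-1}$, i.e.\ {\tt link=1} with {\tt lpow=-1.0}, which
-Table~\ref{table:commonGLMs} marks as canonical for Gamma.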
-
-\begin{table}[t]\hfil
-\begin{tabular}{|cc|cc|}
-\hline
-Name & Link function & Name & Link function \\
-\hline
-Logit   & $\displaystyle \eta = \log\big(\mu / (1 - \mu)\big)^{\mathstrut}$ &
-Cloglog & $\displaystyle \eta = \log \big(\!- \log(1 - \mu)\big)^{\mathstrut}$ \\
-Probit  & $\displaystyle \mu  = \frac{1}{\sqrt{2\pi}}\int\nolimits_{-\infty_{\mathstrut}}^{\,\eta\mathstrut}
-          \!\!\!\!\! e^{-\frac{t^2}{2}} dt$ & 
-Cauchit & $\displaystyle \eta = \tan\pi(\mu - 1/2)$ \\
-\hline
-\end{tabular}\hfil
-\caption{The supported non-power link functions for the Bernoulli and the binomial
-distributions.  (Here $\mu$~is the Bernoulli mean.)}
-\label{table:binomial_links}
-\end{table}
-
-The observed response values are provided to the regression script as matrix~$Y$
-having 1 or 2 columns.  If a power distribution family is selected ({\tt dfam=1}),
-matrix $Y$ must have 1~column that provides $y_i$ for each~$x_i$ in the corresponding
-row of matrix~$X$.  When {\tt dfam=2} and $Y$ has 1~column, we assume the Bernoulli
-distribution for $y_i\in\{y_{\mathrm{neg}}, y_{\mathrm{pos}}\}$ with $y_{\mathrm{neg}}$
-from the input parameter {\tt yneg} and with $y_{\mathrm{pos}} \neq y_{\mathrm{neg}}$.  
-When {\tt dfam=2} and $Y$ has 2~columns, we assume the
-binomial distribution; for each row~$i$ in~$X$, cells $Y[i, 1]$ and $Y[i, 2]$ provide
-the positive and the negative binomial counts respectively.  Internally we convert
-the 1-column Bernoulli into the 2-column binomial with 0-versus-1 counts.
-
-We estimate the regression parameters via L2-regularized negative log-likelihood
-minimization:
-\begin{equation*}
-f(\beta; X, Y) \,\,=\,\, -\sum\nolimits_{i=1}^n \big(y_i\theta_i - b(\theta_i)\big)
-\,+\,(\lambda/2) \sum\nolimits_{j=1}^m \beta_j^2\,\,\to\,\,\min
-\end{equation*}
-where $\theta_i$ and $b(\theta_i)$ are from~(\ref{eqn:GLM}); note that $a$
-and $c(y, a)$ are constant w.r.t.~$\beta$ and can be ignored here.
-The canonical parameter $\theta_i$ depends on both $\beta$ and~$x_i$:
-\begin{equation*}
-\theta_i \,\,=\,\, b'^{\,-1}(\mu_i) \,\,=\,\, b'^{\,-1}\big(g^{-1}(\eta_i)\big) \,\,=\,\,
-\big(b'^{\,-1}\circ g^{-1}\big)\left(\beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j}\right)
-\end{equation*}
-The user-provided (via {\tt reg}) regularization coefficient $\lambda\geq 0$ can be used
-to mitigate overfitting and degeneracy in the data.  Note that the intercept is never
-regularized.
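-
-For intuition, here is a minimal NumPy sketch (illustrative only, not the
-{\tt GLM.dml} implementation) of this objective for the Poisson family with its
-canonical log link, where $\theta_i = \eta_i$ and $b(\theta) = e^{\theta}$:
-\begin{verbatim}
-import numpy as np
-
-def objective_poisson(beta0, beta, X, y, lam):
-    """L2-regularized negative partial log-likelihood, Poisson + log link."""
-    eta = beta0 + X @ beta                 # theta_i = eta_i (canonical link)
-    nll = -np.sum(y * eta - np.exp(eta))   # -(y_i theta_i - b(theta_i))
-    return nll + 0.5 * lam * np.sum(beta ** 2)  # intercept not regularized
-\end{verbatim}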
-
-Our iterative minimizer for $f(\beta; X, Y)$ uses the Fisher scoring approximation
-to the difference $\varDelta f(z; \beta) = f(\beta + z; X, Y) \,-\, f(\beta; X, Y)$,
-recomputed at each iteration:
-\begin{gather*}
-\varDelta f(z; \beta) \,\,\,\approx\,\,\, 1/2 \cdot z^T A z \,+\, G^T z,
-\,\,\,\,\textrm{where}\,\,\,\, A \,=\, X^T\!\diag(w) X \,+\, \lambda I\\
-\textrm{and}\,\,\,\,G \,=\, - X^T u \,+\, \lambda\beta,
-\,\,\,\textrm{with $n\,{\times}\,1$ vectors $w$ and $u$ given by}\\
-\forall\,i = 1\ldots n: \,\,\,\,
-w_i = \big[v(\mu_i)\,g'(\mu_i)^2\big]^{-1}
-\!\!\!\!\!\!,\,\,\,\,\,\,\,\,\,
-u_i = (y_i - \mu_i)\big[v(\mu_i)\,g'(\mu_i)\big]^{-1}
-\!\!\!\!\!\!.\,\,\,\,
-\end{gather*}
-Here $v(\mu_i)=\Var(y_i)/a$, the variance of $y_i$ as a function of the mean, and
-$g'(\mu_i) = d \eta_i/d \mu_i$ is the link function derivative.  The Fisher scoring
-approximation is minimized by trust-region conjugate gradient iterations (called the
-\emph{inner} iterations, with the Fisher scoring iterations as the \emph{outer}
-iterations), which approximately solve the following problem:
-\begin{equation*}
-1/2 \cdot z^T A z \,+\, G^T z \,\,\to\,\,\min\,\,\,\,\textrm{subject to}\,\,\,\,
-\|z\|_2 \leq \delta
-\end{equation*}
-The conjugate gradient algorithm closely follows Algorithm~7.2 on page~171
-of~\cite{Nocedal2006:Optimization}.
-The trust region size $\delta$ is initialized as $0.5\sqrt{m}\,/ \max\nolimits_i \|x_i\|_2$
-and updated as described in~\cite{Nocedal2006:Optimization}.
-The user can specify the maximum number of the outer and the inner iterations with
-input parameters {\tt moi} and {\tt mii}, respectively.  The Fisher scoring algorithm
-terminates successfully if $2|\varDelta f(z; \beta)| < (D_1(\beta) + 0.1)\hspace{0.5pt}\eps$
-where $\eps > 0$ is a tolerance supplied by the user via {\tt tol}, and $D_1(\beta)$ is
-the unit-dispersion deviance estimated as
-\begin{equation*}
-D_1(\beta) \,\,=\,\, 2 \cdot \big(\log \Prob[Y \mid \!
-\begin{smallmatrix}\textrm{saturated}\\\textrm{model}\end{smallmatrix}, a\,{=}\,1]
-\,\,-\,\,\log \Prob[Y \mid X, \beta, a\,{=}\,1]\,\big)
-\end{equation*}
-The deviance estimate is also produced as part of the output.  Once the Fisher scoring
-algorithm terminates, if requested by the user, we estimate the dispersion~$a$ from
-Eq.~\ref{eqn:GLM} using Pearson residuals
-\begin{equation}
-\hat{a} \,\,=\,\, \frac{1}{n-m}\cdot \sum_{i=1}^n \frac{(y_i - \mu_i)^2}{v(\mu_i)}
-\label{eqn:dispersion}
-\end{equation}
-and use it to adjust our deviance estimate: $D_{\hat{a}}(\beta) = D_1(\beta)/\hat{a}$.
-If input argument {\tt disp} is {\tt 0.0} we estimate $\hat{a}$, otherwise we use its
-value as~$a$.  Note that in~(\ref{eqn:dispersion}) $m$~counts the intercept
-($m \leftarrow m+1$) if it is present.
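-
-To make the Fisher scoring quantities concrete, the following NumPy sketch
-(illustrative only) builds $A$ and $G$ for the Poisson family with log link,
-where $v(\mu) = \mu$ and $g'(\mu) = 1/\mu$, so that $w_i = \mu_i$ and
-$u_i = y_i - \mu_i$; intercept handling is omitted for brevity:
-\begin{verbatim}
-import numpy as np
-
-def fisher_quadratic(beta, X, y, lam):
-    """A and G of the Fisher scoring model (Poisson family, log link)."""
-    mu = np.exp(X @ beta)
-    w = mu                       # w_i = [v(mu_i) g'(mu_i)^2]^{-1} = mu_i
-    u = y - mu                   # u_i = (y_i - mu_i)/[v(mu_i) g'(mu_i)]
-    A = X.T @ (w[:, None] * X) + lam * np.eye(X.shape[1])
-    G = -X.T @ u + lam * beta
-    return A, G                  # Delta f(z) ~ 0.5 z^T A z + G^T z
-\end{verbatim}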
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-The estimated regression parameters (the $\hat{\beta}_j$'s) are populated into
-a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}''
-input argument.  What this matrix contains, and its size, depends on the input
-argument {\tt icpt}, which specifies the user's intercept and rescaling choice:
-\begin{Description}
-\item[{\tt icpt=0}:] No intercept, matrix~$B$ has size $m\,{\times}\,1$, with
-$B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to~$m = {}$ncol$(X)$.
-\item[{\tt icpt=1}:] There is intercept, but no shifting/rescaling of~$X$; matrix~$B$
-has size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from 1 to~$m$,
-and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept coefficient.
-\item[{\tt icpt=2}:] There is intercept, and the features in~$X$ are shifted to
-mean${} = 0$ and rescaled to variance${} = 1$; then there are two versions of
-the~$\hat{\beta}_j$'s, one for the original features and another for the
-shifted/rescaled features.  Now matrix~$B$ has size $(m\,{+}\,1) \times 2$, with
-$B[\cdot, 1]$ for the original features and $B[\cdot, 2]$ for the shifted/rescaled
-features, in the above format.  Note that $B[\cdot, 2]$ is estimated iteratively,
-while $B[\cdot, 1]$ is obtained from $B[\cdot, 2]$ by complementary shifting and
-rescaling (see the sketch below).
-\end{Description}
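-A sketch of this complementary shifting and rescaling, assuming the features were
-standardized as $x'_j = (x_j - m_j)/s_j$ with column means~$m_j$ and standard
-deviations~$s_j$ (the exact convention in the script may differ):
-\begin{verbatim}
-import numpy as np
-
-def unscale_betas(beta_scaled, intercept_scaled, col_mean, col_std):
-    """Map betas fit on standardized features back to the original scale."""
-    beta = beta_scaled / col_std
-    intercept = intercept_scaled - np.sum(beta_scaled * col_mean / col_std)
-    return beta, intercept
-\end{verbatim}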
-Our script also estimates the dispersion $\hat{a}$ (or takes it from the user's input)
-and the deviances $D_1(\hat{\beta})$ and $D_{\hat{a}}(\hat{\beta})$, see
-Table~\ref{table:GLM:stats} for details.  A log file with variables monitoring
-progress through the iterations can also be made available, see Table~\ref{table:GLM:log}.
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-\hml -f GLM.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
-  B=/user/biadmin/B.mtx fmt=csv dfam=2 link=2 yneg=-1.0 icpt=2 reg=0.01 tol=0.00000001
-  disp=1.0 moi=100 mii=10 O=/user/biadmin/stats.csv Log=/user/biadmin/log.csv
-
-}
-
-\smallskip
-\noindent{\bf See Also}
-\smallskip
-
-In case of binary classification problems, consider using L2-SVM or binary logistic
-regression; for multiclass classification, use multiclass~SVM or multinomial logistic
-regression.  For the special cases of linear regression and logistic regression, it
-may be more efficient to use the corresponding specialized scripts instead of~GLM.

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/GLMpredict.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/GLMpredict.tex b/Algorithms Reference/GLMpredict.tex
deleted file mode 100644
index ceb249d..0000000
--- a/Algorithms Reference/GLMpredict.tex	
+++ /dev/null
@@ -1,474 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Regression Scoring and Prediction}
-
-\noindent{\bf Description}
-\smallskip
-
-Script {\tt GLM-predict.dml} is intended to cover all linear model based regressions,
-including linear regression, binomial and multinomial logistic regression, and GLM
-regressions (Poisson, gamma, binomial with probit link, etc.).  Having just one scoring
-script for all these regressions simplifies maintenance and enhancement while ensuring
-compatible interpretations for output statistics.
-
-The script performs two functions, prediction and scoring.  To perform prediction,
-the script takes two matrix inputs: a collection of records $X$ (without the response
-attribute) and the estimated regression parameters~$B$, also known as~$\beta$.  
-To perform scoring, in addition to $X$ and~$B$, the script takes the matrix of actual
-response values~$Y$ that are compared to the predictions made with $X$ and~$B$.  There are
-also other, non-matrix, input arguments that specify the model and the output format;
-see below for the full list.
-
-We assume that our test/scoring dataset is given by $n\,{\times}\,m$-matrix $X$ of
-numerical feature vectors, where each row~$x_i$ represents one feature vector of one
-record; we have \mbox{$\dim x_i = m$}.  Each record also includes the response
-variable~$y_i$ that may be numerical, single-label categorical, or multi-label categorical.
-A single-label categorical $y_i$ is an integer category label, one label per record;
-a multi-label $y_i$ is a vector of integer counts, one count for each possible label,
-which represents multiple single-label events (observations) for the same~$x_i$.  Internally
-we convert single-label categoricals into multi-label categoricals by replacing each
-label~$l$ with an indicator vector~$(0,\ldots,0,1_l,0,\ldots,0)$.  In prediction-only
-tasks the actual $y_i$'s are not needed by the script, but they are needed for scoring.
-
-To perform prediction, the script matrix-multiplies $X$ and $B$, adding the intercept
-if available, then applies the inverse of the model's link function.  
-All GLMs assume that the linear combination of the features in~$x_i$ and the betas
-in~$B$ determines the means~$\mu_i$ of the~$y_i$'s (in numerical or multi-label
-categorical form) with $\dim \mu_i = \dim y_i$.  The observed $y_i$ is assumed to follow
-a specified GLM family distribution $\Prob[y\mid \mu_i]$ with mean(s)~$\mu_i$:
-\begin{equation*}
-x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x_{i,j} 
-\,\,\,\,\mapsto\,\,\,\, \mu_i \,\,\,\,\mapsto \,\,\,\, y_i \sim \Prob[y\mid \mu_i]
-\end{equation*}
-If $y_i$ is numerical, the predicted mean $\mu_i$ is a real number.  Then our script's
-output matrix $M$ is the $n\,{\times}\,1$-vector of these means~$\mu_i$.
-Note that $\mu_i$ predicts the mean of $y_i$, not the actual~$y_i$.  For example,
-in the Poisson distribution the mean is usually fractional, but the actual~$y_i$ is
-always an integer.
-
-If $y_i$ is categorical, i.e.\ a vector of label counts for record~$i$, then $\mu_i$
-is a vector of non-negative real numbers, one number $\mu_{i,l}$ per label~$l$.
-In this case we divide the $\mu_{i,l}$ by their sum $\sum_l \mu_{i,l}$ to obtain
-predicted label probabilities~\mbox{$p_{i,l}\in [0, 1]$}.  The output matrix $M$ is
-the $n \times (k\,{+}\,1)$-matrix of these probabilities, where $n$ is the number of
-records and $k\,{+}\,1$ is the number of categories\footnote{We use $k+1$ because
-there are $k$ non-baseline categories and one baseline category, with regression
-parameters $B$ having $k$~columns.}.  Note again that we do not predict the labels
-themselves, nor their actual counts per record, but we predict the labels' probabilities. 
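-
-The normalization step is simple; a NumPy sketch (illustrative only):
-\begin{verbatim}
-import numpy as np
-
-def label_probabilities(Mu):
-    """Normalize an n x (k+1) matrix of non-negative label means into
-    per-record probabilities p_{i,l} that sum to 1 within each row."""
-    return Mu / Mu.sum(axis=1, keepdims=True)
-\end{verbatim}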
-
-Going from predicted probabilities to predicted labels, in the single-label categorical
-case, requires extra information such as the cost of false positive versus
-false negative errors.  For example, if there are 5 categories and we \emph{accurately}
-predicted their probabilities as $(0.1, 0.3, 0.15, 0.2, 0.25)$, just picking the
-highest-probability label would be wrong 70\% of the time, whereas picking the
-lowest-probability label might be right if, say, it represents a diagnosis of cancer
-or another rare and serious outcome.  Hence, we keep this step outside the scope of
-{\tt GLM-predict.dml} for now.
-
-\smallskip
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}GLM-predict.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} Y=}path/file
-{\tt{} B=}path/file
-{\tt{} M=}path/file
-{\tt{} O=}path/file
-{\tt{} dfam=}int
-{\tt{} vpow=}double
-{\tt{} link=}int
-{\tt{} lpow=}double
-{\tt{} disp=}double
-{\tt{} fmt=}format
-
-}
-
-\smallskip
-\noindent{\bf Arguments}
-\begin{Description}
-\item[{\tt X}:]
-Location (on HDFS) to read the $n\,{\times}\,m$-matrix $X$ of feature vectors, each row
-constitutes one feature vector (one record)
-\item[{\tt Y}:] (default:\mbox{ }{\tt " "})
-Location to read the response matrix $Y$ needed for scoring (but optional for prediction),
-with the following dimensions: \\
-    $n \:{\times}\: 1$: acceptable for all distributions ({\tt dfam=1} or {\tt 2} or {\tt 3}) \\
-    $n \:{\times}\: 2$: for binomial ({\tt dfam=2}) if given by (\#pos, \#neg) counts \\
-    $n \:{\times}\: (k\,{+}\,1)$: for multinomial ({\tt dfam=3}) if given by category counts
-\item[{\tt M}:] (default:\mbox{ }{\tt " "})
-Location to write, if requested, the matrix of predicted response means (for {\tt dfam=1}) or
-probabilities (for {\tt dfam=2} or {\tt 3}):\\
-    $n \:{\times}\: 1$: for power-type distributions ({\tt dfam=1}) \\
-    $n \:{\times}\: 2$: for binomial distribution ({\tt dfam=2}), col\#~2 is the ``No'' probability \\
-    $n \:{\times}\: (k\,{+}\,1)$: for multinomial logit ({\tt dfam=3}), col\#~$k\,{+}\,1$ is for the baseline
-\item[{\tt B}:]
-Location to read matrix $B$ of the \mbox{betas}, i.e.\ estimated GLM regression parameters,
-with the intercept at row\#~$m\,{+}\,1$ if available:\\
-    $\dim(B) \,=\, m \:{\times}\: k$: do not add intercept \\
-    $\dim(B) \,=\, (m\,{+}\,1) \:{\times}\: k$: add intercept as given by the last $B$-row \\
-    if $k > 1$, use only $B${\tt [, 1]} unless it is Multinomial Logit ({\tt dfam=3})
-\item[{\tt O}:] (default:\mbox{ }{\tt " "})
-Location to store the CSV-file with goodness-of-fit statistics defined in
-Table~\ref{table:GLMpred:stats}, the default is to print them to the standard output
-\item[{\tt dfam}:] (default:\mbox{ }{\tt 1})
-GLM distribution family code to specify the type of distribution $\Prob[y\,|\,\mu]$
-that we assume: \\
-{\tt 1} = power distributions with $\Var(y) = a\mu^q$, see Table~\ref{table:commonGLMs};\\
-{\tt 2} = binomial; 
-{\tt 3} = multinomial logit
-\item[{\tt vpow}:] (default:\mbox{ }{\tt 0.0})
-Power for variance defined as (mean)${}^{\textrm{power}}$ (ignored if {\tt dfam}$\,{\neq}\,1$):
-when {\tt dfam=1}, this provides the~$q$ in $\Var(y) = a\mu^q$, the power
-dependence of the variance of~$y$ on its mean.  In particular, use:\\
-{\tt 0.0} = Gaussian,
-{\tt 1.0} = Poisson,
-{\tt 2.0} = Gamma,
-{\tt 3.0} = inverse Gaussian
-\item[{\tt link}:] (default:\mbox{ }{\tt 0})
-Link function code to determine the link function~$\eta = g(\mu)$, ignored for
-multinomial logit ({\tt dfam=3}):\\
-{\tt 0} = canonical link (depends on the distribution family), see Table~\ref{table:commonGLMs};\\
-{\tt 1} = power functions,
-{\tt 2} = logit,
-{\tt 3} = probit,
-{\tt 4} = cloglog,
-{\tt 5} = cauchit
-\item[{\tt lpow}:] (default:\mbox{ }{\tt 1.0})
-Power for link function defined as (mean)${}^{\textrm{power}}$ (ignored if {\tt link}$\,{\neq}\,1$):
-when {\tt link=1}, this provides the~$s$ in $\eta = \mu^s$, the power link
-function; {\tt lpow=0.0} gives the log link $\eta = \log\mu$.  Common power links:\\
-{\tt -2.0} = $1/\mu^2$,
-{\tt -1.0} = reciprocal,
-{\tt 0.0} = log,
-{\tt 0.5} = sqrt,
-{\tt 1.0} = identity
-\item[{\tt disp}:] (default:\mbox{ }{\tt 1.0})
-Dispersion value, when available; must be positive
-\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-Matrix {\tt M} file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-see read/write functions in SystemML Language Reference for details.
-\end{Description}
-
-
-\begin{table}[t]\small\centerline{%
-\begin{tabular}{|lccl|}
-\hline
-Name & \hspace{-0.6em}CID\hspace{-0.5em} & \hspace{-0.3em}Disp?\hspace{-0.6em} & Meaning \\
-\hline
-{\tt LOGLHOOD\_Z}          &   & + & Log-likelihood $Z$-score (in st.\ dev.'s from the mean) \\
-{\tt LOGLHOOD\_Z\_PVAL}    &   & + & Log-likelihood $Z$-score p-value, two-sided \\
-{\tt PEARSON\_X2}          &   & + & Pearson residual $X^2$-statistic \\
-{\tt PEARSON\_X2\_BY\_DF}  &   & + & Pearson $X^2$ divided by degrees of freedom \\
-{\tt PEARSON\_X2\_PVAL}    &   & + & Pearson $X^2$ p-value \\
-{\tt DEVIANCE\_G2}         &   & + & Deviance from the saturated model $G^2$-statistic \\
-{\tt DEVIANCE\_G2\_BY\_DF} &   & + & Deviance $G^2$ divided by degrees of freedom \\
-{\tt DEVIANCE\_G2\_PVAL}   &   & + & Deviance $G^2$ p-value \\
-{\tt AVG\_TOT\_Y}          & + &   & $Y$-column average for an individual response value \\
-{\tt STDEV\_TOT\_Y}        & + &   & $Y$-column st.\ dev.\ for an individual response value \\
-{\tt AVG\_RES\_Y}          & + &   & $Y$-column residual average of $Y - \mathop{\mathrm{pred.}\,\mathrm{mean}}(Y|X)$ \\
-{\tt STDEV\_RES\_Y}        & + &   & $Y$-column residual st.\ dev.\ of $Y - \mathop{\mathrm{pred.}\,\mathrm{mean}}(Y|X)$ \\
-{\tt PRED\_STDEV\_RES}     & + & + & Model-predicted $Y$-column residual st.\ deviation\\
-{\tt PLAIN\_R2}            & + &   & Plain $R^2$ of $Y$-column residual with bias included \\
-{\tt ADJUSTED\_R2}         & + &   & Adjusted $R^2$ of $Y$-column residual w.\ bias included \\
-{\tt PLAIN\_R2\_NOBIAS}    & + &   & Plain $R^2$ of $Y$-column residual, bias subtracted \\
-{\tt ADJUSTED\_R2\_NOBIAS} & + &   & Adjusted $R^2$ of $Y$-column residual, bias subtracted \\
-\hline
-\end{tabular}}
-\caption{The above goodness-of-fit statistics are provided in CSV format, one per line, with four
-columns: (Name, [CID], [Disp?], Value).  The columns are: 
-``Name'' is the string identifier for the statistic, see the table;
-``CID'' is an optional integer value that specifies the $Y$-column index for \mbox{per-}column statistics
-(note that a bi-/multinomial one-column {\tt Y}-input is converted into multi-column);
-``Disp?'' is an optional Boolean value ({\tt TRUE} or {\tt FALSE}) that tells us
-whether or not scaling by the input dispersion parameter {\tt disp} has been applied to this
-statistic;
-``Value''  is the value of the statistic.}
-\label{table:GLMpred:stats}
-\end{table}
-
-\noindent{\bf Details}
-\smallskip
-
-The output matrix $M$ of predicted means (or probabilities) is computed by matrix-multiplying $X$
-with the first column of~$B$ or with the whole~$B$ in the multinomial case, adding the intercept
-if available (conceptually, appending an extra column of ones to~$X$); then applying the inverse
-of the model's link function.  The difference between ``means'' and ``probabilities'' in the
-categorical case becomes significant when there are ${\geq}\,2$ observations per record
-(with the multi-label records) or when the labels such as $-1$ and~$1$ are viewed and averaged
-as numerical response values (with the single-label records).  To avoid any \mbox{mix-up} or
-information loss, we separately return the predicted probability of each category label for each
-record.
-
-When the ``actual'' response values $Y$ are available, the summary statistics are computed
-and written out as described in Table~\ref{table:GLMpred:stats}.  Below we discuss each of
-these statistics in detail.  Note that in the categorical case (binomial and multinomial)
-$Y$ is internally represented as the matrix of observation counts for each label in each record,
-rather than just the label~ID for each record.  The input~$Y$ may already be a matrix of counts,
-in which case it is used as-is.  But if $Y$ is given as a vector of response labels, each
-response label is converted into an indicator vector $(0,\ldots,0,1_l,0,\ldots,0)$ where~$l$
-is the label~ID for this record.  All negative (e.g.~$-1$) or zero label~IDs are converted to
-the maximum label~ID plus one.  The largest label~ID is viewed as the ``baseline'' as explained
-in the section on Multinomial Logistic Regression.  We assume that there are $k\geq 1$
-non-baseline categories and one (last) baseline category.
-
-We also estimate residual variances for each response value; these are not output,
-but are used inside the summary statistics, both scaled and unscaled by the input dispersion
-parameter {\tt disp}, as described below.
-
-\smallskip
-{\tt LOGLHOOD\_Z} and {\tt LOGLHOOD\_Z\_PVAL} statistics measure how far the log-likelihood
-of~$Y$ deviates from its expected value according to the model.  The script implements them
-only for the binomial and the multinomial distributions, returning NaN for all other distributions.
-Pearson's~$X^2$ and deviance~$G^2$ often perform poorly with bi- and multinomial distributions
-due to low cell counts, hence we need this extra goodness-of-fit measure.  To compute these
-statistics, we use:
-\begin{Itemize}
-\item the $n\times (k\,{+}\,1)$-matrix~$Y$ of multi-label response counts, in which $y_{i,j}$
-is the number of times label~$j$ was observed in record~$i$;
-\item the model-estimated probability matrix~$P$ of the same dimensions that satisfies
-$\sum_{j=1}^{k+1} p_{i,j} = 1$ for all~$i=1,\ldots,n$ and where $p_{i,j}$ is the model
-probability of observing label~$j$ in record~$i$;
-\item the $n\,{\times}\,1$-vector $N$ where $N_i$ is the aggregated count of observations
-in record~$i$ (all $N_i = 1$ if each record has only one response label).
-\end{Itemize}
-We start by computing the multinomial log-likelihood of $Y$ given~$P$ and~$N$, as well as
-the expected log-likelihood given a random~$Y$ and the variance of this log-likelihood if
-$Y$ indeed follows the proposed distribution:
-\begin{align*}
-\ell (Y) \,\,&=\,\, \log \Prob[Y \,|\, P, N] \,\,=\,\, \sum_{i=1}^{n} \,\sum_{j=1}^{k+1}  \,y_{i,j}\log p_{i,j} \\
-\E_Y \ell (Y)  \,\,&=\,\, \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\mu_{i,j} \log p_{i,j} 
-    \,\,=\,\, \sum_{i=1}^{n}\, N_i \,\sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j} \\
-\Var_Y \ell (Y) \,&=\, \sum_{i=1}^{n} \,N_i \left(\sum_{j=1}^{k+1} \,p_{i,j} \big(\log p_{i,j}\big)^2
-    - \Bigg( \sum_{j=1}^{k+1} \,p_{i,j} \log p_{i,j}\Bigg) ^ {\!\!2\,} \right)
-\end{align*}
-Then we compute the $Z$-score as the difference between the actual and the expected
-log-likelihood $\ell(Y)$ divided by its expected standard deviation, and its two-sided
-p-value in the Normal distribution assumption ($\ell(Y)$ should approach normality due
-to the Central Limit Theorem):
-\begin{equation*}
-Z   \,=\, \frac {\ell(Y) - \E_Y \ell(Y)}{\sqrt{\Var_Y \ell(Y)}};\quad
-\mathop{\textrm{p-value}}(Z) \,=\, \Prob \Big[\,\big|\mathop{\textrm{Normal}}(0,1)\big| \, > \, |Z|\,\Big]
-\end{equation*}
-A low p-value would indicate ``underfitting'' if $Z\ll 0$ or ``overfitting'' if $Z\gg 0$.  Here
-``overfitting'' means that higher-probability labels occur more often than their probabilities
-suggest.
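-
-A NumPy/SciPy sketch of these computations (illustrative only; it also folds in
-the dispersion scaling discussed next, with {\tt disp=1.0} giving the unscaled
-$Z$-score):
-\begin{verbatim}
-import numpy as np
-from scipy.stats import norm
-
-def loglhood_z(Y, P, N, disp=1.0):
-    """Z-score of the multinomial log-likelihood and its two-sided p-value."""
-    logP = np.log(P)
-    ell = np.sum(Y * logP)                        # actual log-likelihood
-    e_ell = np.sum(N * np.sum(P * logP, axis=1))  # expected log-likelihood
-    v_ell = np.sum(N * (np.sum(P * logP ** 2, axis=1)
-                        - np.sum(P * logP, axis=1) ** 2))
-    z = (ell - e_ell) / np.sqrt(disp * v_ell)
-    return z, 2.0 * norm.sf(abs(z))               # P[|Normal(0,1)| > |z|]
-\end{verbatim}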
-
-We also apply the dispersion input ({\tt disp}) to compute the ``scaled'' version of the $Z$-score
-and its p-value.  Since $\ell(Y)$ is a linear function of~$Y$, multiplying the GLM-predicted
-variance of~$Y$ by {\tt disp} results in multiplying $\Var_Y \ell(Y)$ by the same {\tt disp}.  This, in turn,
-translates into dividing the $Z$-score by the square root of the dispersion:
-\begin{equation*}
-Z_{\texttt{disp}}  \,=\, \big(\ell(Y) \,-\, \E_Y \ell(Y)\big) \,\big/\, \sqrt{\texttt{disp}\cdot\Var_Y \ell(Y)}
-\,=\, Z / \sqrt{\texttt{disp}}
-\end{equation*}
-Finally, we recalculate the p-value with this new $Z$-score.
-
-\smallskip
-{\tt PEARSON\_X2}, {\tt PEARSON\_X2\_BY\_DF}, and {\tt PEARSON\_X2\_PVAL}:
-Pearson's residual $X^2$-statistic is a commonly used goodness-of-fit measure for linear models~\cite{McCullagh1989:GLM}.
-The idea is to measure how well the model-predicted means and variances match the actual behavior
-of response values.  For each record $i$, we estimate the mean $\mu_i$ and the variance $v_i$
-(or $\texttt{disp}\cdot v_i$) and use them to normalize the residual: 
-$r_i = (y_i - \mu_i) / \sqrt{v_i}$.  These normalized residuals are then squared, aggregated
-by summation, and tested against an appropriate $\chi^2$~distribution.  The computation of~$X^2$
-is slightly different for categorical data (bi- and multinomial) than it is for numerical data,
-since $y_i$ has multiple correlated dimensions~\cite{McCullagh1989:GLM}:
-\begin{equation*}
-X^2\,\textrm{(numer.)} \,=\,  \sum_{i=1}^{n}\, \frac{(y_i - \mu_i)^2}{v_i};\quad
-X^2\,\textrm{(categ.)} \,=\,  \sum_{i=1}^{n}\, \sum_{j=1}^{k+1} \,\frac{(y_{i,j} - N_i 
-\hspace{0.5pt} p_{i,j})^2}{N_i \hspace{0.5pt} p_{i,j}}
-\end{equation*}
-The number of degrees of freedom~\#d.f.\ for the $\chi^2$~distribution is $n - m$ for numerical data and
-$(n - m)k$ for categorical data, where $k = \mathop{\texttt{ncol}}(Y) - 1$.  Given the dispersion
-parameter {\tt disp}, the $X^2$ statistic is scaled by division: \mbox{$X^2_{\texttt{disp}} = X^2 / \texttt{disp}$}.
-If the dispersion is accurate, $X^2 / \texttt{disp}$ should be close to~\#d.f.  In fact, $X^2 / \textrm{\#d.f.}$
-over the \emph{training} data is the dispersion estimator used in our {\tt GLM.dml} script, 
-see~(\ref{eqn:dispersion}).  Here we provide $X^2 / \textrm{\#d.f.}$ and $X^2_{\texttt{disp}} / \textrm{\#d.f.}$
-as {\tt PEARSON\_X2\_BY\_DF} to enable dispersion comparison between the training data and
-the test data.
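-
-A sketch of the categorical computation (illustrative only):
-\begin{verbatim}
-import numpy as np
-from scipy.stats import chi2
-
-def pearson_x2_categorical(Y, P, N, m, disp=1.0):
-    """Pearson X^2, X^2 per degree of freedom, and p-value (categorical)."""
-    mu = N[:, None] * P                        # predicted counts N_i * p_{i,j}
-    x2 = np.sum((Y - mu) ** 2 / mu) / disp     # scaled by the dispersion
-    df = (Y.shape[0] - m) * (Y.shape[1] - 1)   # #d.f. = (n - m) * k
-    return x2, x2 / df, chi2.sf(x2, df)
-\end{verbatim}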
-
-NOTE: For categorical data, both Pearson's $X^2$ and the deviance $G^2$ are unreliable (i.e.\ do not
-approach the $\chi^2$~distribution) unless the predicted means of multi-label counts
-$\mu_{i,j} = N_i \hspace{0.5pt} p_{i,j}$ are fairly large: all ${\geq}\,1$ and 80\% are
-at least~$5$~\cite{Cochran1954:chisq}.  They should not be used for ``one label per record'' categoricals.
-
-\smallskip
-{\tt DEVIANCE\_G2}, {\tt DEVIANCE\_G2\_BY\_DF}, and {\tt DEVIANCE\_G2\_PVAL}:
-Deviance $G^2$ is the log of the likelihood ratio between the ``saturated'' model and the
-linear model being tested for the given dataset, multiplied by two:
-\begin{equation}
-G^2 \,=\, 2 \,\log \frac{\Prob[Y \mid \textrm{saturated model}\hspace{0.5pt}]}%
-{\Prob[Y \mid \textrm{tested linear model}\hspace{0.5pt}]}
-\label{eqn:GLMpred:deviance}
-\end{equation}
-The ``saturated'' model sets the mean $\mu_i^{\mathrm{sat}}$ to equal~$y_i$ for every record
-(for categorical data, $p^{\mathrm{sat}}_{i,j} = y_{i,j} / N_i$), which represents the ``perfect fit.''
-For records with $y_{i,j} \in \{0, N_i\}$ or otherwise at a boundary, by continuity we set
-$0 \log 0 = 0$.  The GLM~likelihood functions defined in~(\ref{eqn:GLM}) simplify in
-ratio~(\ref{eqn:GLMpred:deviance}) because the term $c(y, a)$ cancels out, being the same
-in both models.
-
-The log of a likelihood ratio between two nested models, times two, is known to approach
-a $\chi^2$ distribution as $n\to\infty$ if both models have fixed parameter spaces.  
-But this is not the case for the ``saturated'' model: it adds more parameters with each record.  
-In practice, however, $\chi^2$ distributions are used to compute the p-value of~$G^2$~\cite{McCullagh1989:GLM}.  
-The number of degrees of freedom~\#d.f.\ and the treatment of dispersion are the same as for
-Pearson's~$X^2$, see above.
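-
-For the categorical case, $G^2$ reduces to the familiar likelihood-ratio form
-$G^2 = 2\sum_{i,j} y_{i,j}\log\big(y_{i,j}/(N_i\hspace{0.5pt}p_{i,j})\big)$,
-since the saturated model has $p^{\mathrm{sat}}_{i,j} = y_{i,j}/N_i$.  A NumPy
-sketch (illustrative only):
-\begin{verbatim}
-import numpy as np
-
-def deviance_g2_categorical(Y, P, N):
-    """Deviance G^2 for categorical data; 0*log(0) is taken as 0."""
-    mu = N[:, None] * P                        # model-predicted counts
-    with np.errstate(divide="ignore", invalid="ignore"):
-        terms = np.where(Y > 0, Y * np.log(Y / mu), 0.0)
-    return 2.0 * np.sum(terms)
-\end{verbatim}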
-
-\Paragraph{Column-wise statistics.}  The rest of the statistics are computed separately
-for each column of~$Y$.  As explained above, $Y$~has two or more columns in the bi- and multinomial cases,
-either at input or after conversion.  Moreover, each $y_{i,j}$ in record~$i$ with $N_i \geq 2$ is
-counted as $N_i$ separate observations $y_{i,j,l}$ of 0 or~1 (where $l=1,\ldots,N_i$) with
-$y_{i,j}$~ones and $N_i-y_{i,j}$ zeros.
-For power distributions, including linear regression, $Y$~has only one column and all
-$N_i = 1$, so the statistics are computed for all~$Y$ with each record counted once.
-Below we denote $N = \sum_{i=1}^n N_i \,\geq n$.
-Here are the total average and the residual average (residual bias) of~$y_{i,j,l}$ for each $Y$-column:
-\begin{equation*}
-\texttt{AVG\_TOT\_Y}_j   \,=\, \frac{1}{N} \sum_{i=1}^n  y_{i,j}; \quad
-\texttt{AVG\_RES\_Y}_j   \,=\, \frac{1}{N} \sum_{i=1}^n \, (y_{i,j} - \mu_{i,j})
-\end{equation*}
-Dividing by~$N$ (rather than~$n$) gives the averages for~$y_{i,j,l}$ (rather than~$y_{i,j}$).
-The total variance, and the standard deviation, for individual observations~$y_{i,j,l}$ is
-estimated from the total variance for response values~$y_{i,j}$ using the independence assumption:
-$\Var y_{i,j} = \Var \sum_{l=1}^{N_i} y_{i,j,l} = \sum_{l=1}^{N_i} \Var y_{i,j,l}$.
-This allows us to estimate the sum of squares for~$y_{i,j,l}$ via the sum of squares for~$y_{i,j}$:
-\begin{equation*}
-\texttt{STDEV\_TOT\_Y}_j \,=\, 
-\Bigg[\frac{1}{N-1} \sum_{i=1}^n  \Big( y_{i,j} -  \frac{N_i}{N} \sum_{i'=1}^n  y_{i'\!,j}\Big)^2\Bigg]^{1/2}
-\end{equation*}
-Analogously, we estimate the standard deviation of the residual $y_{i,j,l} - \mu_{i,j,l}$:
-\begin{equation*}
-\texttt{STDEV\_RES\_Y}_j \,=\, 
-\Bigg[\frac{1}{N-m'} \,\sum_{i=1}^n  \Big( y_{i,j} - \mu_{i,j} -  \frac{N_i}{N} \sum_{i'=1}^n  (y_{i'\!,j} - \mu_{i'\!,j})\Big)^2\Bigg]^{1/2}
-\end{equation*}
-Here $m'=m$ if $m$ includes the intercept as a feature and $m'=m+1$ if it does not.
-The estimated standard deviations can be compared to the model-predicted residual standard deviation
-computed from the predicted means by the GLM variance formula and scaled by the dispersion:
-\begin{equation*}
-\texttt{PRED\_STDEV\_RES}_j \,=\, \Big[\frac{\texttt{disp}}{N} \, \sum_{i=1}^n \, v(\mu_{i,j})\Big]^{1/2}
-\end{equation*}
-We also compute the $R^2$ statistics for each column of~$Y$, see Table~\ref{table:GLMpred:R2} for details.
-We compute two versions of~$R^2$: in one version the residual sum-of-squares (RSS) includes any bias in
-the residual that might be present (due to the lack of, or inaccuracy in, the intercept); in the other
-version of~RSS the bias is subtracted by ``centering'' the residual.  In both cases we subtract the bias in the total
-sum-of-squares (in the denominator), and $m'$ equals $m$~with the intercept or $m+1$ without the intercept.
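-
-A per-column NumPy sketch of the two plain $R^2$ variants (illustrative only;
-{\tt y}, {\tt mu}, and {\tt N\_i} are the $n\,{\times}\,1$ column slices):
-\begin{verbatim}
-import numpy as np
-
-def plain_r2_column(y, mu, N_i):
-    """PLAIN_R2 and PLAIN_R2_NOBIAS for one Y-column."""
-    N = N_i.sum()
-    y_bar = (N_i / N) * y.sum()    # per-record share of the column total
-    tss = np.sum((y - y_bar) ** 2) # bias-subtracted total sum of squares
-    res = y - mu
-    rss_bias = np.sum(res ** 2)    # RSS including the residual bias
-    rss_nobias = np.sum((res - (N_i / N) * res.sum()) ** 2)
-    return 1 - rss_bias / tss, 1 - rss_nobias / tss
-\end{verbatim}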
-
-\begin{table}[t]\small\centerline{%
-\begin{tabular}{|c|c|}
-\multicolumn{2}{c}{$R^2$ where the residual sum-of-squares includes the bias contribution:} \\
-\hline
-\multicolumn{1}{|l|}{\tt PLAIN\_R2${}_j \,\,= {}$} & \multicolumn{1}{l|}{\tt ADJUSTED\_R2${}_j \,\,= {}$} \\
-$ \displaystyle 1 - 
-\frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}%
-{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n  \! y_{i'\!,j} \Big)^{\! 2}} $ & 
-$ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m}}  \, 
-\frac{\sum\limits_{i=1}^n \,(y_{i,j} - \mu_{i,j})^2}%
-{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n  \! y_{i'\!,j} \Big)^{\! 2}} $ \\
-\hline
-\multicolumn{2}{c}{} \\
-\multicolumn{2}{c}{$R^2$ where the residual sum-of-squares is centered so that the bias is subtracted:} \\
-\hline
-\multicolumn{1}{|l|}{\tt PLAIN\_R2\_NOBIAS${}_j \,\,= {}$} & \multicolumn{1}{l|}{\tt ADJUSTED\_R2\_NOBIAS${}_j \,\,= {}$} \\
-$ \displaystyle 1 - 
-\frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\!
-    \sum\limits_{i'=1}^n  (y_{i'\!,j} \,{-}\, \mu_{i'\!,j}) \Big)^{\! 2}}%
-{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n  \! y_{i'\!,j} \Big)^{\! 2}} $ &
-$ \displaystyle 1 - {\textstyle\frac{N_{\mathstrut} - 1}{N^{\mathstrut} - m'}} \, 
-\frac{\sum\limits_{i=1}^n \Big(y_{i,j} \,{-}\, \mu_{i,j} \,{-}\, \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! 
-    \sum\limits_{i'=1}^n  (y_{i'\!,j} \,{-}\, \mu_{i'\!,j}) \Big)^{\! 2}}%
-{\sum\limits_{i=1}^n \Big(y_{i,j} - \frac{N_{i\mathstrut}}{N^{\mathstrut}}\!\! \sum\limits_{i'=1}^n  \! y_{i'\!,j} \Big)^{\! 2}} $ \\
-\hline
-\end{tabular}}
-\caption{The $R^2$ statistics we compute in {\tt GLM-predict.dml}}
-\label{table:GLMpred:R2}
-\end{table}
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-The matrix of predicted means (if the response is numerical) or probabilities (if the response
-is categorical), see ``Description'' subsection above for more information.  Given {\tt Y}, we
-return some statistics in CSV format as described in Table~\ref{table:GLMpred:stats} and in the
-above text.
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-Note that in the examples below the value of the ``{\tt disp}'' input argument
-is set arbitrarily.  The correct dispersion value should be computed from the training
-data during model estimation, or omitted if unknown (which sets it to~{\tt 1.0}).
-
-\smallskip\noindent
-Linear regression example:
-\par\hangindent=\parindent\noindent{\tt
-\hml -f GLM-predict.dml -nvargs dfam=1 vpow=0.0 link=1 lpow=1.0 disp=5.67
-  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
-  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
-
-}\smallskip\noindent
-Linear regression example, prediction only (no {\tt Y} given):
-\par\hangindent=\parindent\noindent{\tt
-\hml -f GLM-predict.dml -nvargs dfam=1 vpow=0.0 link=1 lpow=1.0
-  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
-
-}\smallskip\noindent
-Binomial logistic regression example:
-\par\hangindent=\parindent\noindent{\tt
-\hml -f GLM-predict.dml -nvargs dfam=2 link=2 disp=3.0004464
-  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
-  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
-
-}\smallskip\noindent
-Binomial probit regression example:
-\par\hangindent=\parindent\noindent{\tt
-\hml -f GLM-predict.dml -nvargs dfam=2 link=3 disp=3.0004464
-  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
-  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
-
-}\smallskip\noindent
-Multinomial logistic regression example:
-\par\hangindent=\parindent\noindent{\tt
-\hml -f GLM-predict.dml -nvargs dfam=3 
-  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Probabilities.mtx fmt=csv
-  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
-
-}\smallskip\noindent
-Poisson regression with the log link example:
-\par\hangindent=\parindent\noindent{\tt
-\hml -f GLM-predict.dml -nvargs dfam=1 vpow=1.0 link=1 lpow=0.0 disp=3.45
-  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
-  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
-
-}\smallskip\noindent
-Gamma regression with the inverse (reciprocal) link example:
-\par\hangindent=\parindent\noindent{\tt
-\hml -f GLM-predict.dml -nvargs dfam=1 vpow=2.0 link=1 lpow=-1.0 disp=1.99118
-  X=/user/biadmin/X.mtx B=/user/biadmin/B.mtx M=/user/biadmin/Means.mtx fmt=csv
-  Y=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
-
-}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/KaplanMeier.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/KaplanMeier.tex b/Algorithms Reference/KaplanMeier.tex
deleted file mode 100644
index 6ea6fbc..0000000
--- a/Algorithms Reference/KaplanMeier.tex	
+++ /dev/null
@@ -1,289 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{Kaplan-Meier Survival Analysis}
-\label{sec:kaplan-meier}
-
-\noindent{\bf Description}
-\smallskip
-
-
-Survival analysis examines the time needed for a particular event of interest to occur.
-In medical research, for example, the prototypical such event is the death of a patient, but the methodology can be applied to other areas, e.g., the completion of a task by an individual in a psychological experiment or the failure of electrical components in engineering.
-The Kaplan-Meier (or product-limit) method is a simple non-parametric approach for estimating survival probabilities from both censored and uncensored survival times.\\
-
- 
-
-\smallskip
-\noindent{\bf Usage}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}KM.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} TE=}path/file
-{\tt{} GI=}path/file
-{\tt{} SI=}path/file
-{\tt{} O=}path/file
-{\tt{} M=}path/file
-{\tt{} T=}path/file
-{\tt{} alpha=}double
-{\tt{} etype=}greenwood$\mid$peto
-{\tt{} ctype=}plain$\mid$log$\mid$log-log
-{\tt{} ttype=}none$\mid$log-rank$\mid$wilcoxon
-{\tt{} fmt=}format
-
-}
-
-\smallskip
-\noindent{\bf Arguments}
-\begin{Description}
-\item[{\tt X}:]
-Location (on HDFS) to read the input matrix of the survival data containing: 
-\begin{Itemize}
-	\item timestamps,
-	\item whether event occurred (1) or data is censored (0),
-	\item a number of factors (i.e., categorical features) for grouping and/or stratifying
-\end{Itemize}
-\item[{\tt TE}:]
-Location (on HDFS) to read the 1-column matrix $TE$ that contains the column indices of the input matrix $X$ corresponding to timestamps (first entry) and event information (second entry) 
-\item[{\tt GI}:]
-Location (on HDFS) to read the 1-column matrix $GI$ that contains the column indices of the input matrix $X$ corresponding to the factors (i.e., categorical features) to be used for grouping
-\item[{\tt SI}:]
-Location (on HDFS) to read the 1-column matrix $SI$ that contains the column indices of the input matrix $X$ corresponding to the factors (i.e., categorical features) to be used for stratifying
-\item[{\tt O}:]
-Location (on HDFS) to write the matrix containing the results of the Kaplan-Meier analysis $KM$
-\item[{\tt M}:]
-Location (on HDFS) to write matrix $M$ containing the following statistics: the total number of events, and the median and its confidence intervals; if survival data for multiple groups and strata are provided, each row of $M$ contains the above statistics per group and stratum.
-\item[{\tt T}:]
-If survival data from multiple groups is available and {\tt ttype=log-rank} or {\tt ttype=wilcoxon}, location (on HDFS) to write the two matrices that contain the result of the (stratified) test for comparing these groups; see below for details.
-\item[{\tt alpha}:](default:\mbox{ }{\tt 0.05})
-Parameter to compute $100(1-\alpha)\%$ confidence intervals for the survivor function and its median 
-\item[{\tt etype}:](default:\mbox{ }{\tt "greenwood"})
-Parameter to specify the error type: "greenwood" or "peto"
-\item[{\tt ctype}:](default:\mbox{ }{\tt "log"})
-Parameter to modify the confidence interval; "plain" keeps the lower and upper bound of the confidence interval unmodified, "log" corresponds to the log transformation, and "log-log" corresponds to the complementary log-log transformation
-\item[{\tt ttype}:](default:\mbox{ }{\tt "none"})
-If survival data for multiple groups is available, specifies which test to perform for
-comparing survival data across the groups: "none", "log-rank", or "wilcoxon"
-\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-see read/write functions in SystemML Language Reference for details.
-\end{Description}
-
-
-\noindent{\bf Details}
-\smallskip
-
-The Kaplan-Meier estimate is a non-parametric maximum likelihood estimate (MLE) of the survival function $S(t)$, i.e., the probability of survival from the time origin to a given future time. 
-As an illustration, suppose that there are $n$ individuals with observed survival times $t_1,t_2,\ldots,t_n$, out of which there are $r\leq n$ distinct death times $t_{(1)}\leq t_{(2)}\leq \cdots \leq t_{(r)}$; $r$ may be smaller than $n$ since some of the observations may be censored, in the sense that the end-point of interest has not been observed for those individuals, and since there may be more than one individual with the same survival time.
-Let $S(t_{(j)})$ denote the probability of survival until time $t_{(j)}$, $d_j$ be the number of events at time $t_{(j)}$, and $n_j$ denote the number of individuals at risk just before time $t_{(j)}$ (i.e., those who have neither died nor been censored before time $t_{(j)}$).
-Assuming that the events occur independently, in the Kaplan-Meier method the probability of surviving beyond each event time $t_{(j)}$ is estimated by $(n_j-d_j)/n_j$, and multiplying these over all event times up to $t$ gives
-\begin{equation*}
-\hat{S}(t) = \prod_{j=1}^{k} \left( \frac{n_j-d_j}{n_j} \right),
-\end{equation*}   
-for $t_{(k)}\leq t<t_{(k+1)}$, $k=1,2,\ldots,r$, with $\hat{S}(t)=1$ for $t<t_{(1)}$ and $t_{(r+1)}=\infty$.
-Note that the value of $\hat{S}(t)$ is constant between event times and therefore
-the estimate is a step function with jumps at the observed event times.
-If there are no censored data, this estimator simply reduces to the empirical survivor function, defined as $\frac{n_j}{n}$. Thus, the Kaplan-Meier estimate can be seen as a generalization of the empirical survivor function that handles censored observations.
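-
-As a concrete illustration (a sketch only, not part of the {\tt KM.dml} script), $\hat{S}(t)$ can be computed from a column of timestamps and a column of event indicators as follows; the NumPy-based helper and its names are our own:
-\begin{verbatim}
-import numpy as np
-
-def km_estimate(time, event):
-    # time : observed survival times; event : 1 = event occurred, 0 = censored
-    order = np.argsort(time)
-    time, event = time[order], event[order]
-    t = np.unique(time[event == 1])   # distinct event times t_(1), ..., t_(r)
-    n = np.array([(time >= s).sum() for s in t])                   # n_j at risk
-    d = np.array([((time == s) & (event == 1)).sum() for s in t])  # d_j events
-    S = np.cumprod((n - d) / n)       # prod_{j <= k} (n_j - d_j) / n_j
-    return t, n, d, S
-
-# toy data: six individuals, two of them censored
-t, n, d, S = km_estimate(np.array([6., 6., 6., 7., 10., 13.]),
-                         np.array([1, 0, 1, 1, 0, 1]))
-\end{verbatim}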
-
-The methodology used in our {\tt KM.dml} script closely follows~\cite[Sec.~2]{collett2003:kaplanmeier}.
-For completeness we briefly discuss the equations used in our implementation.
-
-% standard error of the survivor function
-\textbf{Standard error of the survivor function.}
-The standard error of the estimated survivor function (controlled by parameter {\tt etype}) can be calculated as  
-\begin{equation*}
-\text{se} \{\hat{S}(t)\} \approx \hat{S}(t) \biggl\{ \sum_{j=1}^{k} \frac{d_j}{n_j(n_j - d_j)}\biggr\}^{1/2},
-\end{equation*}
-for $t_{(k)}\leq t<t_{(k+1)}$.
-This equation is known as {\it Greenwood's} formula.
-An alternative approach is to apply {\it Peto's} expression %~\cite{PetoPABCHMMPS1979:kaplanmeier}
-\begin{equation*}
-\text{se}\{\hat{S}(t)\}=\frac{\hat{S}(t)\sqrt{1-\hat{S}(t)}}{\sqrt{n_k}},
-\end{equation*}
-for $t_{(k)}\leq t<t_{(k+1)}$. 
-%Note that this estimate is known to be conservative producing larger standard errors than they ought to be. The Greenwood estimate is therefore recommended for general use. 
-Once the standard error of $\hat{S}$ has been found, we compute the following types of confidence intervals (controlled by parameter {\tt ctype}):
-the ``plain'' $100(1-\alpha)\%$ confidence interval for $S(t)$ is computed using
-\begin{equation*}
-\hat{S}(t)\pm z_{\alpha/2} \text{se}\{\hat{S}(t)\}, 
-\end{equation*} 
-where $z_{\alpha/2}$ is the upper $\alpha/2$-point of the standard normal distribution. 
-Alternatively, we can apply the ``log'' transformation using 
-\begin{equation*}
-\hat{S}(t)^{\exp[\pm z_{\alpha/2} \text{se}\{\hat{S}(t)\}/\hat{S}(t)]}
-\end{equation*}
-or the ``log-log'' transformation using 
-\begin{equation*}
-\hat{S}(t)^{\exp [\pm z_{\alpha/2} \text{se} \{\log [-\log \hat{S}(t)]\}]}.
-\end{equation*}
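-
-The error types and interval transformations above translate directly into code.  The following minimal sketch (our own, assuming NumPy/SciPy and the arrays returned by the helper above; edge cases such as $d_j=n_j$ are ignored) computes the Greenwood standard error together with the ``plain'' and ``log'' intervals:
-\begin{verbatim}
-import numpy as np
-from scipy.stats import norm
-
-def km_confidence(S, n, d, alpha=0.05, ctype="log"):
-    z = norm.ppf(1 - alpha / 2)          # upper alpha/2 point of N(0,1)
-    se = S * np.sqrt(np.cumsum(d / (n * (n - d))))   # Greenwood's formula
-    if ctype == "plain":
-        lo, hi = S - z * se, S + z * se
-    else:                                # "log" transformation
-        c = np.exp(z * se / S)
-        lo, hi = S ** c, S ** (1 / c)    # exponent > 1 shrinks S in (0, 1)
-    return se, lo, hi
-\end{verbatim}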
-
-% standard error of the median of survival times
-\textbf{Median, its standard error and confidence interval.}
-Denote by $\hat{t}(50)$ the estimated median of $\hat{S}$, i.e.,
-$\hat{t}(50)=\min \{ t_i \mid \hat{S}(t_i) < 0.5\}$,
-where $t_i$ is the observed survival time for individual $i$.
-The standard error of $\hat{t}(50)$ is given by
-\begin{equation*}
-\text{se}\{ \hat{t}(50) \} = \frac{1}{\hat{f}\{\hat{t}(50)\}} \text{se}[\hat{S}\{ \hat{t}(50) \}],
-\end{equation*}
-where $\hat{f}\{ \hat{t}(50) \}$ can be found from
-\begin{equation*}
-\hat{f}\{ \hat{t}(50) \} = \frac{\hat{S}\{ \hat{u}(50) \} -\hat{S}\{ \hat{l}(50) \} }{\hat{l}(50) - \hat{u}(50)}. 
-\end{equation*}
-Above, $\hat{u}(50)$ is the largest survival time for which $\hat{S}$ is at least $0.5+\epsilon$, i.e., $\hat{u}(50)=\max \bigl\{ t_{(j)} \mid \hat{S}(t_{(j)}) \geq 0.5+\epsilon \bigr\}$,
-and $\hat{l}(50)$ is the smallest survival time for which $\hat{S}$ is at most $0.5-\epsilon$,
-i.e., $\hat{l}(50)=\min \bigl\{ t_{(j)} \mid \hat{S}(t_{(j)}) \leq 0.5-\epsilon \bigr\}$,
-for small $\epsilon$.
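-
-Under these definitions the median itself can be read off $\hat{S}$ directly; a short sketch (same assumptions as above, returning {\tt np.nan} when $\hat{S}$ never drops below $0.5$):
-\begin{verbatim}
-import numpy as np
-
-def km_median(t, S):
-    below = np.where(S < 0.5)[0]     # indices where S-hat drops below 0.5
-    return t[below[0]] if below.size else np.nan
-\end{verbatim}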
-
-
-% comparing two or more groups of data
-\textbf{Log-rank test and Wilcoxon test.}
-Our implementation supports comparison of survival data from several groups using two non-parametric procedures (controlled by parameter {\tt ttype}): the {\it log-rank test} and the {\it Wilcoxon test} (also known as the {\it Breslow test}). 
-Assume that the survival times from $g\geq 2$ groups are to be compared.
-Consider the {\it null hypothesis} that there is no difference in the survival times of the individuals across the groups. One way to examine the null hypothesis is to compare the observed number of deaths with the number expected under the null hypothesis.
-In both tests we define the $U$-statistics ($U_{L}$ for the log-rank test and $U_{W}$ for the Wilcoxon test) that compare the observed with the expected number of deaths in groups $1,2,\ldots,g-1$ as follows:
-\begin{align*}
-U_{Lk} &= \sum_{j=1}^{r}\left( d_{kj} - \frac{n_{kj}d_j}{n_j} \right), \\
-U_{Wk} &= \sum_{j=1}^{r}n_j\left( d_{kj} - \frac{n_{kj}d_j}{n_j} \right),
-\end{align*}
-where $d_{kj}$ is the number of deaths at time $t_{(j)}$ in group $k$, 
-$n_{kj}$ is the number of individuals at risk at time $t_{(j)}$ in group $k$, and 
-$k=1,2,\ldots,g-1$ to form the vectors $U_L$ and $U_W$ with $(g-1)$ components.
-The covariance between $U_{Lk}$ and $U_{Lk'}$ (their variance when $k=k'$) is computed as
-\begin{equation*}
-V_{Lkk'}=\sum_{j=1}^{r} \frac{n_{kj}d_j(n_j-d_j)}{n_j(n_j-1)} \left( \delta_{kk'}-\frac{n_{k'j}}{n_j} \right),
-\end{equation*}
-for $k,k'=1,2,\ldots,g-1$, with
-\begin{equation*}
-\delta_{kk'} = 
-\begin{cases}
-1 & \text{if } k=k'\\
-0 & \text{otherwise.}
-\end{cases}
-\end{equation*}
-These terms are combined in a {\it variance-covariance} matrix $V_L$ (referred to as the $V$-statistic).
-Similarly, the variance-covariance matrix for the Wilcoxon test $V_W$ is a matrix where the entry at position $(k,k')$ is given by
-\begin{equation*}
-V_{Wkk'}=\sum_{j=1}^{r} n_j^2 \frac{n_{kj}d_j(n_j-d_j)}{n_j(n_j-1)} \left( \delta_{kk'}-\frac{n_{k'j}}{n_j} \right).
-\end{equation*}
-
-Under the null hypothesis of no group differences, the test statistics $U_L^\top V_L^{-1} U_L$ for the log-rank test and  $U_W^\top V_W^{-1} U_W$ for the Wilcoxon test have a Chi-squared distribution on $(g-1)$ degrees of freedom.
-Our {\tt KM.dml} script also provides a stratified version of the log-rank or Wilcoxon test if requested.
-In this case, the values of the $U$- and $V$- statistics are computed for each stratum and then combined over all strata.
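-
-To make the $U$- and $V$-statistics concrete, here is an illustrative unstratified sketch of both tests; the function and its argument conventions are our own (NumPy/SciPy), not the DML code:
-\begin{verbatim}
-import numpy as np
-from scipy.stats import chi2
-
-def group_test(time, event, group, wilcoxon=False):
-    labels = np.unique(group)
-    g = len(labels)
-    U = np.zeros(g - 1)
-    V = np.zeros((g - 1, g - 1))
-    for s in np.unique(time[event == 1]):        # distinct event times
-        n_j = (time >= s).sum()
-        d_j = ((time == s) & (event == 1)).sum()
-        n_kj = np.array([((time >= s) & (group == k)).sum() for k in labels])
-        d_kj = np.array([((time == s) & (event == 1) & (group == k)).sum()
-                         for k in labels])
-        w = n_j if wilcoxon else 1.0             # Wilcoxon weights by n_j
-        U += w * (d_kj - n_kj * d_j / n_j)[:-1]
-        if n_j > 1:
-            q = n_kj * d_j * (n_j - d_j) / (n_j * (n_j - 1.0))
-            for k in range(g - 1):
-                for l in range(g - 1):
-                    delta = 1.0 if k == l else 0.0
-                    V[k, l] += w * w * q[k] * (delta - n_kj[l] / n_j)
-    stat = U @ np.linalg.solve(V, U)             # U' V^{-1} U
-    return stat, chi2.sf(stat, df=g - 1)         # statistic and P-value
-\end{verbatim}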
-
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-  
-Below we list the results of the survival analysis computed by {\tt KM.dml}. 
-The calculated statistics are stored in matrix $KM$ with the following schema:
-\begin{itemize}
-	\item Column 1: timestamps 
-	\item Column 2: number of individuals at risk
-	\item Column 3: number of events
-	\item Column 4: Kaplan-Meier estimate of the survivor function $\hat{S}$ 
-	\item Column 5: standard error of $\hat{S}$
-	\item Column 6: lower bound of $100(1-\alpha)\%$ confidence interval for $\hat{S}$
-	\item Column 7: upper bound of $100(1-\alpha)\%$ confidence interval for $\hat{S}$
-\end{itemize}
-Note that if survival data for multiple groups and/or strata are available, each collection of 7 columns in $KM$ stores the results per group and/or per stratum. 
-In this case $KM$ has $7g+7s$ columns, where $g\geq 1$ and $s\geq 1$ denote the number of groups and strata, respectively. 
-
-
-Additionally, {\tt KM.dml} stores the following statistics in the 1-row matrix $M$ whose number of columns depends on the number of groups ($g$) and strata ($s$) in the data. Below $k$ denotes the number of factors used for grouping and $l$ denotes the number of factors used for stratifying. 
-\begin{itemize}
-	\item Columns 1 to $k$: unique combination of values in the $k$ factors used for grouping 
-	\item Columns $k+1$ to $k+l$: unique combination of values in the $l$ factors used for stratifying  
-	\item Column $k+l+1$: total number of records 
-	\item Column $k+l+2$: total number of events
-    \item Column $k+l+3$: median of $\hat{S}$
-    \item Column $k+l+4$: lower bound of $100(1-\alpha)\%$ confidence interval for the median of $\hat{S}$
-    \item Column $k+l+5$: upper bound of $100(1-\alpha)\%$ confidence interval for the median of $\hat{S}$. 
-\end{itemize}
-If there is only 1 group and 1 stratum available, $M$ will be a 1-row matrix with 5 columns where
-\begin{itemize}
-	\item Column 1: total number of records
-	\item Column 2: total number of events
-	\item Column 3: median of $\hat{S}$
-	\item Column 4: lower bound of $100(1-\alpha)\%$ confidence interval for the median of $\hat{S}$
-	\item Column 5: upper bound of $100(1-\alpha)\%$ confidence interval for the median of $\hat{S}$.
-\end{itemize} 
-
-If a comparison of the survival data across multiple groups needs to be performed, {\tt KM.dml} computes two matrices $T$ and $T\_GROUPS\_OE$ that contain a summary of the test. The 1-row matrix $T$ stores the following statistics: 
-\begin{itemize}
-	\item Column 1: number of groups in the survival data
- 	\item Column 2: degrees of freedom of the Chi-squared distributed test statistic
-	\item Column 3: value of test statistic
-	\item Column 4: $P$-value.
-\end{itemize}
-Matrix $T\_GROUPS\_OE$ contains the following statistics for each of $g$ groups:
-\begin{itemize}
-	\item Column 1: number of events
-	\item Column 2: number of observed death times ($O$)
-	\item Column 3: number of expected death times ($E$)
-	\item Column 4: $(O-E)^2/E$
-	\item Column 5: $(O-E)^2/V$.
-\end{itemize}
-
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-	\hml -f KM.dml -nvargs X=/user/biadmin/X.mtx TE=/user/biadmin/TE
-	GI=/user/biadmin/GI SI=/user/biadmin/SI O=/user/biadmin/kaplan-meier.csv
-	M=/user/biadmin/model.csv alpha=0.01 etype=greenwood ctype=plain fmt=csv
-	
-}\smallskip
-
-{\hangindent=\parindent\noindent\tt
-	\hml -f KM.dml -nvargs X=/user/biadmin/X.mtx TE=/user/biadmin/TE
-	GI=/user/biadmin/GI SI=/user/biadmin/SI O=/user/biadmin/kaplan-meier.csv
-	M=/user/biadmin/model.csv T=/user/biadmin/test.csv alpha=0.01 etype=peto 
-	ctype=log ttype=log-rank fmt=csv
-	
-}
-
-%
-%\smallskip
-%\noindent{\bf References}
-%\begin{itemize}
-%	\item
-%	R.~Peto, M.C.~Pike, P.~Armitage, N.E.~Breslow, D.R.~Cox, S.V.~Howard, N.~Mantel, K.~McPherson, J.~Peto, and P.G.~Smith.
-%	\newblock Design and analysis of randomized clinical trials requiring prolonged observation of each patient.
-%	\newblock {\em British Journal of Cancer}, 35:1--39, 1979.
-%\end{itemize}
-
-%@book{collett2003:kaplanmeier,
-%	title={Modelling Survival Data in Medical Research, Second Edition},
-%	author={Collett, D.},
-%	isbn={9781584883258},
-%	lccn={2003040945},
-%	series={Chapman \& Hall/CRC Texts in Statistical Science},
-%	year={2003},
-%	publisher={Taylor \& Francis}
-%}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/358cfc9f/Algorithms Reference/Kmeans.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/Kmeans.tex b/Algorithms Reference/Kmeans.tex
deleted file mode 100644
index 2b5492c..0000000
--- a/Algorithms Reference/Kmeans.tex	
+++ /dev/null
@@ -1,371 +0,0 @@
-\begin{comment}
-
- Licensed to the Apache Software Foundation (ASF) under one
- or more contributor license agreements.  See the NOTICE file
- distributed with this work for additional information
- regarding copyright ownership.  The ASF licenses this file
- to you under the Apache License, Version 2.0 (the
- "License"); you may not use this file except in compliance
- with the License.  You may obtain a copy of the License at
-
-   http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing,
- software distributed under the License is distributed on an
- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- KIND, either express or implied.  See the License for the
- specific language governing permissions and limitations
- under the License.
-
-\end{comment}
-
-\subsection{K-Means Clustering}
-
-\noindent{\bf Description}
-\smallskip
-
-Given a collection of $n$ records with a pairwise similarity measure,
-the goal of clustering is to assign a category label to each record so that
-similar records tend to get the same label.  In contrast to multinomial
-logistic regression, clustering is an \emph{unsupervised}\/ learning problem
-with neither category assignments nor label interpretations given in advance.
-In $k$-means clustering, the records $x_1, x_2, \ldots, x_n$ are numerical
-feature vectors of $\dim x_i = m$ with the squared Euclidean distance 
-$\|x_i - x_{i'}\|_2^2$ as the similarity measure.  We want to partition
-$\{x_1, \ldots, x_n\}$ into $k$ clusters $\{S_1, \ldots, S_k\}$ so that
-the aggregated squared distance from records to their cluster means is
-minimized:
-\begin{equation}
-\textrm{WCSS}\,\,=\,\, \sum_{i=1}^n \,\big\|x_i - \mean(S_j: x_i\in S_j)\big\|_2^2 \,\,\to\,\,\min
-\label{eqn:WCSS}
-\end{equation}
-The aggregated distance measure in~(\ref{eqn:WCSS}) is called the
-\emph{within-cluster sum of squares}~(WCSS).  It can be viewed as a measure
-of residual variance that remains in the data after the clustering assignment,
-conceptually similar to the residual sum of squares~(RSS) in linear regression.
-However, unlike for the RSS, the minimization of~(\ref{eqn:WCSS}) is an NP-hard 
-problem~\cite{AloiseDHP2009:kmeans}.
-
-Rather than searching for the global optimum in~(\ref{eqn:WCSS}), a heuristic algorithm
-called Lloyd's algorithm is typically used.  This iterative algorithm maintains
-and updates a set of $k$~\emph{centroids} $\{c_1, \ldots, c_k\}$, one centroid per cluster.
-It defines each cluster $S_j$ as the set of all records closer to~$c_j$ than
-to any other centroid.  Each iteration of the algorithm reduces the WCSS in two steps:
-\begin{Enumerate}
-\item Assign each record to the closest centroid, so that, in general, $\mean(S_j)\neq c_j$;
-\label{step:kmeans:recluster}
-\item Reset each centroid to its cluster's mean: $c_j := \mean(S_j)$.
-\label{step:kmeans:recenter}
-\end{Enumerate}
-After Step~\ref{step:kmeans:recluster} the centroids are generally different from the cluster
-means, so we can compute another ``within-cluster sum of squares'' based on the centroids:
-\begin{equation}
-\textrm{WCSS\_C}\,\,=\,\, \sum_{i=1}^n \,\big\|x_i - \mathop{\textrm{centroid}}(S_j: x_i\in S_j)\big\|_2^2
-\label{eqn:WCSS:C}
-\end{equation}
-This WCSS\_C after Step~\ref{step:kmeans:recluster} is less than the means-based WCSS
-before Step~\ref{step:kmeans:recluster} (or equal to it once convergence is achieved), and in
-Step~\ref{step:kmeans:recenter} the WCSS cannot exceed the WCSS\_C for \emph{the same}
-clustering; hence the WCSS reduction.
-
-Exact convergence is reached when each record becomes closer to its
-cluster's mean than to any other cluster's mean, so there are no more re-assignments
-and the centroids coincide with the means.  In practice, iterations may be stopped
-when the reduction in WCSS (or in WCSS\_C) falls below a minimum threshold, or upon
-reaching the maximum number of iterations.  The initialization of the centroids is also
-an important part of the algorithm.  The smallest WCSS obtained by the algorithm is
-in general not the global minimum and varies depending on the initial centroids.  We implement multiple
-parallel runs with different initial centroids and report the best result.
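-
-For intuition, one run of Lloyd's algorithm fits in a few lines of NumPy.  This is our own illustration under simplifying assumptions, not the distributed {\tt Kmeans.dml} implementation:
-\begin{verbatim}
-import numpy as np
-
-def lloyd(X, C, max_iter=1000, tol=1e-6):
-    # X: n x m records; C: k x m initial centroids (float, modified in place)
-    wcss_old = np.inf
-    for _ in range(max_iter):
-        # squared distances via ||x||^2 - 2 x.c + ||c||^2
-        D = (X * X).sum(1)[:, None] - 2 * X @ C.T + (C * C).sum(1)[None, :]
-        y = D.argmin(1)                          # Step 1: re-cluster
-        wcss_c = D[np.arange(len(X)), y].sum()   # centroid-based WCSS_C
-        if wcss_old - wcss_c < tol * wcss_c:     # convergence criterion
-            break
-        wcss_old = wcss_c
-        for j in range(len(C)):                  # Step 2: re-center
-            if not (y == j).any():
-                raise RuntimeError("runaway centroid: the run fails")
-            C[j] = X[y == j].mean(0)
-    return C, y, wcss_c
-\end{verbatim}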
-
-\Paragraph{Scoring} 
-Our scoring script evaluates the clustering output by comparing it with a known category
-assignment.  Since cluster labels have no prior correspondence to the categories, we
-cannot count ``correct'' and ``wrong'' cluster assignments.  Instead, we quantify them in
-two ways:
-\begin{Enumerate}
-\item Count how many same-category and different-category pairs of records end up in the
-same cluster or in different clusters (see the sketch after this list);
-\item For each category, count the prevalence of its most common cluster; for each
-cluster, count the prevalence of its most common category.
-\end{Enumerate}
-The number of categories and the number of clusters ($k$) do not have to be equal.  
-A same-category pair of records clustered into the same cluster is viewed as a
-``true positive,'' a different-category pair clustered together is a ``false positive,''
-a same-category pair clustered apart is a ``false negative''~etc.
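-
-A minimal sketch of these pair counts (our own naming; quadratic memory, for illustration only), comparing a specified assignment {\tt spY} with a predicted one {\tt prY}, both given as 1-D NumPy arrays:
-\begin{verbatim}
-import numpy as np
-
-def pair_counts(spY, prY):
-    same_cat = spY[:, None] == spY[None, :]
-    same_clu = prY[:, None] == prY[None, :]
-    pair = np.triu(np.ones(same_cat.shape, dtype=bool), k=1)  # each pair once
-    tp = ( same_cat &  same_clu & pair).sum()  # same category, same cluster
-    fp = (~same_cat &  same_clu & pair).sum()  # diff category, same cluster
-    fn = ( same_cat & ~same_clu & pair).sum()  # same category, diff cluster
-    tn = (~same_cat & ~same_clu & pair).sum()  # diff category, diff cluster
-    return tp, fp, fn, tn
-\end{verbatim}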
-
-
-\smallskip
-\noindent{\bf Usage: K-means Script}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}Kmeans.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} C=}path/file
-{\tt{} k=}int
-{\tt{} runs=}int
-{\tt{} maxi=}int
-{\tt{} tol=}double
-{\tt{} samp=}int
-{\tt{} isY=}int
-{\tt{} Y=}path/file
-{\tt{} fmt=}format
-{\tt{} verb=}int
-
-}
-
-\smallskip
-\noindent{\bf Usage: K-means Scoring/Prediction}
-\smallskip
-
-{\hangindent=\parindent\noindent\it%
-{\tt{}-f }path/\/{\tt{}Kmeans-predict.dml}
-{\tt{} -nvargs}
-{\tt{} X=}path/file
-{\tt{} C=}path/file
-{\tt{} spY=}path/file
-{\tt{} prY=}path/file
-{\tt{} fmt=}format
-{\tt{} O=}path/file
-
-}
-
-\smallskip
-\noindent{\bf Arguments}
-\begin{Description}
-\item[{\tt X}:]
-Location to read matrix $X$ with the input data records as rows
-\item[{\tt C}:] (default:\mbox{ }{\tt "C.mtx"})
-Location to store the output matrix with the best available cluster centroids as rows
-\item[{\tt k}:]
-Number of clusters (and centroids)
-\item[{\tt runs}:] (default:\mbox{ }{\tt 10})
-Number of parallel runs, each run with different initial centroids
-\item[{\tt maxi}:] (default:\mbox{ }{\tt 1000})
-Maximum number of iterations per run
-\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
-Tolerance (epsilon) for single-iteration WCSS\_C change ratio
-\item[{\tt samp}:] (default:\mbox{ }{\tt 50})
-Average number of records per centroid in data samples used in the centroid
-initialization procedure
-\item[{\tt Y}:] (default:\mbox{ }{\tt "Y.mtx"})
-Location to store the one-column matrix $Y$ with the best available mapping of
-records to clusters (defined by the output centroids)
-\item[{\tt isY}:] (default:\mbox{ }{\tt 0})
-{\tt 0} = do not write matrix~$Y$,  {\tt 1} = write~$Y$
-\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
-see read/write functions in SystemML Language Reference for details.
-\item[{\tt verb}:] (default:\mbox{ }{\tt 0})
-{\tt 0} = do not print per-iteration statistics for each run, {\tt 1} = print them
-(the ``verbose'' option)
-\end{Description}
-\smallskip
-\noindent{\bf Arguments --- Scoring/Prediction}
-\begin{Description}
-\item[{\tt X}:] (default:\mbox{ }{\tt " "})
-Location to read matrix $X$ with the input data records as rows,
-optional when {\tt prY} input is provided
-\item[{\tt C}:] (default:\mbox{ }{\tt " "})
-Location to read matrix $C$ with cluster centroids as rows, optional
-when {\tt prY} input is provided; NOTE: if both {\tt X} and {\tt C} are
-provided, {\tt prY} is an output, not input
-\item[{\tt spY}:] (default:\mbox{ }{\tt " "})
-Location to read a one-column matrix with the externally specified ``true''
-assignment of records (rows) to categories, optional for prediction without
-scoring
-\item[{\tt prY}:] (default:\mbox{ }{\tt " "})
-Location to read (or write, if {\tt X} and {\tt C} are present) a
-column-vector with the predicted assignment of rows to clusters;
-NOTE: No prior correspondence is assumed between the predicted
-cluster labels and the externally specified categories
-\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
-Matrix file output format for {\tt prY}, such as {\tt text}, {\tt mm},
-or {\tt csv}; see read/write functions in SystemML Language Reference
-for details
-\item[{\tt O}:] (default:\mbox{ }{\tt " "})
-Location to write the output statistics defined in 
-Table~\ref{table:kmeans:predict:stats}, by default print them to the
-standard output
-\end{Description}
-
-
-\begin{table}[t]\small\centerline{%
-\begin{tabular}{|lcl|}
-\hline
-Name & CID & Meaning \\
-\hline
-{\tt TSS}             &     & Total Sum of Squares (from the total mean) \\
-{\tt WCSS\_M}         &     & Within-Cluster  Sum of Squares (means as centers) \\
-{\tt WCSS\_M\_PC}     &     & Within-Cluster  Sum of Squares (means), in \% of TSS \\
-{\tt BCSS\_M}         &     & Between-Cluster Sum of Squares (means as centers) \\
-{\tt BCSS\_M\_PC}     &     & Between-Cluster Sum of Squares (means), in \% of TSS \\
-\hline
-{\tt WCSS\_C}         &     & Within-Cluster  Sum of Squares (centroids as centers) \\
-{\tt WCSS\_C\_PC}     &     & Within-Cluster  Sum of Squares (centroids), \% of TSS \\
-{\tt BCSS\_C}         &     & Between-Cluster Sum of Squares (centroids as centers) \\
-{\tt BCSS\_C\_PC}     &     & Between-Cluster Sum of Squares (centroids), \% of TSS \\
-\hline
-{\tt TRUE\_SAME\_CT}  &     & Same-category pairs predicted as Same-cluster, count \\
-{\tt TRUE\_SAME\_PC}  &     & Same-category pairs predicted as Same-cluster, \% \\
-{\tt TRUE\_DIFF\_CT}  &     & Diff-category pairs predicted as Diff-cluster, count \\
-{\tt TRUE\_DIFF\_PC}  &     & Diff-category pairs predicted as Diff-cluster, \% \\
-{\tt FALSE\_SAME\_CT} &     & Diff-category pairs predicted as Same-cluster, count \\
-{\tt FALSE\_SAME\_PC} &     & Diff-category pairs predicted as Same-cluster, \% \\
-{\tt FALSE\_DIFF\_CT} &     & Same-category pairs predicted as Diff-cluster, count \\
-{\tt FALSE\_DIFF\_PC} &     & Same-category pairs predicted as Diff-cluster, \% \\
-\hline
-{\tt SPEC\_TO\_PRED}  & $+$ & For specified category, the best predicted cluster id \\
-{\tt SPEC\_FULL\_CT}  & $+$ & For specified category, its full count \\
-{\tt SPEC\_MATCH\_CT} & $+$ & For specified category, best-cluster matching count \\
-{\tt SPEC\_MATCH\_PC} & $+$ & For specified category, \% of matching to full count \\
-{\tt PRED\_TO\_SPEC}  & $+$ & For predicted cluster, the best specified category id \\
-{\tt PRED\_FULL\_CT}  & $+$ & For predicted cluster, its full count \\
-{\tt PRED\_MATCH\_CT} & $+$ & For predicted cluster, best-category matching count \\
-{\tt PRED\_MATCH\_PC} & $+$ & For predicted cluster, \% of matching to full count \\
-\hline
-\end{tabular}}
-\caption{The {\tt O}-file for {\tt Kmeans-predict} provides the output statistics
-in CSV format, one per line, in the following format: (NAME, [CID], VALUE).  Note:
-the 1st group statistics are given if {\tt X} input is available;
-the 2nd group statistics are given if {\tt X} and {\tt C} inputs are available;
-the 3rd and 4th group statistics are given if {\tt spY} input is available;
-only the 4th group statistics contain a nonempty CID value;
-when present, CID contains either the specified category label or the
-predicted cluster label.}
-\label{table:kmeans:predict:stats}
-\end{table}
-
-
-\noindent{\bf Details}
-\smallskip
-
-Our clustering script proceeds in 3~stages: centroid initialization,
-parallel $k$-means iterations, and the best-available output generation.
-Centroids are initialized at random from the input records (the rows of~$X$),
-biased towards being chosen far apart from each other.  The initialization
-method is based on the {\tt k-means++} heuristic from~\cite{ArthurVassilvitskii2007:kmeans},
-with one important difference: to reduce the number of passes through~$X$,
-we take a small sample of $X$ and run the {\tt k-means++} heuristic over
-this sample.  Here is, conceptually, our centroid initialization algorithm
-for one clustering run:
-\begin{Enumerate}
-\item Sample the rows of~$X$ uniformly at random, picking each row with probability
-$p = ks / n$ where
-\begin{Itemize}
-\item $k$~is the number of centroids, 
-\item $n$~is the number of records, and
-\item $s$~is the {\tt samp} input parameter.
-\end{Itemize}
-If $ks \geq n$, the entire $X$ is used in place of its sample.
-\item Choose the first centroid uniformly at random from the sampled rows.
-\item Choose each subsequent centroid from the sampled rows, at random, with
-probability proportional to the squared Euclidean distance between the row and
-the nearest already-chosen centroid.
-\end{Enumerate}
-The sampling of $X$ and the selection of centroids are performed independently
-and in parallel for each run of the $k$-means algorithm.  When we sample the
-rows of~$X$, rather than tossing a random coin for each row, we compute the
-number of rows to skip until the next sampled row as $\lceil \log(u) / \log(1 - p) \rceil$
-where $u\in (0, 1)$ is uniformly random.  This time-saving trick works because
-\begin{equation*}
-\Prob [k-1 < \log_{1-p}(u) < k] \,\,=\,\, p(1-p)^{k-1} \,\,=\,\,
-\Prob [\textrm{skip $k-1$ rows}]
-\end{equation*}
-However, it requires us to estimate the maximum sample size, which we set
-near~$ks + 10\sqrt{ks}$ to make it generous enough.
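-
-A compact sketch of this sampling trick and of the subsequent {\tt k-means++} selection over the sample (again our own NumPy illustration, with {\tt rng = np.random.default\_rng()}, not the DML code):
-\begin{verbatim}
-import numpy as np
-
-def sample_rows(X, p, rng):
-    # geometric skips instead of one random coin per row
-    rows, i = [], -1
-    while True:
-        u = 1.0 - rng.random()            # u in (0, 1]
-        i += max(1, int(np.ceil(np.log(u) / np.log(1.0 - p))))
-        if i >= len(X):
-            return X[rows] if rows else X[:1]
-        rows.append(i)
-
-def kmeanspp(S, k, rng):
-    C = S[[rng.integers(len(S))]]         # first centroid: uniform at random
-    for _ in range(k - 1):
-        # squared distance from each sampled row to its nearest chosen centroid
-        d2 = ((S[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(1)
-        C = np.vstack([C, S[[rng.choice(len(S), p=d2 / d2.sum())]]])
-    return C
-\end{verbatim}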
-
-Once we have selected the initial centroid sets, we start the $k$-means iterations
-independently in parallel for all clustering runs.  The number of clustering runs
-is given as the {\tt runs} input parameter.  Each iteration of each clustering run
-performs the following steps:
-\begin{Itemize}
-\item Compute the centroid-dependent part of squared Euclidean distances from
-all records (rows of~$X$) to each of the $k$~centroids using matrix product;
-\item Take the minimum of the above for each record;
-\item Update the current within-cluster sum of squares (WCSS) value, with centroids
-substituted for the means for efficiency;
-\item Check the convergence criterion:\hfil
-$\textrm{WCSS}_{\mathrm{old}} - \textrm{WCSS}_{\mathrm{new}} < \eps\cdot\textrm{WCSS}_{\mathrm{new}}$\linebreak
-as well as the number of iterations limit;
-\item Find the closest centroid for each record, sharing equally any records with multiple
-closest centroids;
-\item Compute the number of records closest to each centroid, checking for ``runaway''
-centroids with no records left (in which case the run fails);
-\item Compute the new centroids by averaging the records in their clusters.
-\end{Itemize}
-When a termination condition is satisfied, we store the centroids and the WCSS value
-and exit this run.  A run has to satisfy the WCSS convergence criterion to be considered
-successful.  Upon the termination of all runs, we select the smallest WCSS value among
-the successful runs, and write out this run's centroids.  If requested, we also compute
-the cluster assignment of all records in~$X$, using integers from 1 to~$k$ as the cluster
-labels.  The scoring script can then be used to compare the cluster assignment with
-an externally specified category assignment.
-
-\smallskip
-\noindent{\bf Returns}
-\smallskip
-
-We output the $k$ centroids for the best available clustering, i.e., the one whose WCSS
-is the smallest among all successful runs.
-The centroids are written as the rows of the $k\,{\times}\,m$-matrix into the output
-file whose path/name was provided as the ``{\tt C}'' input argument.  If the input
-parameter ``{\tt isY}'' was set to~{\tt 1}, we also output the one-column matrix with
-the cluster assignment for all the records.  This assignment is written into the
-file whose path/name was provided as the ``{\tt Y}'' input argument.
-The best WCSS value, as well as some information about the performance of the other
-runs, is printed during the script execution.  The scoring script {\tt Kmeans-predict}
-prints all its results in a self-explanatory manner, as defined in
-Table~\ref{table:kmeans:predict:stats}.
-
-
-\smallskip
-\noindent{\bf Examples}
-\smallskip
-
-{\hangindent=\parindent\noindent\tt
-\hml -f Kmeans.dml -nvargs X=/user/biadmin/X.mtx k=5 C=/user/biadmin/centroids.mtx fmt=csv
-
-}
-
-{\hangindent=\parindent\noindent\tt
-\hml -f Kmeans.dml -nvargs X=/user/biadmin/X.mtx k=5 runs=100 maxi=5000 
-tol=0.00000001 samp=20 C=/user/biadmin/centroids.mtx isY=1 Y=/user/biadmin/Yout.mtx verb=1
-
-}
-\noindent To predict {\tt Y} given {\tt X} and {\tt C}:
-
-{\hangindent=\parindent\noindent\tt
-\hml -f Kmeans-predict.dml -nvargs X=/user/biadmin/X.mtx
-         C=/user/biadmin/C.mtx prY=/user/biadmin/PredY.mtx O=/user/biadmin/stats.csv
-
-}
-\noindent To compare ``actual'' labels {\tt spY} with ``predicted'' labels given {\tt X} and {\tt C}:
-
-{\hangindent=\parindent\noindent\tt
-\hml -f Kmeans-predict.dml -nvargs X=/user/biadmin/X.mtx
-         C=/user/biadmin/C.mtx spY=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
-
-}
-\noindent To compare ``actual'' labels {\tt spY} with given ``predicted'' labels {\tt prY}:
-
-{\hangindent=\parindent\noindent\tt
-\hml -f Kmeans-predict.dml -nvargs spY=/user/biadmin/Y.mtx prY=/user/biadmin/PredY.mtx O=/user/biadmin/stats.csv
-
-}
-
-\smallskip
-\noindent{\bf References}
-\begin{itemize}
-\item
-D.~Aloise, A.~Deshpande, P.~Hansen, and P.~Popat.
-\newblock {NP}-hardness of {E}uclidean sum-of-squares clustering.
-\newblock {\em Machine Learning}, 75(2):245--248, May 2009.
-\item
-D.~Arthur and S.~Vassilvitskii.
-\newblock {\tt k-means++}: The advantages of careful seeding.
-\newblock In {\em Proceedings of the 18th Annual {ACM-SIAM} Symposium on
-  Discrete Algorithms ({SODA}~2007)}, pages 1027--1035, New Orleans~{LA},
-  {USA}, January 7--9 2007.
-\end{itemize}