Posted to commits@systemml.apache.org by du...@apache.org on 2015/12/02 02:05:16 UTC

[38/47] incubator-systemml git commit: [SYSML-328] Markdown for additional algorithms

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/2a2520bd/algorithms-matrix-factorization.md
----------------------------------------------------------------------
diff --git a/algorithms-matrix-factorization.md b/algorithms-matrix-factorization.md
index 2fedfce..a46a2cd 100644
--- a/algorithms-matrix-factorization.md
+++ b/algorithms-matrix-factorization.md
@@ -3,7 +3,6 @@ layout: global
 title: SystemML Algorithms Reference - Matrix Factorization
 displayTitle: <a href="algorithms-reference.html">SystemML Algorithms Reference</a>
 ---
-      
 
 # 5 Matrix Factorization
 
@@ -27,14 +26,14 @@ top-$K$ (for a given value of $K$) principle components.
 ### Usage
 
     hadoop jar SystemML.jar -f PCA.dml
-                            -nvargs INPUT=file 
-                                    K=int
-                                    CENTER=int
-                                    SCALE=int
-                                    PROJDATA=int 
-                                    OFMT=format 
-                                    MODEL=file 
-                                    OUTPUT=file
+                            -nvargs INPUT=<file>
+                                    K=<int>
+                                    CENTER=[int]
+                                    SCALE=[int]
+                                    PROJDATA=<int>
+                                    OFMT=[format]
+                                    MODEL=<file>
+                                    OUTPUT=<file>
 
 
 #### Arguments
@@ -42,22 +41,23 @@ top-$K$ (for a given value of $K$) principle components.
 **INPUT**: Location (on HDFS) to read the input matrix.
 
 **K**: Indicates dimension of the new vector space constructed from $K$
-    principle components. It must be a value between $1$ and the number
+    principal components. It must be a value between `1` and the number
     of columns in the input data.
 
-**CENTER**: (default: 0) 0 or 1. Indicates whether or not to
+**CENTER**: (default: `0`) `0` or `1`. Indicates whether or not to
     *center* input data prior to the computation of
    principal components.
 
-**SCALE**: (default: 0) 0 or 1. Indicates whether or not to
+**SCALE**: (default: `0`) `0` or `1`. Indicates whether or not to
     *scale* input data prior to the computation of
    principal components.
 
-**PROJDATA**: 0 or 1. Indicates whether or not the input data must be projected
+**PROJDATA**: `0` or `1`. Indicates whether or not the input data must be projected
    onto the new vector space defined over the principal components.
 
-**OFMT**: (default: `"csv"`) Specifies the output format.
-    Choice of comma-separated values (`csv`) or as a sparse-matrix (`text`).
+**OFMT**: (default: `"csv"`) Matrix file output format, such as `text`,
+`mm`, or `csv`; see read/write functions in
+SystemML Language Reference for details.
 
 **MODEL**: Either the location (on HDFS) where the computed model is
     stored; or the location of an existing model.
@@ -70,23 +70,23 @@ top-$K$ (for a given value of $K$) principle components.
 #### Examples
 
     hadoop jar SystemML.jar -f PCA.dml 
-                            -nvargs INPUT=/user/ml/input.mtx 
-                            K=10
-                            CENTER=1 
-                            SCALE=1O
-                            FMT=csv 
-                            PROJDATA=1
-                            OUTPUT=/user/ml/pca_output/
+                            -nvargs INPUT=/user/ml/input.mtx
+                                    K=10
+                                    CENTER=1
+                                    SCALE=1
+                                    OFMT=csv
+                                    PROJDATA=1
+                                    OUTPUT=/user/ml/pca_output/
 
-    hadoop jar SystemML.jar -f PCA.dml 
-                            -nvargs INPUT=/user/ml/test_input.mtx 
-                                    K=10 
+    hadoop jar SystemML.jar -f PCA.dml
+                            -nvargs INPUT=/user/ml/test_input.mtx
+                                    K=10
                                     CENTER=1
-                                    SCALE=1O 
-                                    FMT=csv 
+                                    SCALE=1
+                                    OFMT=csv
                                     PROJDATA=1
-                                    MODEL=/user/ml/pca_output/ 
-                                    OUTPUT=/user/ml/test_output.mtx  
+                                    MODEL=/user/ml/pca_output/
+                                    OUTPUT=/user/ml/test_output.mtx
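+
+The core computation of PCA is an eigen-decomposition of the covariance
+matrix of the (centered) input. A minimal sketch of this core in DML,
+assuming the returned eigenvectors are ordered by decreasing eigenvalue
+(`PCA.dml` itself additionally handles scaling, sorting of the
+components, and projection of the data):
+
+    # hypothetical DML fragment illustrating the PCA core
+    A = read($INPUT)
+    K = 10
+    C = A - colMeans(A)                 # center each column
+    S = (t(C) %*% C) / (nrow(A) - 1)    # sample covariance matrix
+    [evalues, evectors] = eigen(S)      # eigen-decomposition
+    write(evectors[, 1:K], $MODEL)      # top-K principal components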
 
 
 
@@ -141,3 +141,290 @@ stored at location OUTPUT.
 
 * * *
 
+## 5.2 Matrix Completion via Alternating Minimizations
+
+### Description
+
+Low-rank matrix completion is an effective technique for statistical
+data analysis that is widely used in data mining and machine learning
+applications. Matrix completion is a variant of low-rank matrix
+factorization with the goal of recovering a partially observed and
+potentially noisy matrix from a subset of its revealed entries. Perhaps
+the most popular application of matrix completion is collaborative
+filtering in recommender systems. In this setting, the rows in the data matrix
+correspond to users, the columns to items such as movies, and entries to
+feedback provided by users for items. The goal is to predict missing
+entries of the rating matrix. This implementation uses the alternating
+least-squares (ALS) technique for solving large-scale matrix completion
+problems.
+
+
+### Usage
+
+**ALS**:
+
+    hadoop jar SystemML.jar -f ALS.dml
+                            -nvargs V=<file>
+                                    L=<file>
+                                    R=<file>
+                                    rank=[int]
+                                    reg=[L2|wL2]
+                                    lambda=[double]
+                                    maxi=[int]
+                                    check=[boolean]
+                                    thr=[double]
+                                    fmt=[format]
+
+**ALS Prediction**:
+
+    hadoop jar SystemML.jar -f ALS_predict.dml
+                            -nvargs X=<file>
+                                    Y=<file>
+                                    L=<file>
+                                    R=<file>
+                                    Vrows=<int>
+                                    Vcols=<int>
+                                    fmt=[format]
+
+**ALS Top-K Prediction**:
+
+    hadoop jar SystemML.jar -f ALS_topk_predict.dml
+                            -nvargs X=<file>
+                                    Y=<file>
+                                    L=<file>
+                                    R=<file>
+                                    V=<file>
+                                    K=[int]
+                                    fmt=[format]
+
+
+### Arguments - ALS
+
+**V**: Location (on HDFS) to read the input (user-item) matrix $V$ to be
+factorized
+
+**L**: Location (on HDFS) to write the left (user) factor matrix $L$
+
+**R**: Location (on HDFS) to write the right (item) factor matrix $R$
+
+**rank**: (default: `10`) Rank of the factorization
+
+**reg**: (default: `L2`) Regularization:
+
+  * `L2` = `L2` regularization
+  * `wL2` = weighted `L2` regularization
+
+**lambda**: (default: `0.000001`) Regularization parameter
+
+**maxi**: (default: `50`) Maximum number of iterations
+
+**check**: (default: `FALSE`) Check for convergence after every iteration, i.e., after updating
+$L$ and $R$ once
+
+**thr**: (default: `0.0001`) Assuming `check=TRUE`, the algorithm stops and
+convergence is declared if the decrease in loss in any two consecutive
+iterations falls below threshold `thr`; if
+`check=FALSE`, parameter `thr` is ignored.
+
+**fmt**: (default: `"text"`) Matrix file output format, such as `text`,
+`mm`, or `csv`; see read/write functions in
+SystemML Language Reference for details.
+
+
+### Arguments - ALS Prediction/Top-K Prediction
+
+**X**: Location (on HDFS) to read the input matrix $X$ with following format:
+
+  * for `ALS_predict.dml`: a 2-column matrix that contains
+    the user-ids (first column) and the item-ids (second column)
+  * for `ALS_topk_predict.dml`: a 1-column matrix that
+    contains the user-ids
+
+**Y**: Location (on HDFS) to write the output of prediction with the following
+format:
+
+  * for `ALS_predict.dml`: a 3-column matrix that contains
+    the user-ids (first column), the item-ids (second column) and the
+    predicted ratings (third column)
+  * for `ALS_topk_predict.dml`: a (`K+1`)-column matrix
+    that contains the user-ids in the first column and the top-K
+    item-ids in the remaining `K` columns. Additionally, a matrix
+    with the same dimensions that contains the corresponding actual
+    top-K ratings will be stored at `Y.ratings`; see below for details
+
+**L**: Location (on HDFS) to read the left (user) factor matrix $L$
+
+**R**: Location (on HDFS) to read the right (item) factor matrix $R$
+
+**V**: Location (on HDFS) to read the user-item matrix $V$
+
+**Vrows**: Number of rows of $V$ (i.e., number of users)
+
+**Vcols**: Number of columns of $V$ (i.e., number of items)
+
+**K**: (default: `5`) Number of top-K items for top-K prediction
+
+**fmt**: (default: `"text"`) Matrix file output format, such as `text`,
+`mm`, or `csv`; see read/write functions in
+SystemML Language Reference for details.
+
+
+### Examples
+
+**ALS**:
+
+    hadoop jar SystemML.jar -f ALS.dml
+                            -nvargs V=/user/ml/V
+                                    L=/user/ml/L
+                                    R=/user/ml/R
+                                    rank=10
+                                    reg="wL2"
+                                    lambda=0.0001
+                                    maxi=50
+                                    check=TRUE
+                                    thr=0.001
+                                    fmt=csv
+
+
+**ALS Prediction**:
+
+To compute predicted ratings for a given list of users and items:
+
+    hadoop jar SystemML.jar -f ALS_predict.dml
+                            -nvargs X=/user/ml/X
+                                    Y=/user/ml/Y
+                                    L=/user/ml/L
+                                    R=/user/ml/R
+                                    Vrows=100000
+                                    Vcols=10000
+                                    fmt=csv
+
+
+**ALS Top-K Prediction**:
+
+To compute the top-K items with the highest predicted ratings together
+with the predicted ratings for a given list of users:
+
+    hadoop jar SystemML.jar -f ALS_topk_predict.dml
+                            -nvargs X=/user/ml/X
+                                    Y=/user/ml/Y
+                                    L=/user/ml/L
+                                    R=/user/ml/R
+                                    V=/user/ml/V
+                                    K=10
+                                    fmt=csv
+
+
+### Details
+
+Given an $m \times n$ input matrix $V$ and a rank parameter
+$r \ll \min{(m,n)}$, low-rank matrix factorization seeks to find an
+$m \times r$ matrix $L$ and an $r \times n$ matrix $R$ such that
+$V \approx LR$, i.e., we aim to approximate $V$ by the low-rank matrix
+$LR$. The quality of the approximation is determined by an
+application-dependent loss function $\mathcal{L}$. We aim at finding the
+loss-minimizing factor matrices, i.e.,
+
+$$
+\begin{equation}
+(L^*, R^*) = \textrm{argmin}_{L,R}{\mathcal{L}(V,L,R)}
+\end{equation}
+$$
+
+In the
+context of collaborative filtering in recommender systems, it is
+often the case that the input matrix $V$ contains several missing
+entries. Such entries are coded with the 0 value and the loss function
+is computed only based on the nonzero entries in $V$, i.e.,
+
+$$%\label{eq:loss}
+ \mathcal{L}=\sum_{(i,j)\in\Omega}l(V_{ij},L_{i*},R_{*j})$$
+
+where
+$$L_{i*}$$ and $$R_{*j}$$, respectively, denote the $i$th row of $L$ and the
+$j$th column of $R$, $$\Omega=\{\omega_1,\dots,\omega_N\}$$ denotes the
+training set containing the observed (nonzero) entries in $V$, and $l$
+is some local loss function.
+
+ALS is an optimization technique that can
+be used to solve quadratic problems. For matrix completion, the
+algorithm repeatedly keeps one of the unknown matrices ($L$ or $R$)
+fixed and optimizes the other one. In particular, ALS alternates between
+recomputing the rows of $L$ in one step and the columns of $R$ in the
+subsequent step. Our implementation of the ALS algorithm supports the
+loss functions summarized in [**Table 16**](algorithms-matrix-factorization.html#table16)
+commonly used for matrix completion [[ZhouWSP08]](algorithms-bibliography.html).
+
+* * *
+
+<a name="table16" />
+**Table 16**: Popular loss functions supported by our ALS implementation; $$N_{i*}$$
+  and $$N_{*j}$$, respectively, denote the number of nonzero entries in
+  row $i$ and column $j$ of $V$.
+
+| Loss                  | Definition |
+| --------------------- | ---------- |
+| $$\mathcal{L}_\text{Nzsl}$$     | $$\sum_{i,j:V_{ij}\neq 0} (V_{ij} - [LR]_{ij})^2$$
+| $$\mathcal{L}_\text{Nzsl+L2}$$  | $$\mathcal{L}_\text{Nzsl} + \lambda \Bigl( \sum_{ik} L_{ik}^2 + \sum_{kj} R_{kj}^2 \Bigr)$$
+| $$\mathcal{L}_\text{Nzsl+wL2}$$ | $$\mathcal{L}_\text{Nzsl} + \lambda \Bigl(\sum_{ik}N_{i*} L_{ik}^2 + \sum_{kj}N_{*j} R_{kj}^2 \Bigr)$$
+
+
+* * *
+
+
+Note that the matrix completion problem as defined in (1)
+is a non-convex problem for all loss functions from
+[**Table 16**](algorithms-matrix-factorization.html#table16). However, when fixing one of the matrices
+$L$ or $R$, we get a least-squares problem with a globally optimal
+solution. For example, for the case of $$\mathcal{L}_\text{Nzsl+wL2}$$ we
+have the following closed-form solutions:
+
+$$\begin{aligned}
+  L^\top_{n+1,i*} &\leftarrow (R^{(i)}_n {[R^{(i)}_n]}^\top + \lambda N_2 I)^{-1} R^{(i)}_n V^\top_{i*}, \\
+  R_{n+1,*j} &\leftarrow ({[L^{(j)}_{n+1}]}^\top L^{(j)}_{n+1} + \lambda N_1 I)^{-1} {[L^{(j)}_{n+1}]}^\top V_{*j},
+\end{aligned}
+$$
+
+where $$L_{n+1,i*}$$ (resp. $$R_{n+1,*j}$$) denotes the
+$i$th row of $$L_{n+1}$$ (resp. $j$th column of $$R_{n+1}$$), $\lambda$
+denotes the regularization parameter, $I$ is the identity matrix of
+appropriate dimensionality, $$V_{i*}$$ (resp. $$V_{*j}$$) denotes the
+revealed entries in row $i$ (column $j$), $$R^{(i)}_n$$ (resp.
+$$L^{(j)}_{n+1}$$) refers to the corresponding columns of $R_n$ (rows of
+$$L_{n+1}$$), and $N_1$ (resp. $N_2$) denotes a diagonal matrix that
+contains the number of nonzero entries in row $i$ (column $j$) of $V$.
+
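+To make the alternation concrete, below is a minimal sketch of a
+simplified ALS in DML for a *fully observed* $V$ under plain `L2`
+regularization; the shipped `ALS.dml` instead optimizes over the
+nonzero (revealed) entries only, per the update rules above:
+
+    # hypothetical DML fragment: simplified ALS for a dense matrix V
+    V = read($V)
+    r = 10                               # rank of the factorization
+    lambda = 0.000001                    # regularization parameter
+    L = rand(rows=nrow(V), cols=r)       # random initialization
+    R = rand(rows=r, cols=ncol(V))
+    lambdaI = diag(matrix(lambda, rows=r, cols=1))
+    for (iter in 1:50) {
+      # fix R, minimize over L:  L = V R^T (R R^T + lambda I)^{-1}
+      L = V %*% t(R) %*% inv(R %*% t(R) + lambdaI)
+      # fix L, minimize over R:  R = (L^T L + lambda I)^{-1} L^T V
+      R = inv(t(L) %*% L + lambdaI) %*% t(L) %*% V
+    }
+    write(L, $L)
+    write(R, $R)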
+
+**Prediction.** Based on the factor matrices computed by ALS we provide
+two prediction scripts (see the sketch after this list):
+
+  1. `ALS_predict.dml` computes the predicted ratings for a given
+list of users and items.
+  2. `ALS_topk_predict.dml` computes the top-K items (where $K$ is
+given as input) with the highest predicted ratings, together with the
+corresponding ratings, for a given list of users.
+
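+A minimal sketch of the underlying prediction rule, assuming the factor
+matrices were previously written by `ALS.dml`; the user-id `i` and
+item-id `j` are hypothetical script-level parameters:
+
+    # hypothetical DML fragment: predicted rating for user i and item j
+    L = read($L)                          # m x r user factors
+    R = read($R)                          # r x n item factors
+    i = $i
+    j = $j
+    rij = as.scalar(L[i, ] %*% R[, j])    # dot product of row i and column j
+    print("predicted rating: " + rij)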
+
+### Returns
+
+We output the factor matrices $L$ and $R$ after the algorithm has
+converged. The algorithm is declared converged if one of the two
+criteria is met: (1) the decrease in the value of the loss function falls
+below the threshold `thr` given as an input parameter (if parameter
+`check=TRUE`), or (2) the maximum number of iterations
+(defined as parameter `maxi`) is reached. Note that for a
+given user $i$ prediction is possible only if user $i$ has rated at
+least one item, i.e., row $i$ in matrix $V$ has at least one nonzero
+entry. In case some users have not rated any items, the corresponding
+factors in $L$ will be all 0s. Similarly, if some items have not been
+rated at all, the corresponding factors in $R$ will contain only 0s. Our
+prediction scripts output the predicted ratings for a given list of
+users and items as well as the top-K items with highest predicted
+ratings together with the predicted ratings for a given list of users.
+Note that the predictions will only be provided for the users who have
+rated at least one item, i.e., the corresponding rows contain at least
+one nonzero entry.
+
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/2a2520bd/algorithms-reference.md
----------------------------------------------------------------------
diff --git a/algorithms-reference.md b/algorithms-reference.md
index 58d4aa0..4685761 100644
--- a/algorithms-reference.md
+++ b/algorithms-reference.md
@@ -17,14 +17,26 @@ displayTitle: SystemML Algorithms Reference
     * [Binary-Class Support Vector Machines](algorithms-classification.html#binary-class-support-vector-machines)
     * [Multi-Class Support Vector Machines](algorithms-classification.html#multi-class-support-vector-machines)
   * [Naive Bayes](algorithms-classification.html#naive-bayes)
+  * [Decision Trees](algorithms-classification.html#decision-trees)
+  * [Random Forests](algorithms-classification.html#random-forests)
   
 * [Clustering](algorithms-clustering.html)
   * [K-Means Clustering](algorithms-clustering.html#k-means-clustering)
 
 * [Regression](algorithms-regression.html)
   * [Linear Regression](algorithms-regression.html#linear-regression)
+  * [Stepwise Linear Regression](algorithms-regression.html#stepwise-linear-regression)
   * [Generalized Linear Models](algorithms-regression.html#generalized-linear-models)
+  * [Stepwise Generalized Linear Regression](algorithms-regression.html#stepwise-generalized-linear-regression)
   * [Regression Scoring and Prediction](algorithms-regression.html#regression-scoring-and-prediction)
   
 * [Matrix Factorization](algorithms-matrix-factorization.html)
   * [Principle Component Analysis](algorithms-matrix-factorization.html#principle-component-analysis)
+  * [Matrix Completion via Alternating Minimizations](algorithms-matrix-factorization.html#matrix-completion-via-alternating-minimizations)
+
+* [Survival Analysis](algorithms-survival-analysis.html)
+  * [Kaplan-Meier Survival Analysis](algorithms-survival-analysis.html#kaplan-meier-survival-analysis)
+  * [Cox Proportional Hazard Regression Model](algorithms-survival-analysis.html#cox-proportional-hazard-regression-model)
+
+* [Bibliography](algorithms-bibliography.html)
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/2a2520bd/algorithms-regression.md
----------------------------------------------------------------------
diff --git a/algorithms-regression.md b/algorithms-regression.md
index 0302a18..f1da3c9 100644
--- a/algorithms-regression.md
+++ b/algorithms-regression.md
@@ -3,7 +3,6 @@ layout: global
 title: SystemML Algorithms Reference - Regression
 displayTitle: <a href="algorithms-reference.html">SystemML Algorithms Reference</a>
 ---
-      
 
 # 4. Regression
 
@@ -60,31 +59,31 @@ efficient when the number of features $m$ is relatively small
 
 ### Usage
 
-Linear Regression - Direct Solve
+**Linear Regression - Direct Solve**:
 
     hadoop jar SystemML.jar -f LinearRegDS.dml
-                            -nvargs X=file
-                                    Y=file
-                                    B=file
-                                    O=file
-                                    icpt=int
-                                    reg=double
-                                    fmt=format
+                            -nvargs X=<file>
+                                    Y=<file>
+                                    B=<file>
+                                    O=[file]
+                                    icpt=[int]
+                                    reg=[double]
+                                    fmt=[format]
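+
+The direct-solve script obtains $\beta$ from the regularized normal
+equations $(X^\top X + \lambda I)\,\beta = X^\top y$. A minimal sketch
+of this core step in DML (`LinearRegDS.dml` itself additionally handles
+the intercept, feature scaling, and the summary statistics):
+
+    # hypothetical DML fragment: regularized normal equations
+    X = read($X)
+    y = read($Y)
+    lambda = 0.000001
+    A = t(X) %*% X + diag(matrix(lambda, rows=ncol(X), cols=1))
+    b = t(X) %*% y
+    beta = solve(A, b)                  # direct solve for beta
+    write(beta, $B)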
 
 
-Linear Regression - Conjugate Gradient
+**Linear Regression - Conjugate Gradient**:
 
     hadoop jar SystemML.jar -f LinearRegCG.dml
-                            -nvargs X=file
-                                    Y=file
-                                    B=file
-                                    O=file
-                                    Log=file
-                                    icpt=int
-                                    reg=double
-                                    tol=double
-                                    maxi=int
-                                    fmt=format
+                            -nvargs X=<file>
+                                    Y=<file>
+                                    B=<file>
+                                    O=[file]
+                                    Log=[file]
+                                    icpt=[int]
+                                    reg=[double]
+                                    tol=[double]
+                                    maxi=[int]
+                                    fmt=[format]
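+
+The conjugate-gradient variant solves the same regularized normal
+equations iteratively, avoiding the materialization of $X^\top X$ when
+the number of features is large. A minimal sketch, assuming a fixed
+number of iterations and a start at $\beta = 0$:
+
+    # hypothetical DML fragment: CG on (t(X) X + lambda I) beta = t(X) y
+    X = read($X)
+    y = read($Y)
+    lambda = 0.000001
+    beta = matrix(0, rows=ncol(X), cols=1)
+    r = t(X) %*% y                      # residual for beta = 0
+    p = r
+    norm_r2 = sum(r ^ 2)
+    for (i in 1:50) {
+      q = t(X) %*% (X %*% p) + lambda * p
+      alpha = norm_r2 / as.scalar(t(p) %*% q)
+      beta = beta + alpha * p
+      r = r - alpha * q
+      old_norm_r2 = norm_r2
+      norm_r2 = sum(r ^ 2)
+      p = r + (norm_r2 / old_norm_r2) * p
+    }
+    write(beta, $B)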
 
 
 ### Arguments
@@ -107,26 +106,25 @@ iteration-specific variables for monitoring and debugging purposes, see
 [**Table 8**](algorithms-regression.html#table8)
 for details.
 
-**icpt**: (default: 0) Intercept presence and shifting/rescaling the features
+**icpt**: (default: `0`) Intercept presence and shifting/rescaling the features
 in $X$:
-<ul>
-<li>0 = no intercept (hence no $\beta_0$), no shifting or
-rescaling of the features</li>
-<li>1 = add intercept, but do not shift/rescale the features
-in $X$</li>
-<li>2 = add intercept, shift/rescale the features in $X$ to
-mean 0, variance 1</li>
-</ul>
-
-**reg**: (default: 0.000001) L2-regularization parameter $\lambda\geq 0$; set to nonzero for highly
+
+  * 0 = no intercept (hence no $\beta_0$), no shifting or
+rescaling of the features
+  * 1 = add intercept, but do not shift/rescale the features
+in $X$
+  * 2 = add intercept, shift/rescale the features in $X$ to
+mean 0, variance 1
+
+**reg**: (default: `0.000001`) L2-regularization parameter $\lambda\geq 0$; set to nonzero for highly
 dependent, sparse, or numerous ($m \gtrsim n/10$) features
 
-**tol**: (default: 0.000001, `LinearRegCG.dml` only) Tolerance $\varepsilon\geq 0$ used in the
+**tol**: (default: `0.000001`, `LinearRegCG.dml` only) Tolerance $\varepsilon\geq 0$ used in the
 convergence criterion: we terminate conjugate gradient iterations when
 the $\beta$-residual reduces in L2-norm by this factor
 
-**maxi**: (default: 0, `LinearRegCG.dml` only) Maximum number of conjugate
-gradient iterations, or 0 if no maximum limit provided
+**maxi**: (default: `0`, `LinearRegCG.dml` only) Maximum number of conjugate
+gradient iterations, or `0` if no maximum limit provided
 
 **fmt**: (default: `"text"`) Matrix file output format, such as `text`,
 `mm`, or `csv`; see read/write functions in
@@ -135,7 +133,7 @@ SystemML Language Reference for details.
 
 ### Examples
 
-Linear Regression - Direct Solve
+**Linear Regression - Direct Solve**:
 
     hadoop jar SystemML.jar -f LinearRegDS.dml
                             -nvargs X=/user/ml/X.mtx 
@@ -147,7 +145,7 @@ Linear Regression - Direct Solve
                                     reg=1.0
 
 
-Linear Regression - Conjugate Gradient
+**Linear Regression - Conjugate Gradient**:
 
     hadoop jar SystemML.jar -f LinearRegCG.dml
                             -nvargs X=/user/ml/X.mtx 
@@ -359,7 +357,124 @@ can also be made available, see [**Table 8**](algorithms-regression.html#table
 
 * * *
 
-## 4.2. Generalized Linear Models
+## 4.2. Stepwise Linear Regression
+
+### Description
+
+Our stepwise linear regression script selects a linear model based on
+the Akaike information criterion (AIC): the model that gives rise to the
+lowest AIC is computed.
+
+
+### Usage
+
+    hadoop jar SystemML.jar -f StepLinearRegDS.dml
+                            -nvargs X=<file>
+                                    Y=<file>
+                                    B=<file>
+                                    S=[file]
+                                    O=[file]
+                                    icpt=[int]
+                                    thr=[double]
+                                    fmt=[format]
+
+
+### Arguments
+
+**X**: Location (on HDFS) to read the matrix of feature vectors, each row
+contains one feature vector.
+
+**Y**: Location (on HDFS) to read the 1-column matrix of response values
+
+**B**: Location (on HDFS) to store the estimated regression parameters (the
+$\beta_j$’s), with the intercept parameter $\beta_0$ at position
+B\[$m\,{+}\,1$, 1\] if available
+
+**S**: (default: `" "`) Location (on HDFS) to store the selected feature-ids in the
+order as computed by the algorithm; by default the selected feature-ids
+are forwarded to the standard output.
+
+**O**: (default: `" "`) Location (on HDFS) to store the CSV-file of summary
+statistics defined in [**Table 7**](algorithms-regression.html#table7); by default the
+summary statistics are forwarded to the standard output.
+
+**icpt**: (default: `0`) Intercept presence and shifting/rescaling the features
+in $X$:
+
+  * 0 = no intercept (hence no $\beta_0$), no shifting or
+rescaling of the features;
+  * 1 = add intercept, but do not shift/rescale the features
+in $X$;
+  * 2 = add intercept, shift/rescale the features in $X$ to
+mean 0, variance 1
+
+**thr**: (default: `0.01`) Threshold to stop the algorithm: if the decrease in the value
+of the AIC falls below `thr`, no further features are
+checked and the algorithm stops.
+
+**fmt**: (default: `"text"`) Matrix file output format, such as `text`,
+`mm`, or `csv`; see read/write functions in
+SystemML Language Reference for details.
+
+
+### Examples
+
+    hadoop jar SystemML.jar -f StepLinearRegDS.dml
+                            -nvargs X=/user/ml/X.mtx
+                                    Y=/user/ml/Y.mtx
+                                    B=/user/ml/B.mtx
+                                    S=/user/ml/selected.csv
+                                    O=/user/ml/stats.csv
+                                    icpt=2
+                                    thr=0.05
+                                    fmt=csv
+
+
+### Details
+
+Stepwise linear regression iteratively selects predictive variables in
+an automated procedure. Currently, our implementation supports forward
+selection: starting from an empty model (without any variable) the
+algorithm examines the addition of each variable based on the AIC as a
+model comparison criterion. The AIC is defined as
+
+$$
+\begin{equation}
+AIC = -2 \log{L} + 2 edf,\label{eq:AIC}
+\end{equation}
+$$
+
+where $L$ denotes the
+likelihood of the fitted model and $edf$ is the equivalent degrees of
+freedom, i.e., the number of estimated parameters. This procedure is
+repeated until including no additional variable improves the model by a
+certain threshold specified in the input parameter `thr`.
+
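+For instance, for an ordinary least-squares model with Gaussian errors
+the maximized log-likelihood can be expressed through the residual sum
+of squares $RSS$, so that, up to an additive constant,
+
+$$AIC = n \log (RSS/n) + 2\,edf$$
+
+where $n$ is the number of observations; adding a variable is then
+worthwhile only if it reduces $n \log (RSS/n)$ by more than $2$.
+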
+For fitting a model in each iteration we use the `direct solve` method
+as in the script `LinearRegDS.dml` discussed in
+[Linear Regression](algorithms-regression.html#linear-regression).
+
+
+### Returns
+
+Similar to the outputs from `LinearRegDS.dml`, the stepwise
+linear regression script computes the estimated regression coefficients
+and stores them in matrix $B$ on HDFS. The format of matrix $B$ is
+identical to the one produced by the scripts for linear regression (see
+[Linear Regression](algorithms-regression.html#linear-regression)). Additionally, `StepLinearRegDS.dml`
+outputs the variable indices (stored in the 1-column matrix $S$) in the
+order they have been selected by the algorithm, i.e., the $i$th entry in
+matrix $S$ corresponds to the variable which improves the AIC the most
+in the $i$th iteration. If the model with the lowest AIC includes no
+variables, matrix $S$ will be empty (containing a single 0). Moreover, the
+estimated summary statistics as defined in [**Table 7**](algorithms-regression.html#table7)
+are printed out or stored in a file (if requested). In the case where an
+empty model achieves the best AIC, these statistics will not be produced.
+
+
+* * *
+
+## 4.3. Generalized Linear Models
 
 ### Description
 
@@ -384,7 +499,7 @@ $$x_i \,\,\,\,\mapsto\,\,\,\, \eta_i = \beta_0 + \sum\nolimits_{j=1}^m \beta_j x
 
 In linear regression the response mean $\mu_i$ *equals* some linear
 combination over $x_i$, denoted above by $\eta_i$. In logistic
-regression with $y\in\{0, 1\}$ (Bernoulli) the mean of $y$ is the same
+regression with $$y\in\{0, 1\}$$ (Bernoulli) the mean of $y$ is the same
 as $Prob[y=1]$
 and equals $1/(1+e^{-\eta_i})$, the logistic function of $\eta_i$. In
 GLM, $\mu_i$ and $\eta_i$ can be related via any given smooth monotone
@@ -407,23 +522,23 @@ distributions and link functions, see below for details.
 ### Usage
 
     hadoop jar SystemML.jar -f GLM.dml
-                            -nvargs X=file 
-                                    Y=file 
-                                    B=file 
-                                    fmt=format 
-                                    O=file 
-                                    Log=file 
-                                    dfam=int 
-                                    vpow=double 
-                                    link=int 
-                                    lpow=double 
-                                    yneg=double 
-                                    icpt=int 
-                                    reg=double 
-                                    tol=double 
-                                    disp=double 
-                                    moi=int 
-                                    mii=int
+                            -nvargs X=<file>
+                                    Y=<file>
+                                    B=<file>
+                                    fmt=[format]
+                                    O=[file]
+                                    Log=[file]
+                                    dfam=[int]
+                                    vpow=[double]
+                                    link=[int]
+                                    lpow=[double]
+                                    yneg=[double]
+                                    icpt=[int]
+                                    reg=[double]
+                                    tol=[double]
+                                    disp=[double]
+                                    moi=[int]
+                                    mii=[int]
 
 
 ### Arguments
@@ -448,14 +563,14 @@ by default it is standard output.
 **Log**: (default: `" "`) Location to store iteration-specific variables for monitoring
 and debugging purposes, see [**Table 10**](algorithms-regression.html#table10) for details.
 
-**dfam**: (default: 1) Distribution family code to specify
+**dfam**: (default: `1`) Distribution family code to specify
 $Prob[y\mid \mu]$,
 see [**Table 11**](algorithms-regression.html#table11):
 
   * 1 = power distributions with $Var(y) = \mu^{\alpha}$
   * 2 = binomial or Bernoulli
 
-**vpow**: (default: 0.0) When `dfam=1`, this provides the $q$ in
+**vpow**: (default: `0.0`) When `dfam=1`, this provides the $q$ in
 $Var(y) = a\mu^q$, the power
 dependence of the variance of $y$ on its mean. In particular, use:
 
@@ -464,7 +579,7 @@ dependence of the variance of $y$ on its mean. In particular, use:
   * 2.0 = Gamma
   * 3.0 = inverse Gaussian
 
-**link**: (default: 0) Link function code to determine the link
+**link**: (default: `0`) Link function code to determine the link
 function $\eta = g(\mu)$:
 
   * 0 = canonical link (depends on the distribution family),
@@ -475,7 +590,7 @@ see [**Table 11**](algorithms-regression.html#table11)
   * 4 = cloglog
   * 5 = cauchit
 
-**lpow**: (default: 1.0) When link=1, this provides the $s$ in
+**lpow**: (default: `1.0`) When link=1, this provides the $s$ in
 $\eta = \mu^s$, the power link function; `lpow=0.0` gives the
 log link $\eta = \log\mu$. Common power links:
 
@@ -485,13 +600,13 @@ log link $\eta = \log\mu$. Common power links:
   * 0.5 = sqrt 
   * 1.0 = identity
 
-**yneg**: (default: 0.0) When `dfam=2` and the response matrix $Y$ has
+**yneg**: (default: `0.0`) When `dfam=2` and the response matrix $Y$ has
 1 column, this specifies the $y$-value used for Bernoulli “No” label
 (“Yes” is $1$):
 0.0 when $y\in\\{0, 1\\}$; -1.0 when
 $y\in\\{-1, 1\\}$
 
-**icpt**: (default: 0) Intercept and shifting/rescaling of the features in $X$:
+**icpt**: (default: `0`) Intercept and shifting/rescaling of the features in $X$:
 
   * 0 = no intercept (hence no $\beta_0$), no
 shifting/rescaling of the features
@@ -500,18 +615,18 @@ in $X$
   * 2 = add intercept, shift/rescale the features in $X$ to
 mean 0, variance 1
 
-**reg**: (default: 0.0) L2-regularization parameter ($\lambda$)
+**reg**: (default: `0.0`) L2-regularization parameter ($\lambda$)
 
-**tol**: (default: 0.000001) Tolerance ($\varepsilon$) used in the convergence criterion: we
+**tol**: (default: `0.000001`) Tolerance ($\varepsilon$) used in the convergence criterion: we
 terminate the outer iterations when the deviance changes by less than
 this factor; see below for details
 
-**disp**: (default: 0.0) Dispersion parameter, or 0.0 to estimate it from
+**disp**: (default: `0.0`) Dispersion parameter, or 0.0 to estimate it from
 data
 
-**moi**: (default: 200) Maximum number of outer (Fisher scoring) iterations
+**moi**: (default: `200`) Maximum number of outer (Fisher scoring) iterations
 
-**mii**: (default: 0) Maximum number of inner (conjugate gradient) iterations, or 0
+**mii**: (default: `0`) Maximum number of inner (conjugate gradient) iterations, or 0
 if no maximum limit provided
 
 * * *
@@ -702,7 +817,7 @@ $$f(\beta; X, Y) \,\,=\,\, -\sum\nolimits_{i=1}^n \big(y_i\theta_i - b(\theta_i)
 \,+\,(\lambda/2) \sum\nolimits_{j=1}^m \beta_j^2\,\,\to\,\,\min$$ 
 
 where
-$\theta_i$ and $b(\theta_i)$ are from (5); note that $a$ and
+$\theta_i$ and $b(\theta_i)$ are from (6); note that $a$ and
 $c(y, a)$ are constant w.r.t. $\beta$ and can be ignored here. The
 canonical parameter $\theta_i$ depends on both $\beta$ and $x_i$:
 
@@ -748,7 +863,7 @@ $0.5\sqrt{m}\,/ \max\nolimits_i \|x_i\|_2$ and updated as described
 in [[Nocedal2006]](algorithms-bibliography.html). 
 The user can specify the maximum number of
 the outer and the inner iterations with input parameters
-moi and mii, respectively. The Fisher scoring
+`moi` and `mii`, respectively. The Fisher scoring
 algorithm terminates successfully if
 $2|\varDelta f(z; \beta)| < (D_1(\beta) + 0.1)\hspace{0.5pt}{\varepsilon}$
 where ${\varepsilon}> 0$ is a tolerance supplied by the user via
@@ -761,7 +876,7 @@ $$D_1(\beta) \,\,=\,\, 2 \cdot \big(Prob[Y \mid \!
 
 The deviance estimate is also produced as part of the output. Once the
 Fisher scoring algorithm terminates, if requested by the user, we
-estimate the dispersion $a$ from Eq. 5 using Pearson residuals
+estimate the dispersion $a$ from (6) using Pearson residuals
 
 $$
 \begin{equation}
@@ -772,7 +887,7 @@ $$
 and use it to adjust our deviance estimate:
 $D_{\hat{a}}(\beta) = D_1(\beta)/\hat{a}$. If input argument
 disp is 0.0 we estimate $\hat{a}$, otherwise
-we use its value as $a$. Note that in (6) $m$ counts
+we use its value as $a$. Note that in (7) $m$ counts
 the intercept ($m \leftarrow m+1$) if it is present.
 
 
@@ -817,7 +932,144 @@ to use the corresponding specialized scripts instead of GLM.
 
 * * *
 
-## 4.3. Regression Scoring and Prediction
+## 4.4. Stepwise Generalized Linear Regression
+
+### Description
+
+Our stepwise generalized linear regression script selects a model based
+on the Akaike information criterion (AIC): the model that gives rise to
+the lowest AIC is provided. Note that currently only the Bernoulli
+distribution family is supported (see below for details).
+
+
+### Usage
+
+    hadoop jar SystemML.jar -f StepGLM.dml
+                            -nvargs X=<file>
+                                    Y=<file>
+                                    B=<file>
+                                    S=[file]
+                                    O=[file]
+                                    link=[int]
+                                    yneg=[double]
+                                    icpt=[int]
+                                    tol=[double]
+                                    disp=[double]
+                                    moi=[int]
+                                    mii=[int]
+                                    thr=[double]
+                                    fmt=[format]
+
+
+### Arguments
+
+**X**: Location (on HDFS) to read the matrix of feature vectors; each row is an
+example.
+
+**Y**: Location (on HDFS) to read the response matrix, which may have 1 or 2
+columns
+
+**B**: Location (on HDFS) to store the estimated regression parameters (the
+$\beta_j$’s), with the intercept parameter $\beta_0$ at position
+B\[$m\,{+}\,1$, 1\] if available
+
+**S**: (default: `" "`) Location (on HDFS) to store the selected feature-ids in the
+order as computed by the algorithm; by default it is standard output.
+
+**O**: (default: `" "`) Location (on HDFS) to write certain summary statistics
+described in [**Table 9**](algorithms-regression.html#table9); by default it is standard
+output.
+
+**link**: (default: `2`) Link function code to determine the link
+function $\eta = g(\mu)$, see [**Table 11**](algorithms-regression.html#table11); currently the
+following link functions are supported:
+
+  * 1 = log
+  * 2 = logit
+  * 3 = probit
+  * 4 = cloglog
+
+**yneg**: (default: `0.0`) Response value for Bernoulli “No” label, usually 0.0 or -1.0
+
+**icpt**: (default: `0`) Intercept and shifting/rescaling of the features in $X$:
+
+  * 0 = no intercept (hence no $\beta_0$), no
+shifting/rescaling of the features
+  * 1 = add intercept, but do not shift/rescale the features
+in $X$
+  * 2 = add intercept, shift/rescale the features in $X$ to
+mean 0, variance 1
+
+**tol**: (default: `0.000001`) Tolerance ($\epsilon$) used in the convergence criterion: we
+terminate the outer iterations when the deviance changes by less than
+this factor; see below for details.
+
+**disp**: (default: `0.0`) Dispersion parameter, or `0.0` to estimate it from
+data
+
+**moi**: (default: `200`) Maximum number of outer (Fisher scoring) iterations
+
+**mii**: (default: `0`) Maximum number of inner (conjugate gradient) iterations, or 0
+if no maximum limit provided
+
+**thr**: (default: `0.01`) Threshold to stop the algorithm: if the decrease in the value
+of the AIC falls below `thr`, no further features are
+checked and the algorithm stops.
+
+**fmt**: (default: `"text"`) Matrix file output format, such as `text`,
+`mm`, or `csv`; see read/write functions in
+SystemML Language Reference for details.
+
+
+### Examples
+
+    hadoop jar SystemML.jar -f StepGLM.dml
+                            -nvargs X=/user/ml/X.mtx
+                                    Y=/user/ml/Y.mtx
+                                    B=/user/ml/B.mtx
+                                    S=/user/ml/selected.csv
+                                    O=/user/ml/stats.csv
+                                    link=2
+                                    yneg=-1.0
+                                    icpt=2
+                                    tol=0.000001
+                                    moi=100
+                                    mii=10
+                                    thr=0.05
+                                    fmt=csv
+
+
+### Details
+
+Similar to `StepLinearRegDS.dml`, our stepwise GLM script
+builds a model by iteratively selecting predictive variables using a
+forward selection strategy based on the AIC (5). Note that
+currently only the Bernoulli distribution family (`fam=2` in
+[**Table 11**](algorithms-regression.html#table11)) together with the following link functions
+are supported: `log`, `logit`, `probit`, and `cloglog` (link
+$$\in\{1,2,3,4\}$$ in [**Table 11**](algorithms-regression.html#table11)).
+
+
+### Returns
+
+Similar to the outputs from `GLM.dml`, the stepwise GLM script
+computes the estimated regression coefficients and stores them in matrix
+$B$ on HDFS; matrix $B$ follows the same format as the one produced by
+`GLM.dml` (see [Generalized Linear Models](algorithms-regression.html#generalized-linear-models)). Additionally,
+`StepGLM.dml` outputs the variable indices (stored in the
+1-column matrix $S$) in the order they have been selected by the
+algorithm, i.e., the $i$th entry in matrix $S$ stores the variable which
+improves the AIC the most in the $i$th iteration. If the model with the
+lowest AIC includes no variables, matrix $S$ will be empty. Moreover, the
+estimated summary statistics as defined in [**Table 9**](algorithms-regression.html#table9) are
+printed out or stored in a file on HDFS (if requested); these statistics
+will be provided only if the selected model is nonempty, i.e., contains
+at least one variable.
+
+
+* * *
+
+## 4.5. Regression Scoring and Prediction
 
 ### Description
 
@@ -874,7 +1126,7 @@ If $y_i$ is categorical, i.e. a vector of label counts for record $i$,
 then $\mu_i$ is a vector of non-negative real numbers, one number
 $$\mu_{i,l}$$ per each label $l$. In this case we divide the $$\mu_{i,l}$$
 by their sum $\sum_l \mu_{i,l}$ to obtain predicted label
-probabilities . The output matrix $M$ is the
+probabilities $$p_{i,l}\in [0, 1]$$. The output matrix $M$ is the
 $n \times (k\,{+}\,1)$-matrix of these probabilities, where $n$ is the
 number of records and $k\,{+}\,1$ is the number of categories[^3]. Note
 again that we do not predict the labels themselves, nor their actual
@@ -894,17 +1146,17 @@ this step outside the scope of `GLM-predict.dml` for now.
 ### Usage
 
     hadoop jar SystemML.jar -f GLM-predict.dml
-                            -nvargs X=file 
-                                    Y=file 
-                                    B=file 
-                                    M=file 
-                                    O=file 
-                                    dfam=int 
-                                    vpow=double 
-                                    link=int 
-                                    lpow=double 
-                                    disp=double  
-                                    fmt=format
+                            -nvargs X=<file>
+                                    Y=[file]
+                                    B=<file>
+                                    M=[file]
+                                    O=[file]
+                                    dfam=[int]
+                                    vpow=[double]
+                                    link=[int]
+                                    lpow=[double]
+                                    disp=[double]
+                                    fmt=[format]
 
 
 ### Arguments
@@ -946,7 +1198,7 @@ statistics defined in [**Table 13**](algorithms-regression.html#table13),
 the default is to
 print them to the standard output
 
-**dfam**: (default: 1) GLM distribution family code to specify the type of
+**dfam**: (default: `1`) GLM distribution family code to specify the type of
 distribution
 $Prob[y\,|\,\mu]$
 that we assume:
@@ -957,7 +1209,7 @@ $Var(y) = \mu^{\alpha}$, see
   * 2 = binomial
   * 3 = multinomial logit
 
-**vpow**: (default: 0.0) Power for variance defined as (mean)$^{\textrm{power}}$
+**vpow**: (default: `0.0`) Power for variance defined as (mean)$^{\textrm{power}}$
 (ignored if `dfam`$\,{\neq}\,1$): when `dfam=1`,
 this provides the $q$ in
 $Var(y) = a\mu^q$, the power
@@ -968,7 +1220,7 @@ dependence of the variance of $y$ on its mean. In particular, use:
   * 2.0 = Gamma
   * 3.0 = inverse Gaussian
 
-**link**: (default: 0) Link function code to determine the link
+**link**: (default: `0`) Link function code to determine the link
 function $\eta = g(\mu)$, ignored for multinomial logit
 (`dfam=3`):
 
@@ -980,7 +1232,7 @@ see [**Table 11**](algorithms-regression.html#table11)
   * 4 = cloglog
   * 5 = cauchit
 
-**lpow**: (default: 1.0) Power for link function defined as
+**lpow**: (default: `1.0`) Power for link function defined as
 (mean)$^{\textrm{power}}$ (ignored if `link`$\,{\neq}\,1$):
 when `link=1`, this provides the $s$ in $\eta = \mu^s$, the
 power link function; `lpow=0.0` gives the log link
@@ -992,7 +1244,7 @@ $\eta = \log\mu$. Common power links:
   * 0.5 = sqrt 
   * 1.0 = identity
 
-**disp**: (default: 1.0) Dispersion value, when available; must be positive
+**disp**: (default: `1.0`) Dispersion value, when available; must be positive
 
 **fmt**: (default: `"text"`) Matrix M file output format, such as
 `text`, `mm`, or `csv`; see read/write
@@ -1004,7 +1256,7 @@ functions in SystemML Language Reference for details.
 Note that in the examples below the value for the `disp` input
 argument is set arbitrarily. The correct dispersion value should be
 computed from the training data during model estimation, or omitted if
-unknown (which sets it to 1.0).
+unknown (which sets it to `1.0`).
 
 Linear regression example:
 
@@ -1263,7 +1515,7 @@ $k = \mathop{\texttt{ncol}}(Y) - 1$. Given the dispersion parameter
 dispersion is accurate, $X^2 / \texttt{disp}$ should be close to \#d.f.
 In fact, $X^2 / \textrm{\#d.f.}$ over the *training* data is the
 dispersion estimator used in our `GLM.dml` script,
-see (6). Here we provide $X^2 / \textrm{\#d.f.}$ and
+see (7). Here we provide $X^2 / \textrm{\#d.f.}$ and
 $X^2_{\texttt{disp}} / \textrm{\#d.f.}$ as
 `PEARSON_X2_BY_DF` to enable dispersion comparison between
 the training data and the test data.
@@ -1291,8 +1543,8 @@ $\mu_i^{\mathrm{sat}}$ to equal $y_i$ for every record (for categorical
 data, $$p_{i,j}^{sat} = y_{i,j} / N_i$$), which represents the
 “perfect fit.” For records with $y_{i,j} \in \{0, N_i\}$ or otherwise at
 a boundary, by continuity we set $0 \log 0 = 0$. The GLM likelihood
-functions defined in (5) become simplified in
-ratio (7) due to canceling out the term $c(y, a)$
+functions defined in (6) become simplified in
+ratio (8) due to canceling out the term $c(y, a)$
 since it is the same in both models.
 
 The log of a likelihood ratio between two nested models, times two, is
@@ -1384,8 +1636,8 @@ $m$ with the intercept or $m+1$ without the intercept.
 ### Returns
 
 The matrix of predicted means (if the response is numerical) or
-probabilities (if the response is categorical), see “Description”
-subsection above for more information. Given Y, we return
+probabilities (if the response is categorical), see Description
+subsection above for more information. Given `Y`, we return
 some statistics in CSV format as described in
 [**Table 13**](algorithms-regression.html#table13) and in the above text.
 

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/2a2520bd/algorithms-survival-analysis.md
----------------------------------------------------------------------
diff --git a/algorithms-survival-analysis.md b/algorithms-survival-analysis.md
new file mode 100644
index 0000000..b76ab91
--- /dev/null
+++ b/algorithms-survival-analysis.md
@@ -0,0 +1,709 @@
+---
+layout: global
+title: SystemML Algorithms Reference - Survival Analysis
+displayTitle: <a href="algorithms-reference.html">SystemML Algorithms Reference</a>
+---
+
+
+# 6. Survival Analysis
+
+## 6.1. Kaplan-Meier Survival Analysis
+
+### Description
+
+Survival analysis examines the time needed for a particular event of
+interest to occur. In medical research, for example, the prototypical
+such event is the death of a patient but the methodology can be applied
+to other application areas, e.g., completing a task by an individual in
+a psychological experiment or the failure of electrical components in
+engineering. Kaplan-Meier or (product limit) method is a simple
+non-parametric approach for estimating survival probabilities from both
+censored and uncensored survival times.
+
+
+### Usage
+
+    hadoop jar SystemML.jar -f KM.dml
+                            -nvargs X=<file>
+                                    TE=<file>
+                                    GI=<file>
+                                    SI=<file>
+                                    O=<file>
+                                    M=<file>
+                                    T=<file>
+                                    alpha=[double]
+                                    etype=[greenwood|peto]
+                                    ctype=[plain|log|log-log]
+                                    ttype=[none|log-rank|wilcoxon]
+                                    fmt=[format]
+
+
+### Arguments
+
+**X**: Location (on HDFS) to read the input matrix of the survival data
+containing:
+
+  * timestamps
+  * whether event occurred (1) or data is censored (0)
+  * a number of factors (i.e., categorical features) for grouping and/or
+stratifying
+
+**TE**: Location (on HDFS) to read the 1-column matrix $TE$ that contains the
+column indices of the input matrix $X$ corresponding to timestamps
+(first entry) and event information (second entry)
+
+**GI**: Location (on HDFS) to read the 1-column matrix $GI$ that contains the
+column indices of the input matrix $X$ corresponding to the factors
+(i.e., categorical features) to be used for grouping
+
+**SI**: Location (on HDFS) to read the 1-column matrix $SI$ that contains the
+column indices of the input matrix $X$ corresponding to the factors
+(i.e., categorical features) to be used for stratifying
+
+**O**: Location (on HDFS) to write the matrix containing the results of the
+Kaplan-Meier analysis $KM$
+
+**M**: Location (on HDFS) to write the matrix $M$ containing the following
+statistics: total number of events, median and its confidence intervals;
+if survival data for multiple groups and strata are provided, each row of
+$M$ contains the above statistics per group and stratum.
+
+**T**: If survival data from multiple groups is available and
+`ttype=log-rank` or `ttype=wilcoxon`, location (on
+HDFS) to write the two matrices that contain the results of the
+(stratified) test for comparing these groups; see below for details.
+
+**alpha**: (default: `0.05`) Parameter to compute $100(1-\alpha)\%$ confidence intervals
+for the survivor function and its median
+
+**etype**: (default: `"greenwood"`) Parameter to specify the error type according to `greenwood`
+or `peto`
+
+**ctype**: (default: `"log"`) Parameter to modify the confidence interval; `plain` keeps
+the lower and upper bound of the confidence interval unmodified, `log`
+corresponds to the log transformation, and `log-log` corresponds to the
+complementary log-log transformation
+
+**ttype**: (default: `"none"`) If survival data for multiple groups is available, specifies
+which test to perform for comparing survival data across multiple
+groups: `none`, `log-rank`, or `wilcoxon` test
+
+**fmt**: (default: `"text"`) Matrix file output format, such as `text`,
+`mm`, or `csv`; see read/write functions in
+SystemML Language Reference for details.
+
+
+### Examples
+
+    hadoop jar SystemML.jar -f KM.dml
+                            -nvargs X=/user/ml/X.mtx
+                                    TE=/user/ml/TE
+                                    GI=/user/ml/GI
+                                    SI=/user/ml/SI
+                                    O=/user/ml/kaplan-meier.csv
+                                    M=/user/ml/model.csv
+                                    alpha=0.01
+                                    etype=greenwood
+                                    ctype=plain
+                                    fmt=csv
+
+    hadoop jar SystemML.jar -f KM.dml
+                            -nvargs X=/user/ml/X.mtx
+                                    TE=/user/ml/TE
+                                    GI=/user/ml/GI
+                                    SI=/user/ml/SI
+                                    O=/user/ml/kaplan-meier.csv
+                                    M=/user/ml/model.csv
+                                    T=/user/ml/test.csv
+                                    alpha=0.01
+                                    etype=peto
+                                    ctype=log
+                                    ttype=log-rank
+                                    fmt=csv
+
+
+### Details
+
+The Kaplan-Meier estimate is a non-parametric maximum likelihood
+estimate (MLE) of the survival function $S(t)$, i.e., the probability of
+survival from the time origin to a given future time. As an illustration
+suppose that there are $n$ individuals with observed survival times
+$t_1,t_2,\ldots,t_n$ out of which there are $r\leq n$ distinct death
+times $$t_{(1)}\leq t_{(2)}\leq \cdots \leq t_{(r)}$$—there may be fewer
+distinct times since some of the observations may be censored, in the
+sense that the end-point of interest has not been observed for those
+individuals, and there may be more than one individual with the same
+survival time. Let $S(t_j)$ denote the probability of survival until
+time $t_j$, $d_j$ be the number of events at time $t_j$, and $n_j$
+denote the number of individuals at risk (i.e., those who die at time
+$t_j$ or later). Assuming that the events occur independently, in the
+Kaplan-Meier method the probability of surviving from $t_j$ to
+$$t_{j+1}$$ is estimated from $S(t_j)$ and given by
+
+$$\hat{S}(t) = \prod_{j=1}^{k} \left( \frac{n_j-d_j}{n_j} \right)$$
+
+for
+$$t_{(k)}\leq t<t_{(k+1)}$$, $$k=1,2,\ldots,r$$, $$\hat{S}(t)=1$$ for $$t<t_{(1)}$$,
+and $$t_{(r+1)}=\infty$$. Note that the value of $\hat{S}(t)$ is constant
+between event times and therefore the estimate is a step function
+with jumps at observed event times. If there are no censored data this
+estimator would simply reduce to the empirical survivor function defined
+as $\frac{n_j}{n}$. Thus, the Kaplan-Meier estimate can be seen as the
+generalization of the empirical survivor function that handles censored
+observations.
+
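+Concretely, a minimal sketch of this estimate in DML, assuming column
+vectors `d` (number of events) and `n` (number at risk) aligned with
+the distinct, sorted event times (`KM.dml` itself is considerably more
+general):
+
+    # hypothetical DML fragment: basic Kaplan-Meier estimate
+    d = read($d)                 # events per distinct event time
+    n = read($n)                 # individuals at risk per event time
+    S = cumprod((n - d) / n)     # S[k] = prod_{j<=k} (n_j - d_j) / n_j
+    write(S, $S)
+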
+The methodology used in our `KM.dml` script closely
+follows Section 2 of [[Collett2003]](algorithms-bibliography.html). For completeness we briefly
+discuss the equations used in our implementation.
+
+**Standard error of the survivor function.** The standard error of the
+estimated survivor function (controlled by parameter `etype`)
+can be calculated as
+
+$$\text{se} \{\hat{S}(t)\} \approx \hat{S}(t) {\biggl\{ \sum_{j=1}^{k} \frac{d_j}{n_j(n_j - d_j)}\biggr\}}^{1/2}$$
+
+for $$t_{(k)}\leq t<t_{(k+1)}$$. This equation is known as
+*Greenwood’s* formula. An alternative approach is to apply
+*Peto’s* expression
+
+$$\text{se}\{\hat{S}(t)\}=\frac{\hat{S}(t)\sqrt{1-\hat{S}(t)}}{\sqrt{n_k}}$$
+
+for $$t_{(k)}\leq t<t_{(k+1)}$$. Once the standard error of $\hat{S}$ has
+been found, we compute the following types of confidence intervals
+(controlled by parameter `ctype`): the `plain`
+$100(1-\alpha)\%$ confidence interval for $S(t)$ is computed using
+
+$$\hat{S}(t)\pm z_{\alpha/2} \text{se}\{\hat{S}(t)\}$$
+
+where
+$z_{\alpha/2}$ is the upper $\alpha/2$-point of the standard normal
+distribution. Alternatively, we can apply the `log` transformation using
+
+$$\hat{S}(t)^{\exp[\pm z_{\alpha/2} \text{se}\{\hat{S}(t)\}/\hat{S}(t)]}$$
+
+or the `log-log` transformation using
+
+$$\hat{S}(t)^{\exp [\pm z_{\alpha/2} \text{se} \{\log [-\log \hat{S}(t)]\}]}$$
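+
+A minimal sketch of the Greenwood standard error and the `plain`
+confidence interval in DML, reusing the hypothetical vectors `d`, `n`,
+and `S` from the sketch above:
+
+    # hypothetical DML fragment: Greenwood se and plain 95% CI
+    gw = S * sqrt(cumsum(d / (n * (n - d))))   # Greenwood's formula
+    z  = 1.96                                  # upper alpha/2 point for alpha = 0.05
+    lo = S - z * gw
+    hi = S + z * gw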
+
+**Median, its standard error and confidence interval.** Denote by
+$$\hat{t}(50)$$ the estimated median of $$\hat{S}$$, i.e.,
+$$\hat{t}(50)=\min \{ t_i \mid \hat{S}(t_i) < 0.5\}$$, where $$t_i$$ is the
+observed survival time for individual $$i$$. The standard error of
+$$\hat{t}(50)$$ is given by
+
+$$\text{se}\{ \hat{t}(50) \} = \frac{1}{\hat{f}\{\hat{t}(50)\}} \text{se}[\hat{S}\{ \hat{t}(50) \}]$$
+
+where $$\hat{f}\{ \hat{t}(50) \}$$ can be found from
+
+$$\hat{f}\{ \hat{t}(50) \} = \frac{\hat{S}\{ \hat{u}(50) \} -\hat{S}\{ \hat{l}(50) \} }{\hat{l}(50) - \hat{u}(50)}$$
+
+Above, $\hat{u}(50)$ is the largest survival time for which $\hat{S}$
+exceeds $0.5+\epsilon$, i.e.,
+$$\hat{u}(50)=\max \bigl\{ t_{(j)} \mid \hat{S}(t_{(j)}) \geq 0.5+\epsilon \bigr\}$$,
+and $\hat{l}(50)$ is the smallest survival time for which $\hat{S}$ is
+less than $0.5-\epsilon$, i.e.,
+$$\hat{l}(50)=\min \bigl\{ t_{(j)} \mid \hat{S}(t_{(j)}) \leq 0.5-\epsilon \bigr\}$$,
+for small $\epsilon$.
+
+**Log-rank test and Wilcoxon test.** Our implementation supports
+comparison of survival data from several groups using two non-parametric
+procedures (controlled by parameter `ttype`): the
+*log-rank test* and the *Wilcoxon test* (also
+known as the *Breslow test*). Assume that the survival
+times in $g\geq 2$ groups of survival data are to be compared. Consider
+the *null hypothesis* that there is no difference in the
+survival times of the individuals in different groups. One way to
+examine the null hypothesis is to consider the difference between the
+observed number of deaths with the numbers expected under the null
+hypothesis. In both tests we define the $U$-statistics ($$U_{L}$$ for the
+log-rank test and $$U_{W}$$ for the Wilcoxon test) to compare the observed
+and the expected number of deaths in $1,2,\ldots,g-1$ groups as follows:
+
+$$\begin{aligned}
+U_{Lk} &= \sum_{j=1}^{r}\left( d_{kj} - \frac{n_{kj}d_j}{n_j} \right) \\
+U_{Wk} &= \sum_{j=1}^{r}n_j\left( d_{kj} - \frac{n_{kj}d_j}{n_j} \right)\end{aligned}$$
+
+where $$d_{kj}$$ is the number of deaths at time $$t_{(j)}$$ in group $k$,
+$$n_{kj}$$ is the number of individuals at risk at time $$t_{(j)}$$ in group
+$k$, and $k=1,2,\ldots,g-1$ to form the vectors $U_L$ and $U_W$ with
+$(g-1)$ components. The covariance between $$U_{Lk}$$ and
+$$U_{Lk'}$$ (the variance when $k=k'$) is computed as
+
+$$V_{Lkk'}=\sum_{j=1}^{r} \frac{n_{kj}d_j(n_j-d_j)}{n_j(n_j-1)} \left( \delta_{kk'}-\frac{n_{k'j}}{n_j} \right)$$
+
+for $k,k'=1,2,\ldots,g-1$, with
+
+$$\delta_{kk'} = 
+\begin{cases}
+1 & \text{if } k=k'\\
+0 & \text{otherwise}
+\end{cases}$$
+
+These terms are combined in a
+*variance-covariance* matrix $V_L$ (referred to as the
+$V$-statistic). Similarly, the variance-covariance matrix for the
+Wilcoxon test $V_W$ is a matrix where the entry at position $(k,k')$ is
+given by
+
+$$V_{Wkk'}=\sum_{j=1}^{r} n_j^2 \frac{n_{kj}d_j(n_j-d_j)}{n_j(n_j-1)} \left( \delta_{kk'}-\frac{n_{k'j}}{n_j} \right)$$
+
+Under the null hypothesis of no group differences, the test statistics
+$U_L^\top V_L^{-1} U_L$ for the log-rank test and
+$U_W^\top V_W^{-1} U_W$ for the Wilcoxon test have a Chi-squared
+distribution on $(g-1)$ degrees of freedom. Our `KM.dml`
+script also provides a stratified version of the log-rank or Wilcoxon
+test if requested. In this case, the values of the $U$- and
+$V$-statistics are computed for each stratum and then combined over all
+strata.
+
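+The following sketch assembles $U_L$, $V_L$, and the resulting test
+statistic for the unstratified log-rank test (the function name and the
+SciPy call are illustrative, not part of `KM.dml`); weighting each
+summand by `n_j` (and by `n_j**2` in the variance) gives the Wilcoxon
+variant selected by `ttype`:
+
+    import numpy as np
+    from scipy.stats import chi2
+
+    def log_rank(time, event, group):
+        """Log-rank test across g groups: U^T V^{-1} U ~ Chi2(g-1) under H0."""
+        time, event, group = map(np.asarray, (time, event, group))
+        groups = np.unique(group)
+        g = len(groups)
+        U = np.zeros(g - 1)
+        V = np.zeros((g - 1, g - 1))
+        for t in np.unique(time[event == 1]):  # distinct event times t_(j)
+            at_risk = time >= t
+            n_j = at_risk.sum()
+            d_j = ((time == t) & (event == 1)).sum()
+            # per-group counts for the first g-1 groups
+            n_kj = np.array([(at_risk & (group == a)).sum() for a in groups[:-1]], float)
+            d_kj = np.array([((time == t) & (event == 1) & (group == a)).sum()
+                             for a in groups[:-1]], float)
+            U += d_kj - n_kj * d_j / n_j  # observed minus expected deaths
+            if n_j > 1:
+                c = d_j * (n_j - d_j) / (n_j * (n_j - 1.0))
+                V += c * (np.diag(n_kj) - np.outer(n_kj, n_kj) / n_j)
+        stat = float(U @ np.linalg.solve(V, U))  # U_L^T V_L^{-1} U_L
+        return stat, chi2.sf(stat, df=g - 1)
+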
+### Returns
+
+
+Below we list the results of the survival analysis computed by
+`KM.dml`. The calculated statistics are stored in matrix $KM$
+with the following schema:
+
+  * Column 1: timestamps
+  * Column 2: number of individuals at risk
+  * Column 3: number of events
+  * Column 4: Kaplan-Meier estimate of the survivor function $\hat{S}$
+  * Column 5: standard error of $\hat{S}$
+  * Column 6: lower bound of $100(1-\alpha)\%$ confidence interval for
+    $\hat{S}$
+  * Column 7: upper bound of $100(1-\alpha)\%$ confidence interval for
+    $\hat{S}$
+
+Note that if survival data for multiple groups and/or strata is
+available, each collection of 7 columns in $KM$ stores the results per
+group and/or per stratum. In this case $KM$ has $7g+7s$ columns, where
+$g\geq 1$ and $s\geq 1$ denote the number of groups and strata,
+respectively.
+
+Additionally, `KM.dml` stores the following statistics in the
+1-row matrix $M$ whose number of columns depends on the number of groups
+($g$) and strata ($s$) in the data. Below $k$ denotes the number of
+factors used for grouping and $l$ denotes the number of factors used for
+stratifying.
+
+  * Columns 1 to $k$: unique combination of values in the $k$ factors
+    used for grouping
+  * Columns $k+1$ to $k+l$: unique combination of values in the $l$
+    factors used for stratifying
+  * Column $k+l+1$: total number of records
+  * Column $k+l+2$: total number of events
+  * Column $k+l+3$: median of $\hat{S}$
+  * Column $k+l+4$: lower bound of $100(1-\alpha)\%$ confidence interval
+    for the median of $\hat{S}$
+  * Column $k+l+5$: upper bound of $100(1-\alpha)\%$ confidence interval
+    for the median of $\hat{S}$.
+
+If there is only 1 group and 1 stratum available, $M$ will be a 1-row
+matrix with 5 columns where
+
+  * Column 1: total number of records
+  * Column 2: total number of events
+  * Column 3: median of $\hat{S}$
+  * Column 4: lower bound of $100(1-\alpha)\%$ confidence interval for
+    the median of $\hat{S}$
+  * Column 5: upper bound of $100(1-\alpha)\%$ confidence interval for
+    the median of $\hat{S}$.
+
+If a comparison of the survival data across multiple groups needs to be
+performed, `KM.dml` computes two matrices $T$ and
+$$T\_GROUPS\_OE$$ that contain a summary of the test. The 1-row matrix $T$
+stores the following statistics:
+
+  * Column 1: number of groups in the survival data
+  * Column 2: degree of freedom for Chi-squared distributed test
+    statistic
+  * Column 3: value of test statistic
+  * Column 4: $P$-value.
+
+Matrix $$T\_GROUPS\_OE$$ contains the following statistics for each of $g$
+groups:
+
+  * Column 1: number of events
+  * Column 2: number of observed death times ($O$)
+  * Column 3: number of expected death times ($E$)
+  * Column 4: $(O-E)^2/E$
+  * Column 5: $(O-E)^2/V$.
+
+
+* * *
+
+## 6.2. Cox Proportional Hazard Regression Model
+
+### Description
+
+The Cox (proportional hazard or PH) model is a semi-parametric statistical
+approach commonly used for analyzing survival data. Unlike
+non-parametric approaches, e.g., the [Kaplan-Meier estimates](algorithms-survival-analysis.html#kaplan-meier-survival-analysis),
+which can be used to analyze a single sample of
+survival data or to compare between groups of survival times, the Cox PH
+model captures the dependency of the survival times on the values of
+*explanatory variables* (i.e., covariates) recorded for
+each individual at the time origin. Our focus is on covariates that do
+not change value over time, i.e., time-independent covariates, and that
+may be categorical (ordinal or nominal) as well as continuous-valued.
+
+
+### Usage
+
+**Cox**:
+
+    hadoop jar SystemML.jar -f Cox.dml
+                            -nvargs X=<file>
+                                    TE=<file>
+                                    F=<file>
+                                    R=[file]
+                                    M=<file>
+                                    S=[file]
+                                    T=[file]
+                                    COV=<file>
+                                    RT=<file>
+                                    XO=<file>
+                                    MF=<file>
+                                    alpha=[double]
+                                    tol=[double]
+                                    moi=[int]
+                                    mii=[int]
+                                    fmt=[format]
+
+**Cox Prediction**:
+
+    hadoop jar SystemML.jar -f Cox-predict.dml
+                            -nvargs X=<file>
+                                    RT=<file>
+                                    M=<file>
+                                    Y=<file>
+                                    COV=<file>
+                                    MF=<file>
+                                    P=<file>
+                                    fmt=[format]
+
+### Arguments - Cox Model Fitting/Prediction
+
+**X**: Location (on HDFS) to read the input matrix of the survival data
+containing:
+
+  * timestamps
+  * whether event occurred (1) or data is censored (0)
+  * feature vectors
+
+**Y**: Location (on HDFS) to read the matrix used for prediction
+
+**TE**: Location (on HDFS) to read the 1-column matrix $TE$ that contains the
+column indices of the input matrix $X$ corresponding to timestamps
+(first entry) and event information (second entry)
+
+**F**: Location (on HDFS) to read the 1-column matrix $F$ that contains the
+column indices of the input matrix $X$ corresponding to the features to
+be used for fitting the Cox model
+
+**R**: (default: `" "`) If factors (i.e., categorical features) are available in the
+input matrix $X$, location (on HDFS) to read matrix $R$ containing the
+start (first column) and end (second column) indices of each factor in
+$X$; alternatively, the user can specify the indices of the baseline level
+of each factor which needs to be removed from $X$. If $R$ is not
+provided, by default all variables are considered to be
+continuous-valued.
+
+**M**: Location (on HDFS) to store the results of Cox regression analysis
+including regression coefficients $\beta_j$s, their standard errors,
+confidence intervals, and $P$-values
+
+**S**: (default: `" "`) Location (on HDFS) to store a summary of some statistics of
+the fitted model including number of records, number of events,
+log-likelihood, AIC, Rsquare (Cox & Snell), and maximum possible Rsquare
+
+**T**: (default: `" "`) Location (on HDFS) to store the results of Likelihood ratio
+test, Wald test, and Score (log-rank) test of the fitted model
+
+**COV**: Location (on HDFS) to store the variance-covariance matrix of
+$\beta_j$s; note that parameter `COV` needs to be provided as
+input for prediction.
+
+**RT**: Location (on HDFS) to store matrix $RT$ containing the order-preserving
+recoded timestamps from $X$; note that parameter `RT` needs
+to be provided as input for prediction.
+
+**XO**: Location (on HDFS) to store the input matrix $X$ ordered by the
+timestamps; note that parameter `XO` needs to be provided as
+input for prediction.
+
+**MF**: Location (on HDFS) to store column indices of $X$ excluding the baseline
+factors if available; note that parameter `MF` needs to be
+provided as input for prediction.
+
+**P**: Location (on HDFS) to store matrix $P$ containing the results of
+prediction
+
+**alpha**: (default: `0.05`) Parameter to compute a $100(1-\alpha)\%$ confidence interval
+for $\beta_j$s
+
+**tol**: (default: `0.000001`) Tolerance ($\epsilon$) used in the convergence criterion
+
+**moi**: (default: `100`) Maximum number of outer (Fisher scoring) iterations
+
+**mii**: (default: `0`) Maximum number of inner (conjugate gradient) iterations, or `0`
+if no maximum limit is provided
+
+**fmt**: (default: `"text"`) Matrix file output format, such as `text`,
+`mm`, or `csv`; see read/write functions in
+SystemML Language Reference for details.
+
+
+### Examples
+
+**Cox**:
+
+    hadoop jar SystemML.jar -f Cox.dml
+                            -nvargs X=/user/ml/X.mtx
+                                    TE=/user/ml/TE
+                                    F=/user/ml/F
+                                    R=/user/ml/R
+                                    M=/user/ml/model.csv
+                                    T=/user/ml/test.csv
+                                    COV=/user/ml/var-covar.csv
+                                    XO=/user/ml/X-sorted.mtx
+                                    fmt=csv
+
+    hadoop jar SystemML.jar -f Cox.dml
+                            -nvargs X=/user/ml/X.mtx
+                                    TE=/user/ml/TE
+                                    F=/user/ml/F
+                                    R=/user/ml/R
+                                    M=/user/ml/model.csv
+                                    T=/user/ml/test.csv
+                                    COV=/user/ml/var-covar.csv
+                                    RT=/user/ml/recoded-timestamps.csv
+                                    XO=/user/ml/X-sorted.csv
+                                    MF=/user/ml/baseline.csv
+                                    alpha=0.01
+                                    tol=0.000001
+                                    moi=100
+                                    mii=20
+                                    fmt=csv
+
+**Cox Prediction**:
+
+    hadoop jar SystemML.jar -f Cox-predict.dml
+                            -nvargs X=/user/ml/X-sorted.mtx
+                                    RT=/user/ml/recoded-timestamps.csv
+                                    M=/user/ml/model.csv
+                                    Y=/user/ml/Y.mtx
+                                    COV=/user/ml/var-covar.csv
+                                    MF=/user/ml/baseline.csv
+                                    P=/user/ml/predictions.csv
+                                    fmt=csv
+
+
+### Details
+
+In the Cox PH regression model, the relationship between the hazard
+function — i.e., the probability of event occurrence at a given time — and
+the covariates is described as
+
+$$
+\begin{equation}
+h_i(t)=h_0(t)\exp\Bigl\{ \sum_{j=1}^{p} \beta_jx_{ij} \Bigr\}
+\end{equation}
+$$
+
+where the hazard function for the $i$th individual
+($$i\in\{1,2,\ldots,n\}$$) depends on a set of $p$ covariates
+$$x_i=(x_{i1},x_{i2},\ldots,x_{ip})$$, whose importance is measured by the
+magnitude of the corresponding coefficients
+$$\beta=(\beta_1,\beta_2,\ldots,\beta_p)$$. The term $$h_0(t)$$ is the
+baseline hazard and represents the hazard when all covariates equal
+0. In the Cox PH model the hazard function for the individuals may vary
+over time; however, the baseline hazard is estimated non-parametrically
+and can take any form. Note that rewriting (1) we have
+
+$$\log\biggl\{ \frac{h_i(t)}{h_0(t)} \biggr\} = \sum_{j=1}^{p} \beta_jx_{ij}$$
+
+Thus, the Cox PH model is essentially a linear model for the logarithm
+of the hazard ratio and the hazard of event for any individual is a
+constant multiple of the hazard of any other. We follow similar notation
+and methodology as in Section 3 of 
+[[Collett2003]](algorithms-bibliography.html). For
+completeness we briefly discuss the equations used in our
+implementation.
+
+**Factors in the model.** Note that if some of the feature variables are
+factors, they need to be *dummy coded* as follows. Let $\alpha$
+be such a variable (i.e., a factor) with $a$ levels. We introduce $a-1$
+indicator (or dummy coded) variables $X_2,X_3\ldots,X_a$ with $X_j=1$ if
+$\alpha=j$ and 0 otherwise, for $$j\in\{ 2,3,\ldots,a\}$$. In particular,
+one of $a$ levels of $\alpha$ will be considered as the baseline and is
+not included in the model. In our implementation, the user can specify a
+baseline level for each factor (as selecting the baseline level
+for each factor is arbitrary). On the other hand, if for a given factor
+$\alpha$ no baseline is specified by the user, the most frequent level
+of $\alpha$ will be considered as the baseline.
+
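+For illustration, dummy coding a single factor against a chosen baseline
+level might look as follows (the helper `dummy_code` is hypothetical and
+not part of `Cox.dml`):
+
+    import numpy as np
+
+    def dummy_code(x, baseline):
+        """Indicator columns for every level of factor x except the baseline."""
+        levels = [l for l in np.unique(x) if l != baseline]
+        return np.column_stack([(np.asarray(x) == l).astype(float) for l in levels])
+
+    # A factor with 3 levels; level 2 is the baseline, so 2 indicator columns remain.
+    X = dummy_code([1, 2, 3, 2, 1], baseline=2)
+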
+**Fitting the model.** We estimate the coefficients of the Cox model by
+minimizing the negative log-likelihood. In particular, the Cox PH model is
+fitted using a trust region Newton method with conjugate
+gradient [[Nocedal2006]](algorithms-bibliography.html). Define the risk set $R(t_j)$ at time
+$t_j$ to be the set of individuals who die at time $t_j$ or later. The
+PH model assumes that survival times are distinct. In order to handle
+tied observations we use the *Breslow* approximation of the likelihood
+function
+
+$$\mathcal{L}=\prod_{j=1}^{r} \frac{\exp(\beta^\top s_j)}{\biggl\{ \sum_{l\in R(t_j)} \exp(\beta^\top x_l) \biggr\}^{d_j} }$$
+
+where $d_j$ is the number of individuals who die at time $t_j$ and $s_j$
+denotes the element-wise sum of the covariates for those individuals who
+die at time $t_j$, $j=1,2,\ldots,r$, i.e., the $h$th element of $s_j$ is
+given by $$s_{hj}=\sum_{k=1}^{d_j}x_{hjk}$$, where $x_{hjk}$ is the value
+of $h$th variable ($$h\in \{1,2,\ldots,p\}$$) for the $k$th of the $d_j$
+individuals ($$k\in\{ 1,2,\ldots,d_j \}$$) who die at the $j$th death time
+($$j\in\{ 1,2,\ldots,r \}$$).
+
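+As a sketch, the corresponding objective, i.e., the negative log of
+$\mathcal{L}$, can be written as follows and handed to a generic
+optimizer (the NumPy formulation is illustrative; `Cox.dml` minimizes it
+in DML with the trust region Newton method mentioned above):
+
+    import numpy as np
+
+    def breslow_neg_log_lik(beta, X, time, event):
+        """Negative log of the Breslow partial likelihood."""
+        X, time, event = np.asarray(X, float), np.asarray(time), np.asarray(event)
+        eta = X @ beta  # linear predictor beta^T x_i per individual
+        nll = 0.0
+        for t in np.unique(time[event == 1]):  # distinct death times t_j
+            dead = (time == t) & (event == 1)
+            s_j = X[dead].sum(axis=0)  # element-wise sum of covariates of the deaths
+            d_j = dead.sum()
+            risk = time >= t           # risk set R(t_j)
+            nll += -beta @ s_j + d_j * np.log(np.exp(eta[risk]).sum())
+        return nll
+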
+**Standard error and confidence interval for coefficients.** Note that
+the variance-covariance matrix of the estimated coefficients
+$\hat{\beta}$ can be approximated by the inverse of the Hessian
+evaluated at $\hat{\beta}$. The square root of the diagonal elements of
+this matrix are the standard errors of the estimated coefficients. Once the
+standard errors of the coefficients $se(\hat{\beta})$ are obtained, we can
+compute a $100(1-\alpha)\%$ confidence interval using
+$$\hat{\beta}\pm z_{\alpha/2}se(\hat{\beta})$$, where $z_{\alpha/2}$ is
+the upper $\alpha/2$-point of the standard normal distribution. In
+`Cox.dml`, we utilize the built-in function
+`inv()` to compute the inverse of the Hessian. Note that this
+built-in function can be used only if the Hessian fits in the main
+memory of a single machine.
+
+**Wald test, likelihood ratio test, and log-rank test.** In order to
+test the *null hypothesis* that all of the coefficients
+$\beta_j$s are 0, our implementation provides three statistical tests:
+the *Wald test*, the *likelihood ratio test*, and the
+*log-rank test* (also known as the *score
+test*). Let $p$ be the number of coefficients. The Wald test is
+based on the test statistic ${\hat{\beta}}^2/{se(\hat{\beta})}^2$, which
+is compared to percentage points of the Chi-squared distribution to
+obtain the $P$-value. The likelihood ratio test relies on the test
+statistic $$-2\log\{ {L}(\textbf{0})/{L}(\hat{\beta}) \}$$ ($\textbf{0}$
+denotes a zero vector of size $p$), which has an approximate Chi-squared
+distribution with $p$ degrees of freedom under the null hypothesis that
+all $\beta_j$s are 0. The log-rank test is based on the test statistic
+$l=\nabla^\top L(\textbf{0}) {\mathcal{H}}^{-1}(\textbf{0}) \nabla L(\textbf{0})$,
+where $\nabla L(\textbf{0})$ is the gradient of $L$ and
+$\mathcal{H}(\textbf{0})$ is the Hessian of $L$ evaluated at **0**.
+Under the null hypothesis that $\beta=\textbf{0}$, $l$ has a Chi-squared
+distribution on $p$ degrees of freedom.
+
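+A sketch of how the three $P$-values follow from Chi-squared tail
+probabilities; the inputs (`V`, the variance-covariance matrix;
+`loglik0` and `loglik_hat`, the log-likelihoods at $\textbf{0}$ and
+$\hat{\beta}$; `grad0` and `info0`, the gradient and information matrix
+at $\textbf{0}$) are assumed to be precomputed:
+
+    import numpy as np
+    from scipy.stats import chi2
+
+    def wald_test(beta_hat, V):
+        """Global Wald test: beta^T V^{-1} beta ~ Chi2(p) under beta = 0."""
+        stat = float(beta_hat @ np.linalg.solve(V, beta_hat))
+        return stat, chi2.sf(stat, df=len(beta_hat))
+
+    def likelihood_ratio_test(loglik0, loglik_hat, p):
+        """LR test: -2 log{L(0)/L(beta_hat)} ~ Chi2(p) under beta = 0."""
+        stat = -2.0 * (loglik0 - loglik_hat)
+        return stat, chi2.sf(stat, df=p)
+
+    def score_test(grad0, info0):
+        """Score (log-rank) test: grad^T I^{-1} grad evaluated at beta = 0."""
+        stat = float(grad0 @ np.linalg.solve(info0, grad0))
+        return stat, chi2.sf(stat, df=len(grad0))
+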
+**Prediction.** Once the parameters of the model are fitted, we compute
+the following predictions together with their standard errors
+
+  * linear predictors
+  * risk
+  * estimated cumulative hazard
+
+Given feature vector $$X_i$$ for individual $$i$$, we obtain the above
+predictions at time $$t$$ as follows. The linear predictors (denoted as
+$$\mathcal{LP}$$) as well as the risk (denoted as $\mathcal{R}$) are
+computed relative to a baseline whose feature values are the mean of the
+values in the corresponding features. Let $$X_i^\text{rel} = X_i - \mu$$,
+where $$\mu$$ is a row vector that contains the mean values for each
+feature. We have $$\mathcal{LP}=X_i^\text{rel} \hat{\beta}$$ and
+$$\mathcal{R}=\exp\{ X_i^\text{rel}\hat{\beta} \}$$. The standard errors
+of the linear predictors $$se\{\mathcal{LP} \}$$ are computed as the
+square root of $${(X_i^\text{rel})}^\top V(\hat{\beta}) X_i^\text{rel}$$
+and the standard error of the risk $$se\{ \mathcal{R} \}$$ are given by
+the square root of
+$${(X_i^\text{rel} \odot \mathcal{R})}^\top V(\hat{\beta}) (X_i^\text{rel} \odot \mathcal{R})$$,
+where $$V(\hat{\beta})$$ is the variance-covariance matrix of the
+coefficients and $$\odot$$ is the element-wise multiplication.
+
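+A sketch of these two predictions and their standard errors, given the
+fitted coefficients `beta_hat` and their variance-covariance matrix `V`:
+
+    import numpy as np
+
+    def predict_lp_and_risk(X, beta_hat, V):
+        """Linear predictors and risk relative to the column means, with SEs."""
+        Xrel = X - X.mean(axis=0)  # X_i^rel = X_i - mu
+        lp = Xrel @ beta_hat       # LP = X^rel beta_hat
+        risk = np.exp(lp)          # R = exp(X^rel beta_hat)
+        # row-wise quadratic forms x V x^T
+        se_lp = np.sqrt(np.einsum("ij,jk,ik->i", Xrel, V, Xrel))
+        Xr = Xrel * risk[:, None]  # element-wise product X^rel * R
+        se_risk = np.sqrt(np.einsum("ij,jk,ik->i", Xr, V, Xr))
+        return lp, risk, se_lp, se_risk
+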
+We estimate the cumulative hazard function for individual $i$ by
+
+$$\hat{H}_i(t) = \exp(\hat{\beta}^\top X_i) \hat{H}_0(t)$$
+
+where
+$$\hat{H}_0(t)$$ is the *Breslow estimate* of the cumulative baseline
+hazard given by
+
+$$\hat{H}_0(t) = \sum_{j=1}^{k} \frac{d_j}{\sum_{l\in R(t_{(j)})} \exp(\hat{\beta}^\top X_l)}$$
+
+In the equation above, as before, $d_j$ is the number of deaths, and
+$$R(t_{(j)})$$ is the risk set at time $$t_{(j)}$$, for
+$$t_{(k)} \leq t \leq t_{(k+1)}$$, $$k=1,2,\ldots,r-1$$. The standard error
+of $$\hat{H}_i(t)$$ is obtained using the estimation
+
+$$se\{ \hat{H}_i(t) \} = \sum_{j=1}^{k} \frac{d_j}{ {\left[ \sum_{l\in R(t_{(j)})} \exp(X_l\hat{\beta}) \right]}^2 } + J_i^\top(t) V(\hat{\beta}) J_i(t)$$
+
+where
+
+$$J_i(t) = \sum_{j=1}^{k} d_j \frac{\sum_{l\in R(t_{(j)})} (X_l-X_i)\exp \{ (X_l-X_i)\hat{\beta} \}}{ {\left[ \sum_{l\in R(t_{(j)})} \exp\{(X_l-X_i)\hat{\beta}\} \right]}^2  }$$
+
+for $$t_{(k)} \leq t \leq t_{(k+1)}$$, $$k=1,2,\ldots,r-1$$.
+
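+A sketch of the Breslow estimate of the cumulative baseline hazard;
+$\hat{H}_i(t)$ then follows by scaling with $\exp(\hat{\beta}^\top X_i)$:
+
+    import numpy as np
+
+    def breslow_cum_baseline_hazard(t, X, time, event, beta_hat):
+        """H0_hat(t): sum of d_j / sum_{l in R(t_(j))} exp(beta^T x_l) over t_(j) <= t."""
+        X, time, event = np.asarray(X, float), np.asarray(time), np.asarray(event)
+        exp_eta = np.exp(X @ beta_hat)
+        H0 = 0.0
+        for tj in np.unique(time[event == 1]):  # sorted distinct event times
+            if tj > t:
+                break
+            d_j = ((time == tj) & (event == 1)).sum()
+            H0 += d_j / exp_eta[time >= tj].sum()  # d_j / risk-set sum of exp(beta^T x_l)
+        return H0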
+
+### Returns
+
+Below we list the results of fitting a Cox regression model stored in
+matrix $M$ with the following schema:
+
+  * Column 1: estimated regression coefficients $\hat{\beta}$
+  * Column 2: $\exp(\hat{\beta})$
+  * Column 3: standard error of the estimated coefficients
+    $se\{\hat{\beta}\}$
+  * Column 4: ratio of $\hat{\beta}$ to $se\{\hat{\beta}\}$ denoted by
+    $Z$
+  * Column 5: $P$-value of $Z$
+  * Column 6: lower bound of $100(1-\alpha)\%$ confidence interval for
+    $\hat{\beta}$
+  * Column 7: upper bound of $100(1-\alpha)\%$ confidence interval for
+    $\hat{\beta}$.
+
+Note that above $Z$ is the Wald test statistic which is asymptotically
+standard normal under the hypothesis that $\beta=\textbf{0}$.
+
+Moreover, `Cox.dml` outputs two log files `S` and
+`T` containing summary statistics of the fitted model, as
+follows. File `S` stores the following information
+
+  * Line 1: total number of observations
+  * Line 2: total number of events
+  * Line 3: log-likelihood (of the fitted model)
+  * Line 4: AIC
+  * Line 5: Cox & Snell Rsquare
+  * Line 6: maximum possible Rsquare.
+
+Above, the AIC is computed as in [[Stepwise Linear Regression]](algorithms-regression.html#stepwise-linear-regression), the Cox & Snell Rsquare
+is equal to $$1-\exp\{ -l/n \}$$, where $l$ is the log-rank test statistic
+as discussed above and $n$ is the total number of observations, and the
+maximum possible Rsquare is computed as $$1-\exp\{ -2 L(\textbf{0})/n \}$$,
+where $L(\textbf{0})$ denotes the initial likelihood.
+
+File `T` contains the following information
+
+  * Line 1: Likelihood ratio test statistic, degree of freedom of the
+    corresponding Chi-squared distribution, $P$-value
+  * Line 2: Wald test statistic, degree of freedom of the corresponding
+    Chi-squared distribution, $P$-value
+  * Line 3: Score (log-rank) test statistic, degree of freedom of the
+    corresponding Chi-squared distribution, $P$-value.
+
+Additionally, the following matrices will be stored. Note that these
+matrices are required for prediction.
+
+  * Order-preserving recoded timestamps $RT$, i.e., contiguously
+    numbered from 1 $\ldots$ \#timestamps
+  * Feature matrix ordered by the timestamps $XO$
+  * Variance-covariance matrix of the coefficients $COV$
+  * Column indices of the feature matrix with baseline factors removed
+    (if available) $MF$.
+
+**Prediction.** Finally, the results of prediction are stored in matrix
+$P$ with the following schema
+
+  * Column 1: linear predictors
+  * Column 2: standard error of the linear predictors
+  * Column 3: risk
+  * Column 4: standard error of the risk
+  * Column 5: estimated cumulative hazard
+  * Column 6: standard error of the estimated cumulative hazard.
+
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/2a2520bd/img/algorithms-reference-figure-1-computation-of-quartiles-median-and-interquartile-mean.png
----------------------------------------------------------------------
diff --git a/img/algorithms-reference-figure-1-computation-of-quartiles-median-and-interquartile-mean.png b/img/algorithms-reference-figure-1-computation-of-quartiles-median-and-interquartile-mean.png
deleted file mode 100644
index e53fbb7..0000000
Binary files a/img/algorithms-reference-figure-1-computation-of-quartiles-median-and-interquartile-mean.png and /dev/null differ

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/2a2520bd/img/algorithms-reference/computation-of-quartiles-median-and-interquartile-mean.png
----------------------------------------------------------------------
diff --git a/img/algorithms-reference/computation-of-quartiles-median-and-interquartile-mean.png b/img/algorithms-reference/computation-of-quartiles-median-and-interquartile-mean.png
new file mode 100644
index 0000000..e53fbb7
Binary files /dev/null and b/img/algorithms-reference/computation-of-quartiles-median-and-interquartile-mean.png differ

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/2a2520bd/img/algorithms-reference/example-tree.png
----------------------------------------------------------------------
diff --git a/img/algorithms-reference/example-tree.png b/img/algorithms-reference/example-tree.png
new file mode 100644
index 0000000..17e4d93
Binary files /dev/null and b/img/algorithms-reference/example-tree.png differ