Posted to commits@systemml.apache.org by du...@apache.org on 2015/12/02 02:04:40 UTC

[02/47] incubator-systemml git commit: Add Algorithms and Language References

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/a891c0a1/Algorithms Reference/Kmeans.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/Kmeans.tex b/Algorithms Reference/Kmeans.tex
new file mode 100644
index 0000000..0b8cbcd
--- /dev/null
+++ b/Algorithms Reference/Kmeans.tex	
@@ -0,0 +1,350 @@
+\subsection{K-Means Clustering}
+
+\noindent{\bf Description}
+\smallskip
+
+Given a collection of $n$ records with a pairwise similarity measure,
+the goal of clustering is to assign a category label to each record so that
+similar records tend to get the same label.  In contrast to multinomial
+logistic regression, clustering is an \emph{unsupervised}\/ learning problem
+with neither category assignments nor label interpretations given in advance.
+In $k$-means clustering, the records $x_1, x_2, \ldots, x_n$ are numerical
+feature vectors of $\dim x_i = m$ with the squared Euclidean distance 
+$\|x_i - x_{i'}\|_2^2$ as the similarity measure.  We want to partition
+$\{x_1, \ldots, x_n\}$ into $k$ clusters $\{S_1, \ldots, S_k\}$ so that
+the aggregated squared distance from records to their cluster means is
+minimized:
+\begin{equation}
+\textrm{WCSS}\,\,=\,\, \sum_{i=1}^n \,\big\|x_i - \mean(S_j: x_i\in S_j)\big\|_2^2 \,\,\to\,\,\min
+\label{eqn:WCSS}
+\end{equation}
+The aggregated distance measure in~(\ref{eqn:WCSS}) is called the
+\emph{within-cluster sum of squares}~(WCSS).  It can be viewed as a measure
+of residual variance that remains in the data after the clustering assignment,
+conceptually similar to the residual sum of squares~(RSS) in linear regression.
+However, unlike for the RSS, the minimization of~(\ref{eqn:WCSS}) is an NP-hard 
+problem~\cite{AloiseDHP2009:kmeans}.
+
+Rather than searching for the global optimum in~(\ref{eqn:WCSS}), a heuristic algorithm
+called Lloyd's algorithm is typically used.  This iterative algorithm maintains
+and updates a set of $k$~\emph{centroids} $\{c_1, \ldots, c_k\}$, one centroid per cluster.
+It defines each cluster $S_j$ as the set of all records closer to~$c_j$ than
+to any other centroid.  Each iteration of the algorithm reduces the WCSS in two steps:
+\begin{Enumerate}
+\item Assign each record to the closest centroid, which typically makes $\mean(S_j)\neq c_j$;
+\label{step:kmeans:recluster}
+\item Reset each centroid to its cluster's mean: $c_j := \mean(S_j)$.
+\label{step:kmeans:recenter}
+\end{Enumerate}
+After Step~\ref{step:kmeans:recluster} the centroids are generally different from the cluster
+means, so we can compute another ``within-cluster sum of squares'' based on the centroids:
+\begin{equation}
+\textrm{WCSS\_C}\,\,=\,\, \sum_{i=1}^n \,\big\|x_i - \mathop{\textrm{centroid}}(S_j: x_i\in S_j)\big\|_2^2
+\label{eqn:WCSS:C}
+\end{equation}
+This WCSS\_C after Step~\ref{step:kmeans:recluster} is less than the means-based WCSS
+before Step~\ref{step:kmeans:recluster} (or equal if convergence has been achieved), and in
+Step~\ref{step:kmeans:recenter} the WCSS cannot exceed the WCSS\_C for \emph{the same}
+clustering; hence the WCSS reduction.
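+In other words, if $\textrm{WCSS\_C}'$ denotes the centroid-based sum~(\ref{eqn:WCSS:C})
+computed right after Step~\ref{step:kmeans:recluster}, each iteration gives
+\begin{equation*}
+\textrm{WCSS}_{\mathrm{new}} \,\,\leq\,\, \textrm{WCSS\_C}' \,\,\leq\,\, \textrm{WCSS}_{\mathrm{old}},
+\end{equation*}
+where the second inequality holds because every record moves to its closest centroid, and the
+first because each cluster's mean minimizes the sum of squared distances within that cluster.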
+
+Exact convergence is reached when each record becomes closer to its
+cluster's mean than to any other cluster's mean, so there are no more re-assignments
+and the centroids coincide with the means.  In practice, iterations may be stopped
+when the reduction in WCSS (or in WCSS\_C) falls below a minimum threshold, or upon
+reaching the maximum number of iterations.  The initialization of the centroids is also
+an important part of the algorithm.  The smallest WCSS obtained by the algorithm is not
+guaranteed to be the global minimum and varies depending on the initial centroids.  We implement multiple
+parallel runs with different initial centroids and report the best result.
+
+\Paragraph{Scoring} 
+Our scoring script evaluates the clustering output by comparing it with a known category
+assignment.  Since cluster labels have no prior correspondence to the categories, we
+cannot count ``correct'' and ``wrong'' cluster assignments.  Instead, we quantify the
+agreement between the clustering and the categories in two ways:
+\begin{Enumerate}
+\item Count how many same-category and different-category pairs of records end up in the
+same cluster or in different clusters;
+\item For each category, count the prevalence of its most common cluster; for each
+cluster, count the prevalence of its most common category.
+\end{Enumerate}
+The number of categories and the number of clusters ($k$) do not have to be equal.  
+A same-category pair of records clustered into the same cluster is viewed as a
+``true positive,'' a different-category pair clustered together is a ``false positive,''
+a same-category pair clustered apart is a ``false negative''~etc.
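+
+For illustration, the pair-based counting above can be sketched in Python as follows
+(a conceptual sketch, not the DML scoring script; the function and variable names are ours,
+and the loop is quadratic in the number of records):
+\begin{verbatim}
+from itertools import combinations
+
+def pair_counts(categories, clusters):
+    # Classify every pair of records by category agreement vs. cluster agreement.
+    tp = fp = fn = tn = 0
+    for i, j in combinations(range(len(categories)), 2):
+        same_cat = categories[i] == categories[j]
+        same_clu = clusters[i] == clusters[j]
+        if same_cat and same_clu:        tp += 1   # "true positive"
+        elif not same_cat and same_clu:  fp += 1   # "false positive"
+        elif same_cat and not same_clu:  fn += 1   # "false negative"
+        else:                            tn += 1   # "true negative"
+    return tp, fp, fn, tn
+\end{verbatim}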
+
+
+\smallskip
+\noindent{\bf Usage: K-means Script}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}Kmeans.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} C=}path/file
+{\tt{} k=}int
+{\tt{} runs=}int
+{\tt{} maxi=}int
+{\tt{} tol=}double
+{\tt{} samp=}int
+{\tt{} isY=}int
+{\tt{} Y=}path/file
+{\tt{} fmt=}format
+{\tt{} verb=}int
+
+}
+
+\smallskip
+\noindent{\bf Usage: K-means Scoring/Prediction}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}Kmeans-predict.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} C=}path/file
+{\tt{} spY=}path/file
+{\tt{} prY=}path/file
+{\tt{} fmt=}format
+{\tt{} O=}path/file
+
+}
+
+\smallskip
+\noindent{\bf Arguments}
+\begin{Description}
+\item[{\tt X}:]
+Location to read matrix $X$ with the input data records as rows
+\item[{\tt C}:] (default:\mbox{ }{\tt "C.mtx"})
+Location to store the output matrix with the best available cluster centroids as rows
+\item[{\tt k}:]
+Number of clusters (and centroids)
+\item[{\tt runs}:] (default:\mbox{ }{\tt 10})
+Number of parallel runs, each run with different initial centroids
+\item[{\tt maxi}:] (default:\mbox{ }{\tt 1000})
+Maximum number of iterations per run
+\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
+Tolerance (epsilon) for single-iteration WCSS\_C change ratio
+\item[{\tt samp}:] (default:\mbox{ }{\tt 50})
+Average number of records per centroid in data samples used in the centroid
+initialization procedure
+\item[{\tt Y}:] (default:\mbox{ }{\tt "Y.mtx"})
+Location to store the one-column matrix $Y$ with the best available mapping of
+records to clusters (defined by the output centroids)
+\item[{\tt isY}:] (default:\mbox{ }{\tt 0})
+{\tt 0} = do not write matrix~$Y$,  {\tt 1} = write~$Y$
+\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+see read/write functions in SystemML Language Reference for details.
+\item[{\tt verb}:] (default:\mbox{ }{\tt 0})
+{\tt 0} = do not print per-iteration statistics for each run, {\tt 1} = print them
+(the ``verbose'' option)
+\end{Description}
+\smallskip
+\noindent{\bf Arguments --- Scoring/Prediction}
+\begin{Description}
+\item[{\tt X}:] (default:\mbox{ }{\tt " "})
+Location to read matrix $X$ with the input data records as rows,
+optional when {\tt prY} input is provided
+\item[{\tt C}:] (default:\mbox{ }{\tt " "})
+Location to read matrix $C$ with cluster centroids as rows, optional
+when {\tt prY} input is provided; NOTE: if both {\tt X} and {\tt C} are
+provided, {\tt prY} is an output, not input
+\item[{\tt spY}:] (default:\mbox{ }{\tt " "})
+Location to read a one-column matrix with the externally specified ``true''
+assignment of records (rows) to categories, optional for prediction without
+scoring
+\item[{\tt prY}:] (default:\mbox{ }{\tt " "})
+Location to read (or write, if {\tt X} and {\tt C} are present) a
+column-vector with the predicted assignment of rows to clusters;
+NOTE: No prior correspondence is assumed between the predicted
+cluster labels and the externally specified categories
+\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+Matrix file output format for {\tt prY}, such as {\tt text}, {\tt mm},
+or {\tt csv}; see read/write functions in SystemML Language Reference
+for details
+\item[{\tt O}:] (default:\mbox{ }{\tt " "})
+Location to write the output statistics defined in 
+Table~\ref{table:kmeans:predict:stats}, by default print them to the
+standard output
+\end{Description}
+
+
+\begin{table}[t]\small\centerline{%
+\begin{tabular}{|lcl|}
+\hline
+Name & CID & Meaning \\
+\hline
+{\tt TSS}             &     & Total Sum of Squares (from the total mean) \\
+{\tt WCSS\_M}         &     & Within-Cluster  Sum of Squares (means as centers) \\
+{\tt WCSS\_M\_PC}     &     & Within-Cluster  Sum of Squares (means), in \% of TSS \\
+{\tt BCSS\_M}         &     & Between-Cluster Sum of Squares (means as centers) \\
+{\tt BCSS\_M\_PC}     &     & Between-Cluster Sum of Squares (means), in \% of TSS \\
+\hline
+{\tt WCSS\_C}         &     & Within-Cluster  Sum of Squares (centroids as centers) \\
+{\tt WCSS\_C\_PC}     &     & Within-Cluster  Sum of Squares (centroids), \% of TSS \\
+{\tt BCSS\_C}         &     & Between-Cluster Sum of Squares (centroids as centers) \\
+{\tt BCSS\_C\_PC}     &     & Between-Cluster Sum of Squares (centroids), \% of TSS \\
+\hline
+{\tt TRUE\_SAME\_CT}  &     & Same-category pairs predicted as Same-cluster, count \\
+{\tt TRUE\_SAME\_PC}  &     & Same-category pairs predicted as Same-cluster, \% \\
+{\tt TRUE\_DIFF\_CT}  &     & Diff-category pairs predicted as Diff-cluster, count \\
+{\tt TRUE\_DIFF\_PC}  &     & Diff-category pairs predicted as Diff-cluster, \% \\
+{\tt FALSE\_SAME\_CT} &     & Diff-category pairs predicted as Same-cluster, count \\
+{\tt FALSE\_SAME\_PC} &     & Diff-category pairs predicted as Same-cluster, \% \\
+{\tt FALSE\_DIFF\_CT} &     & Same-category pairs predicted as Diff-cluster, count \\
+{\tt FALSE\_DIFF\_PC} &     & Same-category pairs predicted as Diff-cluster, \% \\
+\hline
+{\tt SPEC\_TO\_PRED}  & $+$ & For specified category, the best predicted cluster id \\
+{\tt SPEC\_FULL\_CT}  & $+$ & For specified category, its full count \\
+{\tt SPEC\_MATCH\_CT} & $+$ & For specified category, best-cluster matching count \\
+{\tt SPEC\_MATCH\_PC} & $+$ & For specified category, \% of matching to full count \\
+{\tt PRED\_TO\_SPEC}  & $+$ & For predicted cluster, the best specified category id \\
+{\tt PRED\_FULL\_CT}  & $+$ & For predicted cluster, its full count \\
+{\tt PRED\_MATCH\_CT} & $+$ & For predicted cluster, best-category matching count \\
+{\tt PRED\_MATCH\_PC} & $+$ & For predicted cluster, \% of matching to full count \\
+\hline
+\end{tabular}}
+\caption{The {\tt O}-file for {\tt Kmeans-predict} provides the output statistics
+in CSV format, one statistic per line, as (NAME, [CID], VALUE) triples.  Note:
+the 1st group statistics are given if {\tt X} input is available;
+the 2nd group statistics are given if {\tt X} and {\tt C} inputs are available;
+the 3rd and 4th group statistics are given if {\tt spY} input is available;
+only the 4th group statistics contain a nonempty CID value;
+when present, CID contains either the specified category label or the
+predicted cluster label.}
+\label{table:kmeans:predict:stats}
+\end{table}
+
+
+\noindent{\bf Details}
+\smallskip
+
+Our clustering script proceeds in 3~stages: centroid initialization,
+parallel $k$-means iterations, and the best-available output generation.
+Centroids are initialized at random from the input records (the rows of~$X$),
+biased towards being chosen far apart from each other.  The initialization
+method is based on the {\tt k-means++} heuristic from~\cite{ArthurVassilvitskii2007:kmeans},
+with one important difference: to reduce the number of passes through~$X$,
+we take a small sample of $X$ and run the {\tt k-means++} heuristic over
+this sample.  Here is, conceptually, our centroid initialization algorithm
+for one clustering run:
+\begin{Enumerate}
+\item Sample the rows of~$X$ uniformly at random, picking each row with probability
+$p = ks / n$ where
+\begin{Itemize}
+\item $k$~is the number of centroids, 
+\item $n$~is the number of records, and
+\item $s$~is the {\tt samp} input parameter.
+\end{Itemize}
+If $ks \geq n$, the entire $X$ is used in place of its sample.
+\item Choose the first centroid uniformly at random from the sampled rows.
+\item Choose each subsequent centroid from the sampled rows, at random, with
+probability proportional to the squared Euclidean distance between the row and
+the nearest already-chosen centroid.
+\end{Enumerate}
+The sampling of $X$ and the selection of centroids are performed independently
+and in parallel for each run of the $k$-means algorithm.  When we sample the
+rows of~$X$, rather than tossing a random coin for each row, we compute the
+number of rows to skip until the next sampled row as $\lceil \log(u) / \log(1 - p) \rceil$
+where $u\in (0, 1)$ is uniformly random.  This time-saving trick works because
+\begin{equation*}
+\Prob [k-1 < \log_{1-p}(u) < k] \,\,=\,\, p(1-p)^{k-1} \,\,=\,\,
+\Prob [\textrm{skip $k-1$ rows}]
+\end{equation*}
+However, it requires us to estimate the maximum sample size, which we set
+near~$ks + 10\sqrt{ks}$ to make it generous enough.
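+
+The skip-based sampling above can be sketched as follows (a conceptual Python sketch under
+the stated distribution, not the DML code; the names are ours):
+\begin{verbatim}
+import math, random
+
+def sample_row_indices(n, p):
+    # Pick each of n row indices with probability p by jumping ahead a
+    # geometrically distributed number of rows instead of tossing n coins.
+    picked, i = [], -1
+    log_q = math.log(1.0 - p)
+    while True:
+        u = random.random()
+        if u <= 0.0:                                # keep u strictly inside (0, 1)
+            continue
+        i += int(math.ceil(math.log(u) / log_q))    # gap to the next picked row
+        if i >= n:
+            return picked
+        picked.append(i)
+\end{verbatim}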
+
+Once the initial centroid sets have been selected, we start the $k$-means iterations
+independently in parallel for all clustering runs.  The number of clustering runs
+is given as the {\tt runs} input parameter.  Each iteration of each clustering run
+performs the following steps:
+\begin{Itemize}
+\item Compute the centroid-dependent part of squared Euclidean distances from
+all records (rows of~$X$) to each of the $k$~centroids using matrix product;
+\item Take the minimum of the above for each record;
+\item Update the current within-cluster sum of squares (WCSS) value, with centroids
+substituted instead of the means for efficiency;
+\item Check the convergence criterion:\hfil
+$\textrm{WCSS}_{\mathrm{old}} - \textrm{WCSS}_{\mathrm{new}} < \eps\cdot\textrm{WCSS}_{\mathrm{new}}$\linebreak
+as well as the number of iterations limit;
+\item Find the closest centroid for each record, sharing equally any records with multiple
+closest centroids;
+\item Compute the number of records closest to each centroid, checking for ``runaway''
+centroids with no records left (in which case the run fails);
+\item Compute the new centroids by averaging the records in their clusters.
+\end{Itemize}
+When a termination condition is satisfied, we store the centroids and the WCSS value
+and exit this run.  A run has to satisfy the WCSS convergence criterion to be considered
+successful.  Upon the termination of all runs, we select the smallest WCSS value among
+the successful runs, and write out this run's centroids.  If requested, we also compute
+the cluster assignment of all records in~$X$, using integers from 1 to~$k$ as the cluster
+labels.  The scoring script can then be used to compare the cluster assignment with
+an externally specified category assignment.
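+
+Conceptually, one run's iterations can be summarized by the following NumPy sketch (a
+conceptual sketch, not the DML implementation, which differs in details such as sharing
+tied records and the exact bookkeeping):
+\begin{verbatim}
+import numpy as np
+
+def kmeans_run(X, C, eps=1e-6, max_iter=1000):
+    # X: n-by-m data records; C: k-by-m initial centroids.
+    wcss_old = np.inf
+    for _ in range(max_iter):
+        # Centroid-dependent part of squared distances via a matrix product.
+        D = -2.0 * (X @ C.T) + np.sum(C * C, axis=1)
+        closest = np.argmin(D, axis=1)           # nearest centroid per record
+        wcss = np.sum(np.min(D, axis=1)) + np.sum(X * X)
+        if wcss_old - wcss < eps * wcss:         # convergence criterion
+            break
+        wcss_old = wcss
+        counts = np.bincount(closest, minlength=C.shape[0])
+        if np.any(counts == 0):                  # "runaway" centroid: the run fails
+            raise RuntimeError("empty cluster")
+        C = np.zeros_like(C)                     # recompute centroids as cluster means
+        np.add.at(C, closest, X)
+        C = C / counts[:, None]
+    return C, closest, wcss
+\end{verbatim}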
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+We output the $k$ centroids for the best available clustering, i.e.\ the one whose WCSS
+is the smallest among all successful runs.
+The centroids are written as the rows of the $k\,{\times}\,m$-matrix into the output
+file whose path/name was provided as the ``{\tt C}'' input argument.  If the input
+parameter ``{\tt isY}'' was set to~{\tt 1}, we also output the one-column matrix with
+the cluster assignment for all the records.  This assignment is written into the
+file whose path/name was provided as the ``{\tt Y}'' input argument.
+The best WCSS value, as well as some information about the performance of the other
+runs, is printed during the script execution.  The scoring script {\tt Kmeans-predict}
+prints all its results in a self-explanatory manner, as defined in
+Table~\ref{table:kmeans:predict:stats}.
+
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+\hml -f Kmeans.dml -nvargs X=/user/biadmin/X.mtx k=5 C=/user/biadmin/centroids.mtx fmt=csv
+
+}
+
+{\hangindent=\parindent\noindent\tt
+\hml -f Kmeans.dml -nvargs X=/user/biadmin/X.mtx k=5 runs=100 maxi=5000 
+tol=0.00000001 samp=20 C=/user/biadmin/centroids.mtx isY=1 Y=/user/biadmin/Yout.mtx verb=1
+
+}
+\noindent To predict {\tt Y} given {\tt X} and {\tt C}:
+
+{\hangindent=\parindent\noindent\tt
+\hml -f Kmeans-predict.dml -nvargs X=/user/biadmin/X.mtx
+         C=/user/biadmin/C.mtx prY=/user/biadmin/PredY.mtx O=/user/biadmin/stats.csv
+
+}
+\noindent To compare ``actual'' labels {\tt spY} with ``predicted'' labels given {\tt X} and {\tt C}:
+
+{\hangindent=\parindent\noindent\tt
+\hml -f Kmeans-predict.dml -nvargs X=/user/biadmin/X.mtx
+         C=/user/biadmin/C.mtx spY=/user/biadmin/Y.mtx O=/user/biadmin/stats.csv
+
+}
+\noindent To compare ``actual'' labels {\tt spY} with given ``predicted'' labels {\tt prY}:
+
+{\hangindent=\parindent\noindent\tt
+\hml -f Kmeans-predict.dml -nvargs spY=/user/biadmin/Y.mtx prY=/user/biadmin/PredY.mtx O=/user/biadmin/stats.csv
+
+}
+
+\smallskip
+\noindent{\bf References}
+\begin{itemize}
+\item
+D.~Aloise, A.~Deshpande, P.~Hansen, and P.~Popat.
+\newblock {NP}-hardness of {E}uclidean sum-of-squares clustering.
+\newblock {\em Machine Learning}, 75(2):245--248, May 2009.
+\item
+D.~Arthur and S.~Vassilvitskii.
+\newblock {\tt k-means++}: The advantages of careful seeding.
+\newblock In {\em Proceedings of the 18th Annual {ACM-SIAM} Symposium on
+  Discrete Algorithms ({SODA}~2007)}, pages 1027--1035, New Orleans~{LA},
+  {USA}, January 7--9 2007.
+\end{itemize}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/a891c0a1/Algorithms Reference/LinReg.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/LinReg.tex b/Algorithms Reference/LinReg.tex
new file mode 100644
index 0000000..ad0dab8
--- /dev/null
+++ b/Algorithms Reference/LinReg.tex	
@@ -0,0 +1,306 @@
+\subsection{Linear Regression}
+
+\noindent{\bf Description}
+\smallskip
+
+Linear Regression scripts are used to model the relationship between one numerical
+response variable and one or more explanatory (feature) variables.
+The scripts are given a dataset $(X, Y) = (x_i, y_i)_{i=1}^n$ where $x_i$ is a
+numerical vector of feature variables and $y_i$ is a numerical response value for
+each training data record.  The feature vectors are provided as a matrix $X$ of size
+$n\,{\times}\,m$, where $n$ is the number of records and $m$ is the number of features.
+The observed response values are provided as a 1-column matrix~$Y$, with a numerical
+value $y_i$ for each~$x_i$ in the corresponding row of matrix~$X$.
+
+In linear regression, we predict the distribution of the response~$y_i$ based on
+a fixed linear combination of the features in~$x_i$.  We assume that
+there exist constant regression coefficients $\beta_0, \beta_1, \ldots, \beta_m$
+and a constant residual variance~$\sigma^2$ such that
+\begin{equation}
+y_i \sim \Normal(\mu_i, \sigma^2) \,\,\,\,\textrm{where}\,\,\,\,
+\mu_i \,=\, \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m}
+\label{eqn:linregdef}
+\end{equation}
+Distribution $y_i \sim \Normal(\mu_i, \sigma^2)$ models the ``unexplained'' residual
+noise and is assumed independent across different records.
+
+The goal is to estimate the regression coefficients and the residual variance.
+Once they are accurately estimated, we can make predictions about $y_i$ given~$x_i$
+in new records.  We can also use the $\beta_j$'s to analyze the influence of individual
+features on the response value, and assess the quality of this model by comparing
+residual variance in the response, left after prediction, with its total variance.
+
+There are two scripts in our library, both doing the same estimation, but using different
+computational methods.  Depending on the size and the sparsity of the feature matrix~$X$,
+one or the other script may be more efficient.  The ``direct solve'' script
+{\tt LinearRegDS} is more efficient when the number of features $m$ is relatively small
+($m \sim 1000$ or less) and matrix~$X$ is either tall or fairly dense
+(has~${\gg}\:m^2$ nonzeros); otherwise, the ``conjugate gradient'' script {\tt LinearRegCG}
+is more efficient.  If $m > 50000$, use only {\tt LinearRegCG}.
+
+\smallskip
+\noindent{\bf Usage}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}LinearRegDS.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} Y=}path/file
+{\tt{} B=}path/file
+{\tt{} O=}path/file
+{\tt{} icpt=}int
+{\tt{} reg=}double
+{\tt{} fmt=}format
+
+}\smallskip
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}LinearRegCG.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} Y=}path/file
+{\tt{} B=}path/file
+{\tt{} O=}path/file
+{\tt{} Log=}path/file
+{\tt{} icpt=}int
+{\tt{} reg=}double
+{\tt{} tol=}double
+{\tt{} maxi=}int
+{\tt{} fmt=}format
+
+}
+
+\smallskip
+\noindent{\bf Arguments}
+\begin{Description}
+\item[{\tt X}:]
+Location (on HDFS) to read the matrix of feature vectors; each row constitutes
+one feature vector
+\item[{\tt Y}:]
+Location to read the 1-column matrix of response values
+\item[{\tt B}:]
+Location to store the estimated regression parameters (the $\beta_j$'s), with the
+intercept parameter~$\beta_0$ at position {\tt B[}$m\,{+}\,1$, {\tt 1]} if available
+\item[{\tt O}:] (default:\mbox{ }{\tt " "})
+Location to store the CSV-file of summary statistics defined in
+Table~\ref{table:linreg:stats}, the default is to print it to the standard output
+\item[{\tt Log}:] (default:\mbox{ }{\tt " "}, {\tt LinearRegCG} only)
+Location to store iteration-specific variables for monitoring and debugging purposes,
+see Table~\ref{table:linreg:log} for details.
+\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
+Intercept presence and shifting/rescaling of the features in~$X$:\\
+{\tt 0} = no intercept (hence no~$\beta_0$), no shifting or rescaling of the features;\\
+{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
+{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
+\item[{\tt reg}:] (default:\mbox{ }{\tt 0.000001})
+L2-regularization parameter~\mbox{$\lambda\geq 0$}; set to nonzero for highly dependent,
+sparse, or numerous ($m \gtrsim n/10$) features
+\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001}, {\tt LinearRegCG} only)
+Tolerance \mbox{$\eps\geq 0$} used in the convergence criterion: we terminate the conjugate
+gradient iterations when the L2-norm of the $\beta$-residual decreases by this factor
+\item[{\tt maxi}:] (default:\mbox{ }{\tt 0}, {\tt LinearRegCG} only)
+Maximum number of conjugate gradient iterations, or~0 if no maximum
+limit provided
+\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+see read/write functions in SystemML Language Reference for details.
+\end{Description}
+
+
+\begin{table}[t]\small\centerline{%
+\begin{tabular}{|ll|}
+\hline
+Name & Meaning \\
+\hline
+{\tt AVG\_TOT\_Y}          & Average of the response value $Y$ \\
+{\tt STDEV\_TOT\_Y}        & Standard Deviation of the response value $Y$ \\
+{\tt AVG\_RES\_Y}          & Average of the residual $Y - \mathop{\mathrm{pred}}(Y|X)$, i.e.\ residual bias \\
+{\tt STDEV\_RES\_Y}        & Standard Deviation of the residual $Y - \mathop{\mathrm{pred}}(Y|X)$ \\
+{\tt DISPERSION}           & GLM-style dispersion, i.e.\ residual sum of squares / \#deg.\ fr. \\
+{\tt PLAIN\_R2}            & Plain $R^2$ of residual with bias included vs.\ total average \\
+{\tt ADJUSTED\_R2}         & Adjusted $R^2$ of residual with bias included vs.\ total average \\
+{\tt PLAIN\_R2\_NOBIAS}    & Plain $R^2$ of residual with bias subtracted vs.\ total average \\
+{\tt ADJUSTED\_R2\_NOBIAS} & Adjusted $R^2$ of residual with bias subtracted vs.\ total average \\
+{\tt PLAIN\_R2\_VS\_0}     & ${}^*$Plain $R^2$ of residual with bias included vs.\ zero constant \\
+{\tt ADJUSTED\_R2\_VS\_0}  & ${}^*$Adjusted $R^2$ of residual with bias included vs.\ zero constant \\
+\hline
+\multicolumn{2}{r}{${}^{*\mathstrut}$ The last two statistics are only printed if there is no intercept ({\tt icpt=0})} \\
+\end{tabular}}
+\caption{Besides~$\beta$, linear regression scripts compute a few summary statistics
+listed above.  The statistics are provided in CSV format, one comma-separated name-value
+pair per each line.}
+\label{table:linreg:stats}
+\end{table}
+
+\begin{table}[t]\small\centerline{%
+\begin{tabular}{|ll|}
+\hline
+Name & Meaning \\
+\hline
+{\tt CG\_RESIDUAL\_NORM}  & L2-norm of conjug.\ grad.\ residual, which is $A \pxp \beta - t(X) \pxp y$ \\
+                          & where $A = t(X) \pxp X + \diag (\lambda)$, or a similar quantity \\
+{\tt CG\_RESIDUAL\_RATIO} & Ratio of current L2-norm of conjug.\ grad.\ residual over the initial \\
+\hline
+\end{tabular}}
+\caption{
+The {\tt Log} file for {\tt{}LinearRegCG} script contains the above \mbox{per-}iteration
+variables in CSV format, each line containing triple (Name, Iteration\#, Value) with
+Iteration\# being~0 for initial values.}
+\label{table:linreg:log}
+\end{table}
+
+
+\noindent{\bf Details}
+\smallskip
+
+To solve a linear regression problem over feature matrix~$X$ and response vector~$Y$,
+we can find coefficients $\beta_0, \beta_1, \ldots, \beta_m$ and $\sigma^2$ that maximize
+the joint likelihood of all $y_i$ for $i=1\ldots n$, defined by the assumed statistical
+model~(\ref{eqn:linregdef}).  Since the joint likelihood of the independent
+$y_i \sim \Normal(\mu_i, \sigma^2)$ is proportional to the product of
+$\exp\big({-}\,(y_i - \mu_i)^2 / (2\sigma^2)\big)$, we can take the logarithm of this
+product, then multiply by $-2\sigma^2 < 0$ to obtain a least squares problem:
+\begin{equation}
+\sum_{i=1}^n \, (y_i - \mu_i)^2 \,\,=\,\, 
+\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2
+\,\,\to\,\,\min
+\label{eqn:linregls}
+\end{equation}
+This may not be enough, however.  The minimum may sometimes be attained over infinitely many
+$\beta$-vectors, for example if $X$ has an all-0 column, or has linearly dependent columns,
+or has fewer rows than columns~\mbox{($n < m$)}.  Even if~(\ref{eqn:linregls}) has a unique
+solution, other $\beta$-vectors may be just a little suboptimal\footnote{Smaller likelihood
+difference between two models suggests less statistical evidence to pick one model over the
+other.}, yet give significantly different predictions for new feature vectors.  This results
+in \emph{overfitting}: prediction error for the training data ($X$ and~$Y$) is much smaller
+than for the test data (new records).
+
+Overfitting and degeneracy in the data is commonly mitigated by adding a regularization penalty
+term to the least squares function:
+\begin{equation}
+\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^m \beta_j x_{i,j}\Big)^2
+\,+\,\, \lambda \sum_{j=1}^m \beta_j^2
+\,\,\to\,\,\min
+\label{eqn:linreglsreg}
+\end{equation}
+The choice of $\lambda>0$, the regularization constant, typically involves cross-validation
+where the dataset is repeatedly split into a training part (to estimate the~$\beta_j$'s) and
+a test part (to evaluate prediction accuracy), with the goal of maximizing the test accuracy.
+In our scripts, $\lambda$~is provided as input parameter~{\tt reg}.
+
+The solution to least squares problem~(\ref{eqn:linreglsreg}), through taking the derivative
+and setting it to~0, has the matrix linear equation form
+\begin{equation}
+A\left[\textstyle\beta_{1:m}\atop\textstyle\beta_0\right] \,=\, \big[X,\,1\big]^T Y,\,\,\,
+\textrm{where}\,\,\,
+A \,=\, \big[X,\,1\big]^T \big[X,\,1\big]\,+\,\hspace{0.5pt} \diag(\hspace{0.5pt}
+\underbrace{\raisebox{0pt}[0pt][0.5pt]{$\lambda,\ldots, \lambda$}}_{\raisebox{2pt}{$\scriptstyle m$}}
+\hspace{0.5pt}, 0)
+\label{eqn:linregeq}
+\end{equation}
+where $[X,\,1]$ is $X$~with an extra column of~1s appended on the right, and the
+diagonal matrix of $\lambda$'s has a zero to keep the intercept~$\beta_0$ unregularized.
+If the intercept is disabled by setting {\tt icpt=0}, the equation is simply
+\mbox{$X^T X \beta = X^T Y$}.
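+
+For illustration, equation~(\ref{eqn:linregeq}) can be solved directly in NumPy as follows
+(a conceptual sketch with our own names, not the {\tt LinearRegDS} script, which calls an
+external solver instead):
+\begin{verbatim}
+import numpy as np
+
+def linreg_direct(X, y, lam=1e-6, icpt=True):
+    # Solve A * beta = [X,1]^T y with A = [X,1]^T [X,1] + diag(lam,...,lam,0).
+    n, m = X.shape
+    Xi = np.hstack([X, np.ones((n, 1))]) if icpt else X
+    A = Xi.T @ Xi
+    reg = lam * np.ones(A.shape[0])
+    if icpt:
+        reg[-1] = 0.0                         # leave the intercept unregularized
+    A += np.diag(reg)
+    return np.linalg.solve(A, Xi.T @ y)       # beta[:m], plus intercept last if icpt
+\end{verbatim}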
+
+We implemented two scripts for solving equation~(\ref{eqn:linregeq}): one is a ``direct solver''
+that computes $A$ and then solves $A\beta = [X,\,1]^T Y$ by calling an external package,
+the other performs linear conjugate gradient~(CG) iterations without ever materializing~$A$.
+The CG~algorithm closely follows Algorithm~5.2 in Chapter~5 of~\cite{Nocedal2006:Optimization}.
+Each step in the CG~algorithm computes a matrix-vector multiplication $q = Ap$ by first computing
+$[X,\,1]\, p$ and then $[X,\,1]^T [X,\,1]\, p$.  Usually the number of such multiplications,
+one per CG iteration, is much smaller than~$m$.  The user can put a hard bound on it with input 
+parameter~{\tt maxi}, or use the default maximum of~$m+1$ (or~$m$ if no intercept) by
+having {\tt maxi=0}.  The CG~iterations terminate when the L2-norm of vector
+$r = A\beta - [X,\,1]^T Y$ decreases from its initial value (for~$\beta=0$) by the tolerance
+factor specified in input parameter~{\tt tol}.
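+
+The CG loop itself can be sketched as follows (a conceptual NumPy sketch that omits the
+intercept column and regularizes all coefficients uniformly; not the {\tt LinearRegCG}
+script, whose names and details differ):
+\begin{verbatim}
+import numpy as np
+
+def linreg_cg(X, y, lam=1e-6, tol=1e-6, max_iter=None):
+    # Solve (X^T X + lam*I) beta = X^T y without ever materializing A.
+    m = X.shape[1]
+    max_iter = max_iter or m
+    beta = np.zeros(m)
+    r = X.T @ y                      # residual for beta = 0
+    p = r.copy()
+    norm0 = np.linalg.norm(r)
+    for _ in range(max_iter):
+        q = X.T @ (X @ p) + lam * p  # q = A p via two matrix-vector products
+        alpha = (r @ r) / (p @ q)
+        beta += alpha * p
+        r_new = r - alpha * q
+        if np.linalg.norm(r_new) < tol * norm0:
+            break                    # residual dropped by the tolerance factor
+        p = r_new + ((r_new @ r_new) / (r @ r)) * p
+        r = r_new
+    return beta
+\end{verbatim}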
+
+The CG algorithm is more efficient if computing
+$[X,\,1]^T \big([X,\,1]\, p\big)$ is much faster than materializing $A$,
+an $(m\,{+}\,1)\times(m\,{+}\,1)$ matrix.  The Direct Solver~(DS) is more efficient if
+$X$ takes up a lot more memory than $A$ (i.e.\ $X$~has a lot more nonzeros than~$m^2$)
+and if $m^2$ is small enough for the external solver ($m \lesssim 50000$).  A more precise
+determination between CG and~DS is subject to further research.
+
+In addition to the $\beta$-vector, the scripts estimate the residual standard
+deviation~$\sigma$ and the~$R^2$, the ratio of ``explained'' variance to the total
+variance of the response variable.  These statistics only make sense if the number
+of degrees of freedom $n\,{-}\,m\,{-}\,1$ is positive and the regularization constant
+$\lambda$ is negligible or zero.  The formulas for $\sigma$ and $R^2$~are:
+\begin{equation*}
+R^2_{\textrm{plain}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},\quad
+\sigma \,=\, \sqrt{\frac{\mathrm{RSS}}{n - m - 1}},\quad
+R^2_{\textrm{adj.}} = 1 - \frac{\sigma^2 (n-1)}{\mathrm{TSS}}
+\end{equation*}
+where
+\begin{equation*}
+\mathrm{RSS} \,=\, \sum_{i=1}^n \Big(y_i - \hat{\mu}_i - 
+\frac{1}{n} \sum_{i'=1}^n \,(y_{i'} - \hat{\mu}_{i'})\Big)^2; \quad
+\mathrm{TSS} \,=\, \sum_{i=1}^n \Big(y_i - \frac{1}{n} \sum_{i'=1}^n y_{i'}\Big)^2
+\end{equation*}
+Here $\hat{\mu}_i$ are the predicted means for $y_i$ based on the estimated
+regression coefficients and the feature vectors.  They may be biased when no
+intercept is present, hence the RSS formula subtracts the bias.
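+
+In NumPy notation, these statistics amount to the following (a conceptual sketch with our
+own names, where {\tt y\_hat} holds the predicted means $\hat{\mu}_i$):
+\begin{verbatim}
+import numpy as np
+
+def fit_statistics(y, y_hat, m):
+    # sigma, plain R^2 and adjusted R^2 from bias-adjusted residuals.
+    n = len(y)
+    res = y - y_hat
+    rss = np.sum((res - res.mean()) ** 2)   # subtract the residual bias, as above
+    tss = np.sum((y - y.mean()) ** 2)
+    sigma = np.sqrt(rss / (n - m - 1))
+    r2_plain = 1.0 - rss / tss
+    r2_adj = 1.0 - sigma ** 2 * (n - 1) / tss
+    return sigma, r2_plain, r2_adj
+\end{verbatim}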
+
+Lastly, note that by choosing the input option {\tt icpt=2} the user can shift
+and rescale the columns of~$X$ to have zero mean and unit variance.
+This is particularly important when using regularization over highly imbalanced
+features, because regularization tends to penalize small-variance columns (which
+need large~$\beta_j$'s) more than large-variance columns (with small~$\beta_j$'s).
+At the end, the estimated regression coefficients are shifted and rescaled to
+apply to the original features.
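+
+Conceptually, mapping the coefficients fitted on standardized features back to the original
+features amounts to the following (a sketch with our own names; {\tt mu} and {\tt s} are the
+column means and standard deviations of~$X$):
+\begin{verbatim}
+import numpy as np
+
+def destandardize(beta_std, mu, s):
+    # beta_std[:-1] are feature coefficients, beta_std[-1] is the intercept,
+    # all fitted on features standardized as (x - mu) / s.
+    b = beta_std[:-1] / s
+    b0 = beta_std[-1] - np.sum(beta_std[:-1] * mu / s)
+    return np.append(b, b0)          # coefficients for the original features
+\end{verbatim}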
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+The estimated regression coefficients (the $\hat{\beta}_j$'s) are populated into
+a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}''
+input argument.  What this matrix contains, and its size, depends on the input
+argument {\tt icpt}, which specifies the user's intercept and rescaling choice:
+\begin{Description}
+\item[{\tt icpt=0}:] No intercept, matrix~$B$ has size $m\,{\times}\,1$, with
+$B[j, 1] = \hat{\beta}_j$ for each $j$ from 1 to~$m = {}$ncol$(X)$.
+\item[{\tt icpt=1}:] There is intercept, but no shifting/rescaling of~$X$; matrix~$B$
+has size $(m\,{+}\,1) \times 1$, with $B[j, 1] = \hat{\beta}_j$ for $j$ from 1 to~$m$,
+and $B[m\,{+}\,1, 1] = \hat{\beta}_0$, the estimated intercept coefficient.
+\item[{\tt icpt=2}:] There is intercept, and the features in~$X$ are shifted to
+mean${} = 0$ and rescaled to variance${} = 1$; then there are two versions of
+the~$\hat{\beta}_j$'s, one for the original features and another for the
+shifted/rescaled features.  Now matrix~$B$ has size $(m\,{+}\,1) \times 2$, with
+$B[\cdot, 1]$ for the original features and $B[\cdot, 2]$ for the shifted/rescaled
+features, in the above format.  Note that $B[\cdot, 2]$ are iteratively estimated
+and $B[\cdot, 1]$ are obtained from $B[\cdot, 2]$ by complementary shifting and
+rescaling.
+\end{Description}
+The estimated summary statistics, including residual standard deviation~$\sigma$ and
+the~$R^2$, are printed out or sent into a file (if specified) in CSV format as
+defined in Table~\ref{table:linreg:stats}.  For conjugate gradient iterations,
+a log file with monitoring variables can also be made available, see
+Table~\ref{table:linreg:log}.
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+\hml -f LinearRegCG.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
+  B=/user/biadmin/B.mtx fmt=csv O=/user/biadmin/stats.csv
+  icpt=2 reg=1.0 tol=0.00000001 maxi=100 Log=/user/biadmin/log.csv
+
+}
+{\hangindent=\parindent\noindent\tt
+\hml -f LinearRegDS.dml -nvargs X=/user/biadmin/X.mtx Y=/user/biadmin/Y.mtx
+  B=/user/biadmin/B.mtx fmt=csv O=/user/biadmin/stats.csv
+  icpt=2 reg=1.0
+
+}
+
+% \smallskip
+% \noindent{\bf See Also}
+% \smallskip
+% 
+% In case of binary classification problems, please consider using L2-SVM or
+% binary logistic regression; for multiclass classification, use multiclass~SVM
+% or multinomial logistic regression.  For more complex distributions of the
+% response variable use the Generalized Linear Models script.

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/a891c0a1/Algorithms Reference/LogReg.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/LogReg.tex b/Algorithms Reference/LogReg.tex
new file mode 100644
index 0000000..6693a01
--- /dev/null
+++ b/Algorithms Reference/LogReg.tex	
@@ -0,0 +1,266 @@
+\subsection{Multinomial Logistic Regression}
+
+\noindent{\bf Description}
+\smallskip
+
+Our logistic regression script performs both binomial and multinomial logistic regression.
+The script is given a dataset $(X, Y)$ where matrix $X$ has $m$~columns and matrix $Y$ has
+one column; both $X$ and~$Y$ have $n$~rows.  The rows of $X$ and~$Y$ are viewed as a collection
+of records: $(X, Y) = (x_i, y_i)_{i=1}^n$ where $x_i$ is a numerical vector of explanatory
+(feature) variables and $y_i$ is a categorical response variable.
+Each row~$x_i$ in~$X$ has size~\mbox{$\dim x_i = m$}, while its corresponding $y_i$ is an
+integer that represents the observed response value for record~$i$.
+
+The goal of logistic regression is to learn a linear model over the feature vector
+$x_i$ that can be used to predict how likely each categorical label is to be
+observed as the actual~$y_i$.
+Note that logistic regression predicts more than a label: it predicts the probability
+for every possible label.  The binomial case allows only two possible labels; the
+multinomial case has no such restriction.
+
+Just as linear regression estimates the mean value $\mu_i$ of a numerical response
+variable, logistic regression does the same for category label probabilities.
+In linear regression, the mean of $y_i$ is estimated as a linear combination of the features:
+$\mu_i = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_m x_{i,m} = \beta_0 + x_i\beta_{1:m}$.
+In logistic regression, the
+label probability has to lie between 0 and~1, so a link function is applied to connect
+it to $\beta_0 + x_i\beta_{1:m}$.  If there are just two possible category labels, for example
+0~and~1, the logistic link looks as follows:
+\begin{equation*}
+\Prob[y_i\,{=}\,1\mid x_i; \beta] \,=\, 
+\frac{e^{\,\beta_0 + x_i\beta_{1:m}}}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}};
+\quad
+\Prob[y_i\,{=}\,0\mid x_i; \beta] \,=\, 
+\frac{1}{1 + e^{\,\beta_0 + x_i\beta_{1:m}}}
+\end{equation*}
+Here category label~0 serves as the \emph{baseline}, and function
+$\exp(\beta_0 + x_i\beta_{1:m})$
+shows how likely we expect to see ``$y_i = 1$'' in comparison to the baseline.
+As with a loaded coin, the predicted odds of seeing 1~versus~0 are
+$\exp(\beta_0 + x_i\beta_{1:m})$ to~1,
+with each feature $x_{i,j}$ contributing its own factor $\exp(\beta_j x_{i,j})$ to the odds.
+Given a large collection of pairs $(x_i, y_i)$, $i=1\ldots n$, logistic regression seeks
+to find the $\beta_j$'s that maximize the product of probabilities
+\hbox{$\Prob[y_i\mid x_i; \beta]$}
+for actually observed $y_i$-labels (assuming no regularization).
+
+Multinomial logistic regression~\cite{Agresti2002:CDA} extends this link to $k \geq 3$ possible
+categories.  Again we identify one category as the baseline, for example the $k$-th category.
+Instead of a coin, here we have a loaded multisided die, one side per category.  Each non-baseline
+category $l = 1\ldots k\,{-}\,1$ has its own vector $(\beta_{0,l}, \beta_{1,l}, \ldots, \beta_{m,l})$
+of regression parameters with the intercept, making up a matrix $B$ of size
+$(m\,{+}\,1)\times(k\,{-}\,1)$.  The predicted odds of seeing non-baseline category~$l$ versus
+the baseline~$k$ are $\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)$
+to~1, and the predicted probabilities are:
+\begin{align}
+l < k:\quad\Prob[y_i\,{=}\,\makebox[0.5em][c]{$l$}\mid x_i; B] \,\,\,{=}\,\,\,&
+\frac{\exp\big(\beta_{0,l} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l}\big)}%
+{1 \,+\, \sum_{l'=1}^{k-1}\exp\big(\beta_{0,l'} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l'}\big)};
+\label{eqn:mlogreg:nonbaseprob}\\
+\Prob[y_i\,{=}\,\makebox[0.5em][c]{$k$}\mid x_i; B] \,\,\,{=}\,\,\,& \frac{1}%
+{1 \,+\, \sum_{l'=1}^{k-1}\exp\big(\beta_{0,l'} + \sum\nolimits_{j=1}^m x_{i,j}\beta_{j,l'}\big)}.
+\label{eqn:mlogreg:baseprob}
+\end{align}
+The goal of the regression is to estimate the parameter matrix~$B$ from the provided dataset
+$(X, Y) = (x_i, y_i)_{i=1}^n$ by maximizing the product of \hbox{$\Prob[y_i\mid x_i; B]$}
+over the observed labels~$y_i$.  Taking its logarithm, negating, and adding a regularization term
+gives us a minimization objective:
+\begin{equation}
+f(B; X, Y) \,\,=\,\,
+-\sum_{i=1}^n \,\log \Prob[y_i\mid x_i; B] \,+\,
+\frac{\lambda}{2} \sum_{j=1}^m \sum_{l=1}^{k-1} |\beta_{j,l}|^2
+\,\,\to\,\,\min
+\label{eqn:mlogreg:loss}
+\end{equation}
+The optional regularization term is added to mitigate overfitting and degeneracy in the data;
+to reduce bias, the intercepts $\beta_{0,l}$ are not regularized.  Once the~$\beta_{j,l}$'s
+are accurately estimated, we can make predictions about the category label~$y$ for a new
+feature vector~$x$ using Eqs.~(\ref{eqn:mlogreg:nonbaseprob}) and~(\ref{eqn:mlogreg:baseprob}).
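+
+For illustration, the predicted probabilities in Eqs.~(\ref{eqn:mlogreg:nonbaseprob})
+and~(\ref{eqn:mlogreg:baseprob}) can be computed as follows (a conceptual NumPy sketch,
+not the DML script; the names are ours):
+\begin{verbatim}
+import numpy as np
+
+def predicted_probabilities(X, B):
+    # B is (m+1) x (k-1): one column per non-baseline category, intercepts in
+    # the last row; the baseline category corresponds to an all-zero column.
+    n = X.shape[0]
+    linear = np.hstack([X, np.ones((n, 1))]) @ B
+    linear = np.hstack([linear, np.zeros((n, 1))])           # append the baseline
+    e = np.exp(linear - linear.max(axis=1, keepdims=True))   # stabilized, same result
+    return e / e.sum(axis=1, keepdims=True)                  # n x k probabilities
+\end{verbatim}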
+
+\smallskip
+\noindent{\bf Usage}
+\smallskip
+
+{\hangindent=\parindent\noindent\it%
+{\tt{}-f }path/\/{\tt{}MultiLogReg.dml}
+{\tt{} -nvargs}
+{\tt{} X=}path/file
+{\tt{} Y=}path/file
+{\tt{} B=}path/file
+{\tt{} Log=}path/file
+{\tt{} icpt=}int
+{\tt{} reg=}double
+{\tt{} tol=}double
+{\tt{} moi=}int
+{\tt{} mii=}int
+{\tt{} fmt=}format
+
+}
+
+
+\smallskip
+\noindent{\bf Arguments}
+\begin{Description}
+\item[{\tt X}:]
+Location (on HDFS) to read the input matrix of feature vectors; each row constitutes
+one feature vector.
+\item[{\tt Y}:]
+Location to read the input one-column matrix of category labels that correspond to
+feature vectors in~{\tt X}.  Note the following:\\
+-- Each non-baseline category label must be a positive integer.\\
+-- If all labels are positive, the largest represents the baseline category.\\
+-- If non-positive labels such as $-1$ or~$0$ are present, then they represent the (same)
+baseline category and are converted to label $\max(\texttt{Y})\,{+}\,1$.
+\item[{\tt B}:]
+Location to store the matrix of estimated regression parameters (the $\beta_{j, l}$'s),
+with the intercept parameters~$\beta_{0, l}$ at position {\tt B[}$m\,{+}\,1$, $l${\tt ]}
+if available.  The size of {\tt B} is $(m\,{+}\,1)\times (k\,{-}\,1)$ with the intercepts
+or $m \times (k\,{-}\,1)$ without the intercepts, one column per non-baseline category
+and one row per feature.
+\item[{\tt Log}:] (default:\mbox{ }{\tt " "})
+Location to store iteration-specific variables for monitoring and debugging purposes,
+see Table~\ref{table:mlogreg:log} for details.
+\item[{\tt icpt}:] (default:\mbox{ }{\tt 0})
+Intercept and shifting/rescaling of the features in~$X$:\\
+{\tt 0} = no intercept (hence no~$\beta_0$), no shifting/rescaling of the features;\\
+{\tt 1} = add intercept, but do not shift/rescale the features in~$X$;\\
+{\tt 2} = add intercept, shift/rescale the features in~$X$ to mean~0, variance~1
+\item[{\tt reg}:] (default:\mbox{ }{\tt 0.0})
+L2-regularization parameter (lambda)
+\item[{\tt tol}:] (default:\mbox{ }{\tt 0.000001})
+Tolerance (epsilon) used in the convergence criterion
+\item[{\tt moi}:] (default:\mbox{ }{\tt 100})
+Maximum number of outer (Fisher scoring) iterations
+\item[{\tt mii}:] (default:\mbox{ }{\tt 0})
+Maximum number of inner (conjugate gradient) iterations, or~0 if no maximum
+limit provided
+\item[{\tt fmt}:] (default:\mbox{ }{\tt "text"})
+Matrix file output format, such as {\tt text}, {\tt mm}, or {\tt csv};
+see read/write functions in SystemML Language Reference for details.
+\end{Description}
+
+
+\begin{table}[t]\small\centerline{%
+\begin{tabular}{|ll|}
+\hline
+Name & Meaning \\
+\hline
+{\tt LINEAR\_TERM\_MIN}  & The minimum value of $X \pxp B$, used to check for overflows \\
+{\tt LINEAR\_TERM\_MAX}  & The maximum value of $X \pxp B$, used to check for overflows \\
+{\tt NUM\_CG\_ITERS}     & Number of inner (Conj.\ Gradient) iterations in this outer iteration \\
+{\tt IS\_TRUST\_REACHED} & $1 = {}$trust region boundary was reached, $0 = {}$otherwise \\
+{\tt POINT\_STEP\_NORM}  & L2-norm of iteration step from old point (matrix $B$) to new point \\
+{\tt OBJECTIVE}          & The loss function we minimize (negative regularized log-likelihood) \\
+{\tt OBJ\_DROP\_REAL}    & Reduction in the objective during this iteration, actual value \\
+{\tt OBJ\_DROP\_PRED}    & Reduction in the objective predicted by a quadratic approximation \\
+{\tt OBJ\_DROP\_RATIO}   & Actual-to-predicted reduction ratio, used to update the trust region \\
+{\tt IS\_POINT\_UPDATED} & $1 = {}$new point accepted; $0 = {}$new point rejected, old point restored \\
+{\tt GRADIENT\_NORM}     & L2-norm of the loss function gradient (omitted if point is rejected) \\
+{\tt TRUST\_DELTA}       & Updated trust region size, the ``delta'' \\
+\hline
+\end{tabular}}
+\caption{
+The {\tt Log} file for multinomial logistic regression contains the above \mbox{per-}iteration
+variables in CSV format, each line containing triple (Name, Iteration\#, Value) with Iteration\#
+being~0 for initial values.}
+\label{table:mlogreg:log}
+\end{table}
+
+
+\noindent{\bf Details}
+\smallskip
+
+We estimate the logistic regression parameters via L2-regularized negative
+log-likelihood minimization~(\ref{eqn:mlogreg:loss}).
+The optimization method used in the script closely follows the trust region
+Newton method for logistic regression described in~\cite{Lin2008:logistic}.
+For convenience, let us make some changes in notation:
+\begin{Itemize}
+\item Convert the input vector of observed category labels into an indicator matrix $Y$
+of size $n \times k$ such that $Y_{i, l} = 1$ if the $i$-th category label is~$l$ and
+$Y_{i, l} = 0$ otherwise;
+\item Append an extra column of all ones, i.e.\ $(1, 1, \ldots, 1)^T$, as the
+$m\,{+}\,1$-st column to the feature matrix $X$ to represent the intercept;
+\item Append an all-zero column as the $k$-th column to $B$, the matrix of regression
+parameters, to represent the baseline category;
+\item Convert the regularization constant $\lambda$ into matrix $\Lambda$ of the same
+size as $B$, placing 0's into the $m\,{+}\,1$-st row to disable intercept regularization,
+and placing $\lambda$'s everywhere else.
+\end{Itemize}
+Now the ($n\,{\times}\,k$)-matrix of predicted probabilities given
+by (\ref{eqn:mlogreg:nonbaseprob}) and~(\ref{eqn:mlogreg:baseprob})
+and the objective function $f$ in~(\ref{eqn:mlogreg:loss}) have the matrix form
+\begin{align*}
+P \,\,&=\,\, \exp(XB) \,\,/\,\, \big(\exp(XB)\,1_{k\times k}\big)\\
+f \,\,&=\,\, - \,\,{\textstyle\sum} \,\,Y \cdot (X B)\, + \,
+{\textstyle\sum}\,\log\big(\exp(XB)\,1_{k\times 1}\big) \,+ \,
+(1/2)\,\, {\textstyle\sum} \,\,\Lambda \cdot B \cdot B
+\end{align*}
+where operations $\cdot\,$, $/$, $\exp$, and $\log$ are applied cellwise,
+and $\textstyle\sum$ denotes the sum of all cells in a matrix.
+The gradient of~$f$ with respect to~$B$ can be represented as a matrix too:
+\begin{equation*}
+\nabla f \,\,=\,\, X^T (P - Y) \,+\, \Lambda \cdot B
+\end{equation*}
+The Hessian $\mathcal{H}$ of~$f$ is a tensor, but, fortunately, the conjugate
+gradient inner loop of the trust region algorithm in~\cite{Lin2008:logistic}
+does not need to instantiate it.  We only need to multiply $\mathcal{H}$ by
+ordinary matrices of the same size as $B$ and $\nabla f$, and this can be done
+in matrix form:
+\begin{equation*}
+\mathcal{H}V \,\,=\,\, X^T \big( Q \,-\, P \cdot (Q\,1_{k\times k}) \big) \,+\,
+\Lambda \cdot V, \,\,\,\,\textrm{where}\,\,\,\,Q \,=\, P \cdot (XV)
+\end{equation*}
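+In NumPy notation, these quantities can be computed as follows (a conceptual sketch of the
+matrix formulas above, with $X$, $Y$, $B$, $\Lambda$, and $V$ already in the modified notation;
+the names are ours):
+\begin{verbatim}
+import numpy as np
+
+def f_grad_hessvec(X, Y, B, Lam, V):
+    # X has the intercept column appended; B, Lam, V include the all-zero
+    # baseline column, and Lam has zeros in the intercept row.
+    E = np.exp(X @ B)
+    P = E / E.sum(axis=1, keepdims=True)                 # predicted probabilities
+    f = -np.sum(Y * (X @ B)) + np.sum(np.log(E.sum(axis=1))) \
+        + 0.5 * np.sum(Lam * B * B)
+    grad = X.T @ (P - Y) + Lam * B
+    Q = P * (X @ V)
+    HV = X.T @ (Q - P * Q.sum(axis=1, keepdims=True)) + Lam * V
+    return f, grad, HV
+\end{verbatim}
+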
+At each Newton iteration (the \emph{outer} iteration) the minimization algorithm
+approximates the difference $\varDelta f(S; B) = f(B + S; X, Y) \,-\, f(B; X, Y)$
+attained in the objective function after a step $B \mapsto B\,{+}\,S$ by a
+second-degree formula
+\begin{equation*}
+\varDelta f(S; B) \,\,\,\approx\,\,\, (1/2)\,\,{\textstyle\sum}\,\,S \cdot \mathcal{H}S
+ \,+\, {\textstyle\sum}\,\,S\cdot \nabla f
+\end{equation*}
+This approximation is then minimized by trust-region conjugate gradient iterations
+(the \emph{inner} iterations) subject to the constraint $\|S\|_2 \leq \delta$.
+The trust region size $\delta$ is initialized as $0.5\sqrt{m}\,/ \max\nolimits_i \|x_i\|_2$
+and updated as described in~\cite{Lin2008:logistic}.
+Users can specify the maximum number of the outer and the inner iterations with
+input parameters {\tt moi} and {\tt mii}, respectively.  The iterative minimizer
+terminates successfully if $\|\nabla f\|_2 < \eps\,\|\nabla f_{B=0}\|_2$,
+where $\eps > 0$ is a tolerance supplied by the user via input parameter~{\tt tol}.
+
+\smallskip
+\noindent{\bf Returns}
+\smallskip
+
+The estimated regression parameters (the $\hat{\beta}_{j, l}$) are populated into
+a matrix and written to an HDFS file whose path/name was provided as the ``{\tt B}''
+input argument.  Only the non-baseline categories ($1\leq l \leq k\,{-}\,1$) have
+their $\hat{\beta}_{j, l}$ in the output; to add the baseline category, just append
+a column of zeros.  If {\tt icpt=0} in the input command line, no intercepts are used
+and {\tt B} has size $m\times (k\,{-}\,1)$; otherwise {\tt B} has size 
+$(m\,{+}\,1)\times (k\,{-}\,1)$
+and the intercepts are in the $m\,{+}\,1$-st row.  If {\tt icpt=2}, then initially
+the feature columns in~$X$ are shifted to mean${} = 0$ and rescaled to variance${} = 1$.
+After the iterations converge, the $\hat{\beta}_{j, l}$'s are rescaled and shifted
+to work with the original features.
+
+
+\smallskip
+\noindent{\bf Examples}
+\smallskip
+
+{\hangindent=\parindent\noindent\tt
+\hml -f MultiLogReg.dml -nvargs X=/user/biadmin/X.mtx 
+  Y=/user/biadmin/Y.mtx B=/user/biadmin/B.mtx fmt=csv
+  icpt=2 reg=1.0 tol=0.0001 moi=100 mii=10 Log=/user/biadmin/log.csv
+
+}
+
+
+\smallskip
+\noindent{\bf References}
+\begin{itemize}
+\item A.~Agresti.
+\newblock {\em Categorical Data Analysis}.
+\newblock Wiley Series in Probability and Statistics. Wiley-Interscience,  second edition, 2002.
+\end{itemize}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/a891c0a1/Algorithms Reference/MultiSVM.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/MultiSVM.tex b/Algorithms Reference/MultiSVM.tex
new file mode 100644
index 0000000..7c844f7
--- /dev/null
+++ b/Algorithms Reference/MultiSVM.tex	
@@ -0,0 +1,153 @@
+\subsubsection{Multi-class Support Vector Machines}
+\label{msvm}
+
+\noindent{\bf Description}
+
+Support Vector Machines are used to model the relationship between a categorical 
+dependent variable y and one or more explanatory variables denoted X. This 
+implementation supports dependent variables that have a domain size greater than or
+equal to 2 and hence is not restricted to binary class labels.
+\\
+
+\noindent{\bf Usage}
+
+\begin{tabbing}
+\texttt{-f} \textit{path}/\texttt{m-svm.dml -nvargs}
+\=\texttt{X=}\textit{path}/\textit{file} 
+  \texttt{Y=}\textit{path}/\textit{file}
+  \texttt{icpt=}\textit{int}\\
+\>\texttt{tol=}\textit{double} 
+  \texttt{reg=}\textit{double}
+  \texttt{maxiter=}\textit{int} 
+  \texttt{model=}\textit{path}/\textit{file}\\
+\>\texttt{Log=}\textit{path}/\textit{file}
+  \texttt{fmt=}\textit{csv}$\vert$\textit{text}
+\end{tabbing}
+
+\begin{tabbing}
+\texttt{-f} \textit{path}/\texttt{m-svm-predict.dml -nvargs}
+\=\texttt{X=}\textit{path}/\textit{file} 
+  \texttt{Y=}\textit{path}/\textit{file}
+  \texttt{icpt=}\textit{int}
+  \texttt{model=}\textit{path}/\textit{file}\\
+\>\texttt{scores=}\textit{path}/\textit{file}
+  \texttt{accuracy=}\textit{path}/\textit{file}\\
+\>\texttt{confusion=}\textit{path}/\textit{file}
+  \texttt{fmt=}\textit{csv}$\vert$\textit{text}
+\end{tabbing}
+
+\noindent{\bf Arguments}
+
+\begin{itemize}
+\item X: Location (on HDFS) containing the explanatory variables 
+in a matrix. Each row constitutes an example.
+\item Y: Location (on HDFS) containing a 1-column matrix specifying 
+the categorical dependent variable (label). Labels are assumed to be 
+contiguously numbered from 1 $\ldots$ \#classes.  Note that this
+argument is optional for prediction.
+\item icpt (default: {\tt 0}): If set to 1 then a constant bias column
+is added to X.
+\item tol (default: {\tt 0.001}): Procedure terminates early if the reduction
+in objective function value is less than tolerance times the initial objective
+function value.
+\item reg (default: {\tt 1}): Regularization constant. See details to find 
+out where lambda appears in the objective function. If one were interested 
+in drawing an analogy with C-SVM, then C = 2/lambda. Usually, cross validation 
+is employed to determine the optimum value of lambda.
+\item maxiter (default: {\tt 100}): The maximum number of iterations.
+\item model: Location (on HDFS) that contains the learnt weights.
+\item Log: Location (on HDFS) to collect various metrics (e.g., objective 
+function value etc.) that depict progress across iterations while training.
+\item fmt (default: {\tt text}): Specifies the output format. Choice of 
+comma-separated values (csv) or sparse matrix (text).
+\item scores: Location (on HDFS) to store scores for a held-out test set.
+Note that this is an optional argument.
+\item accuracy: Location (on HDFS) to store the accuracy computed on a
+held-out test set. Note that this is an optional argument.
+\item confusion: Location (on HDFS) to store the confusion matrix
+computed using a held-out test set. Note that this is an optional
+argument.
+\end{itemize}
+
+\noindent{\bf Details}
+
+Support vector machines learn a classification function by solving the
+following optimization problem ($L_2$-SVM):
+\begin{eqnarray*}
+&\textrm{argmin}_w& \frac{\lambda}{2} ||w||_2^2 + \sum_i \xi_i^2\\
+&\textrm{subject to:}& y_i w^{\top} x_i \geq 1 - \xi_i ~ \forall i
+\end{eqnarray*}
+where $x_i$ is an example from the training set with its label given by $y_i$, 
+$w$ is the vector of parameters and $\lambda$ is the regularization constant 
+specified by the user.
+
+To extend the above formulation (binary class SVM) to the multiclass setting,
+one standard approach is to learn one binary class SVM per class that
+separates data belonging to that class from the rest of the training data
+(one-against-the-rest SVM; see B. Scholkopf et al., 1995).
+
+To account for the missing bias term, one may augment the data with a column
+of constants, which is achieved by setting the icpt argument to 1 (C-J Hsieh
+et al., 2008).
+
+This implementation optimizes the primal directly (Chapelle, 2007). It uses 
+nonlinear conjugate gradient descent to minimize the objective function,
+choosing step sizes by performing one-dimensional Newton minimization
+in the direction of the gradient.
+\\
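+
+As an illustration, the objective and gradient for one such binary subproblem can be
+sketched in Python/NumPy as follows (a conceptual sketch, not the m-svm.dml script;
+the names are ours and the labels are encoded as +1/-1):
+
+\begin{verbatim}
+import numpy as np
+
+def l2svm_objective_grad(w, X, y, lam):
+    # Primal L2-SVM objective and gradient for one one-against-the-rest
+    # binary problem with labels y in {+1, -1}.
+    margins = 1.0 - y * (X @ w)
+    active = margins > 0                     # examples violating the margin
+    obj = 0.5 * lam * (w @ w) + np.sum(margins[active] ** 2)
+    grad = lam * w - 2.0 * (X[active].T @ (y[active] * margins[active]))
+    return obj, grad
+\end{verbatim}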
+
+\noindent{\bf Returns}
+
+The learnt weights produced by m-svm.dml are populated into a matrix that 
+has as many columns as there are classes in the training data, and written
+to the file provided on HDFS (see model in section Arguments). The number of rows
+in this matrix is ncol(X) if intercept was set to 0 during invocation and ncol(X) + 1
+otherwise. The bias terms, if used, are placed in the last row. Depending on what
+arguments are provided during invocation, m-svm-predict.dml may compute one or more
+of scores, accuracy and confusion matrix in the output format specified.
+\\
+
+%%\noindent{\bf See Also}
+%%
+%%In case of binary classification problems, please consider using a binary class classifier
+%%learning algorithm, e.g., binary class $L_2$-SVM (see Section \ref{l2svm}) or logistic regression
+%%(see Section \ref{logreg}). To model the relationship between a scalar dependent variable 
+%%y and one or more explanatory variables X, consider Linear Regression instead (see Section 
+%%\ref{linreg-solver} or Section \ref{linreg-iterative}).
+%%\\
+%%
+\noindent{\bf Examples}
+\begin{verbatim}
+hadoop jar SystemML.jar -f m-svm.dml -nvargs X=/user/biadmin/X.mtx 
+                                             Y=/user/biadmin/y.mtx 
+                                             icpt=0 tol=0.001
+                                             reg=1.0 maxiter=100 fmt=csv 
+                                             model=/user/biadmin/weights.csv
+                                             Log=/user/biadmin/Log.csv
+\end{verbatim}
+
+\begin{verbatim}
+hadoop jar SystemML.jar -f m-svm-predict.dml -nvargs X=/user/biadmin/X.mtx 
+                                                     Y=/user/biadmin/y.mtx 
+                                                     icpt=0 fmt=csv
+                                                     model=/user/biadmin/weights.csv
+                                                     scores=/user/biadmin/scores.csv
+                                                     accuracy=/user/biadmin/accuracy.csv
+                                                     confusion=/user/biadmin/confusion.csv
+\end{verbatim}
+
+\noindent{\bf References}
+
+\begin{itemize}
+\item W. T. Vetterling and B. P. Flannery. \newblock{\em Conjugate Gradient Methods in Multidimensions}, in
+{\em Numerical Recipes in C: The Art of Scientific Computing.} \newblock W. H. Press and S. A. Teukolsky
+(eds.), Cambridge University Press, 1992.
+\item J. Nocedal and  S. J. Wright. \newblock{\em Numerical Optimization.} \newblock Springer-Verlag, 1999.
+\item C-J Hsieh, K-W Chang, C-J Lin, S. S. Keerthi and S. Sundararajan. \newblock {\em A Dual Coordinate 
+Descent Method for Large-scale Linear SVM.} \newblock International Conference of Machine Learning
+(ICML), 2008.
+\item Olivier Chapelle. \newblock{\em Training a Support Vector Machine in the Primal.} \newblock Neural 
+Computation, 2007.
+\item B. Scholkopf, C. Burges and V. Vapnik. \newblock{\em Extracting Support Data for a Given Task.} \newblock International Conference on Knowledge Discovery and Data Mining (KDD), 1995.
+\end{itemize}
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/a891c0a1/Algorithms Reference/NaiveBayes.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/NaiveBayes.tex b/Algorithms Reference/NaiveBayes.tex
new file mode 100644
index 0000000..0f99c46
--- /dev/null
+++ b/Algorithms Reference/NaiveBayes.tex	
@@ -0,0 +1,134 @@
+\subsection{Naive Bayes}
+\label{naive_bayes}
+
+\noindent{\bf Description}
+
+Naive Bayes is a very simple generative model used for classifying data.
+This implementation learns a multinomial naive Bayes classifier which
+is applicable when all features are counts of categorical values.
+\\
+
+\noindent{\bf Usage}
+
+\begin{tabbing}
+\texttt{-f} \textit{path}/\texttt{naive-bayes.dml -nvargs} 
+\=\texttt{X=}\textit{path}/\textit{file} 
+  \texttt{Y=}\textit{path}/\textit{file} 
+  \texttt{laplace=}\textit{double}\\
+\>\texttt{prior=}\textit{path}/\textit{file}
+  \texttt{conditionals=}\textit{path}/\textit{file}\\
+\>\texttt{accuracy=}\textit{path}/\textit{file}
+  \texttt{fmt=}\textit{csv}$\vert$\textit{text}
+\end{tabbing}
+
+\begin{tabbing}
+\texttt{-f} \textit{path}/\texttt{naive-bayes-predict.dml -nvargs} 
+\=\texttt{X=}\textit{path}/\textit{file} 
+  \texttt{Y=}\textit{path}/\textit{file} 
+  \texttt{prior=}\textit{path}/\textit{file}\\
+\>\texttt{conditionals=}\textit{path}/\textit{file}
+  \texttt{fmt=}\textit{csv}$\vert$\textit{text}\\
+\>\texttt{accuracy=}\textit{path}/\textit{file}
+  \texttt{confusion=}\textit{path}/\textit{file}\\
+\>\texttt{probabilities=}\textit{path}/\textit{file}
+\end{tabbing}
+
+\noindent{\bf Arguments}
+
+\begin{itemize}
+\item X: Location (on HDFS) to read the matrix of feature vectors; 
+each row constitutes one feature vector.
+\item Y: Location (on HDFS) to read the one-column matrix of (categorical) 
+labels that correspond to feature vectors in X. Classes are assumed to be
+contiguously labeled beginning from 1. Note that this argument is optional
+for prediction.
+\item laplace (default: {\tt 1}): Laplace (additive) smoothing constant specified by the 
+user to avoid zero probabilities.
+\item prior: Location (on HDFS) that contains the class prior probabilities.
+\item conditionals: Location (on HDFS) that contains the class conditional
+feature distributions.
+\item fmt (default: {\tt text}): Specifies the output format. Choice of 
+comma-separated values (csv) or sparse matrix (text).
+\item probabilities: Location (on HDFS) to store class membership probabilities
+for a held-out test set. Note that this is an optional argument.
+\item accuracy: Location (on HDFS) to store the training accuracy during
+learning and the testing accuracy on a held-out test set during prediction. 
+Note that this is an optional argument for prediction.
+\item confusion: Location (on HDFS) to store the confusion matrix
+computed using a held-out test set. Note that this is an optional 
+argument.
+\end{itemize}
+
+\noindent{\bf Details}
+
+Naive Bayes is a very simple generative classification model. It posits that 
+given the class label, features can be generated independently of each other.
+More precisely, the (multinomial) naive Bayes model uses the following 
+equation to estimate the joint probability of a feature vector $x$ belonging 
+to class $y$:
+\begin{equation*}
+\text{Prob}(y, x) = \pi_y \prod_{i \in x} \theta_{iy}^{n(i,x)}
+\end{equation*}
+where $\pi_y$ denotes the prior probability of class $y$, $i$ denotes a feature
+present in $x$ with $n(i,x)$ denoting its count and $\theta_{iy}$ denotes the 
+class conditional probability of feature $i$ in class $y$. The usual 
+constraints hold on $\pi$ and $\theta$:
+\begin{eqnarray*}
+&& \pi_y \geq 0, ~ \sum_{y \in \mathcal{C}} \pi_y = 1\\
+\forall y \in \mathcal{C}: && \theta_{iy} \geq 0, ~ \sum_i \theta_{iy} = 1
+\end{eqnarray*}
+where $\mathcal{C}$ is the set of classes.
+
+Given a fully labeled training dataset, a naive Bayes model can be learned 
+using simple counting (group-by aggregates). When computing the class conditional
+probabilities, it is usually advisable to avoid setting any $\theta_{iy}$ to 0. One way to 
+achieve this is additive smoothing, also known as Laplace smoothing. Some authors argue
+that add-one smoothing is the appropriate choice. This implementation uses add-one smoothing
+by default but lets the user specify a different smoothing constant if required.
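+
+As a purely illustrative sketch of this counting-and-smoothing step (not the actual
+naive-bayes.dml code), the same computation can be written in Python/NumPy as follows;
+the function name and the dense NumPy representation are assumptions made only for exposition:
+\begin{verbatim}
+import numpy as np
+
+def train_multinomial_nb(X, y, laplace=1.0):
+    # X: n x m matrix of feature counts, y: length-n vector of labels 1..k
+    classes = np.unique(y)
+    n, m = X.shape
+    prior = np.zeros(len(classes))
+    conditionals = np.zeros((len(classes), m))
+    for j, c in enumerate(classes):
+        Xc = X[y == c]
+        prior[j] = Xc.shape[0] / n                 # class prior pi_y
+        counts = Xc.sum(axis=0) + laplace          # add-one smoothing by default
+        conditionals[j] = counts / counts.sum()    # theta_{iy}, sums to 1 per class
+    return prior, conditionals
+\end{verbatim}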
+
+This implementation is sometimes referred to as \emph{multinomial} naive Bayes. Other
+flavours of naive Bayes are also popular.
+\\
+
+\noindent{\bf Returns}
+
+The learnt model produced by naive-bayes.dml is stored in two separate files. 
+The first file stores the class prior (a single-column matrix). The second file 
+stores the class conditional probabilities organized into a matrix with as many 
+rows as there are class labels and as many columns as there are features. 
+Depending on what arguments are provided during invocation, naive-bayes-predict.dml 
+may compute one or more of probabilities, accuracy and confusion matrix in the 
+output format specified. 
+\\
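+
+For illustration only, scoring with such a prior vector and conditionals matrix can be
+sketched in Python/NumPy as below; the function name and the 1-based label convention are
+assumptions mirroring the description above, not the actual naive-bayes-predict.dml code:
+\begin{verbatim}
+import numpy as np
+
+def predict_multinomial_nb(X, prior, conditionals):
+    # Score each feature-count row of X against every class in log space;
+    # smoothing guarantees all entries of prior and conditionals are > 0.
+    log_scores = np.log(prior) + X @ np.log(conditionals).T
+    return log_scores.argmax(axis=1) + 1           # labels are 1-based
+\end{verbatim}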
+
+\noindent{\bf Examples}
+
+\begin{verbatim}
+hadoop jar SystemML.jar -f naive-bayes.dml -nvargs 
+                           X=/user/biadmin/X.mtx 
+                           Y=/user/biadmin/y.mtx 
+                           laplace=1 fmt=csv
+                           prior=/user/biadmin/prior.csv
+                           conditionals=/user/biadmin/conditionals.csv
+                           accuracy=/user/biadmin/accuracy.csv
+\end{verbatim}
+
+\begin{verbatim}
+hadoop jar SystemML.jar -f naive-bayes-predict.dml -nvargs 
+                           X=/user/biadmin/X.mtx 
+                           Y=/user/biadmin/y.mtx 
+                           prior=/user/biadmin/prior.csv
+                           conditionals=/user/biadmin/conditionals.csv
+                           fmt=csv
+                           accuracy=/user/biadmin/accuracy.csv
+                           probabilities=/user/biadmin/probabilities.csv
+                           confusion=/user/biadmin/confusion.csv
+\end{verbatim}
+
+\noindent{\bf References}
+
+\begin{itemize}
+\item S. Russell and P. Norvig. \newblock{\em Artificial Intelligence: A Modern Approach.} Prentice Hall, 2009.
+\item A. McCallum and K. Nigam. \newblock{\em A Comparison of Event Models for Naive Bayes Text Classification.} 
+\newblock AAAI-98 Workshop on Learning for Text Categorization, 1998.
+\end{itemize}

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/a891c0a1/Algorithms Reference/PCA.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/PCA.tex b/Algorithms Reference/PCA.tex
new file mode 100644
index 0000000..f85879d
--- /dev/null
+++ b/Algorithms Reference/PCA.tex	
@@ -0,0 +1,121 @@
+\subsection{Principal Component Analysis}
+\label{pca}
+
+\noindent{\bf Description}
+
+Principal Component Analysis (PCA) is a simple, non-parametric method to transform the given data set with possibly correlated columns into a set of linearly uncorrelated or orthogonal columns, called {\em principal components}. The principal components are ordered in such a way that the first component accounts for the largest possible variance, followed by the remaining principal components in decreasing order of the amount of variance captured from the data. PCA is often used as a dimensionality reduction technique, where the original data is projected or rotated onto a low-dimensional space with basis vectors defined by the top-$K$ (for a given value of $K$) principal components.
+\\
+
+\noindent{\bf Usage}
+
+\begin{tabbing}
+\texttt{-f} \textit{path}/\texttt{PCA.dml -nvargs} 
+\=\texttt{INPUT=}\textit{path}/\textit{file} 
+  \texttt{K=}\textit{int} \\
+\>\texttt{CENTER=}\textit{0/1}
+  \texttt{SCALE=}\textit{0/1}\\
+\>\texttt{PROJDATA=}\textit{0/1}
+  \texttt{OFMT=}\textit{csv}/\textit{text}\\
+\>\texttt{MODEL=}\textit{path}$\vert$\textit{file}
+  \texttt{OUTPUT=}\textit{path}/\textit{file}
+\end{tabbing}
+
+\noindent{\bf Arguments}
+
+\begin{itemize}
+\item INPUT: Location (on HDFS) to read the input matrix.
+\item K: Indicates the dimension of the new vector space constructed from $K$ principal components. It must be a value between $1$ and the number of columns in the input data.
+\item CENTER (default: {\tt 0}): Indicates whether or not to {\em center} the input data prior to the computation of principal components.
+\item SCALE (default: {\tt 0}): Indicates whether or not to {\em scale} the input data prior to the computation of principal components.
+\item PROJDATA: Indicates whether or not the input data must be projected onto the new vector space defined by the principal components.
+\item OFMT (default: {\tt csv}): Specifies the output format. Choice of comma-separated values (csv) or sparse matrix (text).
+\item MODEL: Either the location (on HDFS) where the computed model is stored; or the location of an existing model.
+\item OUTPUT: Location (on HDFS) to store the data rotated on to the new vector space.
+\end{itemize}
+
+\noindent{\bf Details}
+
+Principal Component Analysis (PCA) is a non-parametric procedure for an orthogonal linear transformation of the input data to a new coordinate system, such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. In other words, PCA first selects a normalized direction in $m$-dimensional space ($m$ is the number of columns in the input data) along which the variance in the input data is maximized -- this is referred to as the first principal component. It then repeatedly finds other directions (principal components) in which the variance is maximized. At every step, PCA restricts the search to only those directions that are perpendicular to all previously selected directions. By doing so, PCA aims to reduce the redundancy among input variables. To understand the notion of redundancy, consider an extreme scenario with a data set consisting of two variables, where the first one denotes some quantity expressed in meters, and the other variable represents the same quantity but in inches. Both these variables evidently capture redundant information, and hence one of them can be removed. In a general scenario, keeping only a few suitable linear combinations of the input variables would both express the data more concisely and reduce the number of variables. This is why PCA is often used as a dimensionality reduction technique.
+
+The specific method to compute such a new coordinate system is as follows -- compute a covariance matrix $C$ that measures the strength of correlation among all pairs of variables in the input data; factorize $C$ via eigenvalue decomposition to compute its eigenvalues and eigenvectors; and finally, order the eigenvectors in decreasing order of their corresponding eigenvalues. The computed eigenvectors (also known as {\em loadings}) define the new coordinate system, the eigenvalues give the amount of variance in the input data explained by each coordinate or eigenvector, and their square roots give the corresponding standard deviations. 
+\\
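+
+As a purely illustrative sketch of this procedure (not the PCA.dml implementation), the
+computation can be written in Python/NumPy as follows; the function name, argument names,
+and defaults are assumptions made only for exposition:
+\begin{verbatim}
+import numpy as np
+
+def pca(A, K, center=True, scale=True):
+    # Optionally center and scale the columns of A (n x m).
+    if center:
+        A = A - A.mean(axis=0)
+    if scale:
+        A = A / A.std(axis=0, ddof=1)
+    C = np.cov(A, rowvar=False)              # m x m covariance matrix
+    eig_vals, eig_vecs = np.linalg.eigh(C)   # eigenvalues in ascending order
+    order = np.argsort(eig_vals)[::-1][:K]   # keep the K largest
+    vectors = eig_vecs[:, order]             # loadings: the new basis vectors
+    values = eig_vals[order]                 # variance explained per component
+    projected = A @ vectors                  # data in the new coordinate system
+    return vectors, values, projected
+\end{verbatim}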
+
+%As an example, consider the data in Table~\ref{tab:pca_data}. 
+\begin{comment}
+\begin{table}
+\parbox{.35\linewidth}{
+\centering
+\begin{tabular}{cc}
+  \hline
+  x & y \\
+  \hline
+  2.5 & 2.4  \\
+  0.5 & 0.7  \\
+  2.2 & 2.9  \\
+  1.9 & 2.2  \\
+  3.1 & 3.0  \\
+  2.3 & 2.7  \\
+  2 & 1.6  \\
+  1 & 1.1  \\
+  1.5 & 1.6  \\
+  1.1 & 0.9  \\
+	\hline
+\end{tabular}
+\caption{Input Data}
+\label{tab:pca_data}
+}
+\hfill
+\parbox{.55\linewidth}{
+\centering
+\begin{tabular}{cc}
+  \hline
+  x & y \\
+  \hline
+  .69  & .49  \\
+  -1.31  & -1.21  \\
+  .39  & .99  \\
+  .09  & .29  \\
+  1.29  & 1.09  \\
+  .49  & .79  \\
+  .19  & -.31  \\
+  -.81  & -.81  \\
+  -.31  & -.31  \\
+  -.71  & -1.01  \\
+  \hline
+\end{tabular}
+\caption{Data after centering and scaling}
+\label{tab:pca_scaled_data}
+}
+\end{table}
+\end{comment}
+
+\noindent{\bf Returns}
+
+When MODEL is not provided, the PCA procedure is applied to the INPUT data to generate MODEL as well as the rotated data OUTPUT (if PROJDATA is set to $1$) in the new coordinate system. 
+The produced model consists of the basis vectors MODEL$/dominant.eigen.vectors$ of the new coordinate system; the eigenvalues MODEL$/dominant.eigen.values$; and the standard deviations MODEL$/dominant.eigen.standard.deviations$ of the principal components.
+When MODEL is provided, the INPUT data is rotated according to the coordinate system defined by MODEL$/dominant.eigen.vectors$. The resulting data is stored at location OUTPUT.
+\\
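+
+Continuing the illustrative sketch above, applying an existing model to new data amounts to a
+single matrix multiplication; the variable names below are hypothetical, and the held-out data is
+assumed to be centered and scaled consistently with the training data:
+\begin{verbatim}
+# Rotate held-out data using previously computed loadings (illustrative only).
+test_projected = A_test @ vectors
+\end{verbatim}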
+
+\noindent{\bf Examples}
+
+\begin{verbatim}
+hadoop jar SystemML.jar -f PCA.dml -nvargs 
+            INPUT=/user/biuser/input.mtx  K=10
+            CENTER=1  SCALE=1
+            OFMT=csv PROJDATA=1
+            # location to store model and rotated data
+            OUTPUT=/user/biuser/pca_output/   
+\end{verbatim}
+
+\begin{verbatim}
+hadoop jar SystemML.jar -f PCA.dml -nvargs 
+            INPUT=/user/biuser/test_input.mtx  K=10
+            CENTER=1  SCALE=1
+            OFMT=csv PROJDATA=1
+            # location of an existing model
+            MODEL=/user/biuser/pca_output/       
+            # location of rotated data
+            OUTPUT=/user/biuser/test_output.mtx  
+\end{verbatim}
+
+
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/a891c0a1/Algorithms Reference/SystemML_Algorithms_Reference.bib
----------------------------------------------------------------------
diff --git a/Algorithms Reference/SystemML_Algorithms_Reference.bib b/Algorithms Reference/SystemML_Algorithms_Reference.bib
new file mode 100644
index 0000000..8dce564
--- /dev/null
+++ b/Algorithms Reference/SystemML_Algorithms_Reference.bib	
@@ -0,0 +1,138 @@
+
+@article {Lin2008:logistic,
+   author       = {Chih-Jen Lin and Ruby C.\ Weng and S.\ Sathiya Keerthi},
+   title        = {Trust Region {N}ewton Method for Large-Scale Logistic Regression},
+   journal      = {Journal of Machine Learning Research},
+   month        = {April},
+   year         = {2008},
+   volume       = {9},
+   pages        = {627--650}
+}
+
+@book {Agresti2002:CDA,
+   author       = {Alan Agresti},
+   title        = {Categorical Data Analysis},
+   edition      = {Second},
+   series       = {Wiley Series in Probability and Statistics},
+   publisher    = {Wiley-Interscience},
+   year         = {2002},
+   pages        = {710}
+}
+
+@article {Nelder1972:GLM,
+   author       = {John Ashworth Nelder and Robert William Maclagan Wedderburn},
+   title        = {Generalized Linear Models},
+   journal      = {Journal of the Royal Statistical Society, Series~A (General)},
+   year         = {1972},
+   volume       = {135},
+   number       = {3},
+   pages        = {370--384}
+}
+
+@book {McCullagh1989:GLM,
+   author       = {Peter McCullagh and John Ashworth Nelder},
+   title        = {Generalized Linear Models},
+   edition      = {Second},
+   series       = {Monographs on Statistics and Applied Probability},
+   number       = {37},
+   year         = {1989},
+   publisher    = {Chapman~\&~Hall/CRC}, 
+   pages        = {532}
+}
+
+@book {Gill2000:GLM,
+   author       = {Jeff Gill},
+   title        = {Generalized Linear Models: A Unified Approach},
+   series       = {Sage University Papers Series on Quantitative Applications in the Social Sciences},
+   number       = {07-134},
+   year         = {2000},
+   publisher    = {Sage Publications},
+   pages        = {101}
+}
+
+@inproceedings {AgrawalKSX2002:hippocratic,
+   author       = {Rakesh Agrawal and Jerry Kiernan and Ramakrishnan Srikant and Yirong Xu},
+   title        = {Hippocratic Databases},
+   booktitle    = {Proceedings of the 28-th International Conference on Very Large Data Bases ({VLDB} 2002)},
+   address      = {Hong Kong, China},
+   month        = {August 20--23},
+   year         = {2002},
+   pages        = {143--154}
+}
+
+@book {Nocedal2006:Optimization,
+   title        = {Numerical Optimization},
+   author       = {Jorge Nocedal and Stephen Wright},
+   series       = {Springer Series in Operations Research and Financial Engineering},
+   pages        = {664},
+   edition      = {Second},
+   publisher    = {Springer},
+   year         = {2006}
+}
+
+@book {Hartigan1975:clustering,
+   author       = {John A.\ Hartigan},
+   title        = {Clustering Algorithms},
+   publisher    = {John Wiley~\&~Sons Inc.},
+   series       = {Probability and Mathematical Statistics},
+   month        = {April},
+   year         = {1975},
+   pages        = {365}
+}
+
+@inproceedings {ArthurVassilvitskii2007:kmeans,
+   title        = {{\tt k-means++}: The Advantages of Careful Seeding},
+   author       = {David Arthur and Sergei Vassilvitskii},
+   booktitle    = {Proceedings of the 18th Annual {ACM-SIAM} Symposium on Discrete Algorithms ({SODA}~2007)},
+   month        = {January 7--9}, 
+   year         = {2007},
+   address      = {New Orleans~{LA}, {USA}},
+   pages        = {1027--1035}
+}
+
+@article {AloiseDHP2009:kmeans,
+   author       = {Daniel Aloise and Amit Deshpande and Pierre Hansen and Preyas Popat},
+   title        = {{NP}-hardness of {E}uclidean Sum-of-squares Clustering},
+   journal      = {Machine Learning},
+   publisher    = {Kluwer Academic Publishers},
+   volume       = {75},
+   number       = {2}, 
+   month        = {May}, 
+   year         = {2009},
+   pages        = {245--248}
+}
+
+@article {Cochran1954:chisq,
+   author       = {William G.\ Cochran},
+   title        = {Some Methods for Strengthening the Common $\chi^2$ Tests},
+   journal      = {Biometrics},
+   volume       = {10},
+   number       = {4},
+   month        = {December},
+   year         = {1954},
+   pages        = {417--451}
+}
+
+@article {AcockStavig1979:CramersV,
+   author       = {Alan C.\ Acock and Gordon R.\ Stavig},
+   title        = {A Measure of Association for Nonparametric Statistics},
+   journal      = {Social Forces},
+   publisher    = {Oxford University Press},
+   volume       = {57},
+   number       = {4},
+   month        = {June},
+   year         = {1979},
+   pages        = {1381--1386}
+}
+
+@article {Stevens1946:scales,
+   author       = {Stanley Smith Stevens},
+   title        = {On the Theory of Scales of Measurement},
+   journal      = {Science},
+   month        = {June 7},
+   year         = {1946},
+   volume       = {103},
+   number       = {2684},
+   pages        = {677--680}
+}
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/a891c0a1/Algorithms Reference/SystemML_Algorithms_Reference.pdf
----------------------------------------------------------------------
diff --git a/Algorithms Reference/SystemML_Algorithms_Reference.pdf b/Algorithms Reference/SystemML_Algorithms_Reference.pdf
new file mode 100644
index 0000000..4d4ea6a
Binary files /dev/null and b/Algorithms Reference/SystemML_Algorithms_Reference.pdf differ

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/a891c0a1/Algorithms Reference/SystemML_Algorithms_Reference.tex
----------------------------------------------------------------------
diff --git a/Algorithms Reference/SystemML_Algorithms_Reference.tex b/Algorithms Reference/SystemML_Algorithms_Reference.tex
new file mode 100644
index 0000000..7dccad0
--- /dev/null
+++ b/Algorithms Reference/SystemML_Algorithms_Reference.tex	
@@ -0,0 +1,142 @@
+\documentclass[letter]{article}
+\usepackage{graphicx,amsmath,amssymb,amsthm,subfigure,color,url,multirow,rotating,comment}
+%\usepackage{tikz}
+\usepackage[normalem]{ulem}
+\usepackage[np,autolanguage]{numprint}
+
+\usepackage[]{hyperref}
+\hypersetup{
+    unicode=false,          % non-Latin characters in Acrobat's bookmarks
+    pdftoolbar=true,        % show Acrobat's toolbar?
+    pdfmenubar=true,        % show Acrobat's menu?
+    pdffitwindow=true,      % window fit to page when opened
+    pdfstartview={FitV},    % fits the height of the page to the window
+    pdftitle={SystemML Algorithms Reference},    % title
+    pdfauthor={SystemML Team}, % author
+    pdfsubject={Documentation},   % subject of the document
+    pdfkeywords={},         % list of keywords
+    pdfnewwindow=true,      % links in new window
+    bookmarksnumbered=true, % put section numbers in bookmarks
+    bookmarksopen=true,     % open up bookmark tree
+    bookmarksopenlevel=1,   % \maxdimen level to which bookmarks are open
+    colorlinks=true,        % false: boxed links; true: colored links
+    linkcolor=black,        % color of internal links  
+    citecolor=blue,         % color of links to bibliography
+    filecolor=black,        % color of file links
+    urlcolor=black          % color of external links
+}
+
+
+\newtheorem{definition}{Definition}
+\newtheorem{example}{Example}
+
+\newcommand{\Paragraph}[1]{\vspace*{1ex} \noindent {\bf #1} \hspace*{1ex}}
+\newenvironment{Itemize}{\vspace{-0.5ex}\begin{itemize}\setlength{\itemsep}{-0.2ex}
+}{\end{itemize}\vspace{-0.5ex}}
+\newenvironment{Enumerate}{\vspace{-0.5ex}\begin{enumerate}\setlength{\itemsep}{-0.2ex}
+}{\end{enumerate}\vspace{-0.5ex}}
+\newenvironment{Description}{\vspace{-0.5ex}\begin{description}\setlength{\itemsep}{-0.2ex}
+}{\end{description}\vspace{-0.5ex}}
+
+
+\newcommand{\SystemML}{\texttt{SystemML} }
+\newcommand{\hml}{\texttt{hadoop jar SystemML.jar} }
+\newcommand{\pxp}{\mathbin{\texttt{\%\textasteriskcentered\%}}}
+\newcommand{\todo}[1]{{{\color{red}TODO: #1}}}
+\newcommand{\Normal}{\ensuremath{\mathop{\mathrm{Normal}}\nolimits}}
+\newcommand{\Prob}{\ensuremath{\mathop{\mathrm{Prob}\hspace{0.5pt}}\nolimits}}
+\newcommand{\E}{\ensuremath{\mathop{\mathrm{E}}\nolimits}}
+\newcommand{\mean}{\ensuremath{\mathop{\mathrm{mean}}\nolimits}}
+\newcommand{\Var}{\ensuremath{\mathop{\mathrm{Var}}\nolimits}}
+\newcommand{\Cov}{\ensuremath{\mathop{\mathrm{Cov}}\nolimits}}
+\newcommand{\stdev}{\ensuremath{\mathop{\mathrm{st.dev}}\nolimits}}
+\newcommand{\atan}{\ensuremath{\mathop{\mathrm{arctan}}\nolimits}}
+\newcommand{\diag}{\ensuremath{\mathop{\mathrm{diag}}\nolimits}}
+\newcommand{\const}{\ensuremath{\mathop{\mathrm{const}}\nolimits}}
+\newcommand{\eps}{\varepsilon}
+
+\sloppy
+
+%%%%%%%%%%%%%%%%%%%%% 
+% header
+%%%%%%%%%%%%%%%%%%%%%
+
+\title{\LARGE{{\SystemML Algorithms Reference}}} 
+\date{\today}
+
+%%%%%%%%%%%%%%%%%%%%%
+% document start
+%%%%%%%%%%%%%%%%%%%%%
+\begin{document}	
+
+%\pagenumbering{roman}
+\maketitle
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Descriptive Statistics}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\input{DescriptiveStats}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Classification}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\input{LogReg}
+
+\subsection{Support Vector Machines}
+
+\input{BinarySVM}
+
+\input{MultiSVM}
+
+\input{NaiveBayes}
+
+%\input{DecisionTrees}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Clustering}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\input{Kmeans}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Regression}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\input{LinReg}
+
+\newpage
+
+\input{GLM}
+
+\newpage
+
+\input{GLMpredict.tex}
+
+\newpage
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Matrix Factorization}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\input{PCA}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%{\color{red}\section{Sequence Mining}}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%\section{Matrix Factorization}
+
+%%{\color{red}\subsection{GNMF}}
+
+\bibliographystyle{abbrv}
+
+\bibliography{SystemML_Algorithms_Reference}
+
+	
+%%%%%%%%%%%%%%%%%%%%%
+% document end
+%%%%%%%%%%%%%%%%%%%%%
+\end{document}
+
+

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/a891c0a1/Language Reference/PyDML Language Reference.doc
----------------------------------------------------------------------
diff --git a/Language Reference/PyDML Language Reference.doc b/Language Reference/PyDML Language Reference.doc
new file mode 100644
index 0000000..b43b6db
Binary files /dev/null and b/Language Reference/PyDML Language Reference.doc differ

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/a891c0a1/Language Reference/Python syntax for DML.doc
----------------------------------------------------------------------
diff --git a/Language Reference/Python syntax for DML.doc b/Language Reference/Python syntax for DML.doc
new file mode 100644
index 0000000..ee43a6b
Binary files /dev/null and b/Language Reference/Python syntax for DML.doc differ

http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/a891c0a1/Language Reference/README.txt
----------------------------------------------------------------------
diff --git a/Language Reference/README.txt b/Language Reference/README.txt
new file mode 100644
index 0000000..0f22aa6
--- /dev/null
+++ b/Language Reference/README.txt	
@@ -0,0 +1,87 @@
+Usage
+-----
+The machine learning algorithms described in 
+$BIGINSIGHTS_HOME/machine-learning/docs/SystemML_Algorithms_Reference.pdf can be invoked
+from the hadoop command line using the described algorithm-specific parameters. 
+
+Generic command-line arguments are described by the help command below.
+
+   hadoop jar SystemML.jar -? or -help 
+
+
+Recommended configurations
+--------------------------
+1) JVM Heap Sizes: 
+We recommend an equal-sized JVM configuration for clients, mappers, and reducers. For the client
+process this can be done via 
+
+   export HADOOP_CLIENT_OPTS="-Xmx2048m -Xms2048m -Xmn256m" 
+   
+where Xmx specifies the maximum heap size, Xms the initial heap size, and Xmn the size of the young 
+generation. For Xmn values equal to or less than 15% of the maximum heap size, we guarantee the memory budget.
+
+The above option may also be set through BigR by setting the "ml.jvm" option, e.g.
+   bigr.set.server.option("jaql.fence.jvm.parameters", "-Xmx2g -Xms2g -Xmn256m")
+
+For mapper or reducer JVM configurations, the following properties can be specified in mapred-site.xml, 
+where 'child' refers to both mapper and reducer. If map and reduce are specified individually, they take 
+precedence over the generic property.
+
+  <property>
+    <name>mapreduce.child.java.opts</name> <!-- synonym: mapred.child.java.opts -->
+    <value>-Xmx2048m -Xms2048m -Xmn256m</value>
+  </property>
+  <property>
+    <name>mapreduce.map.java.opts</name> <!-- synonym: mapred.map.java.opts -->
+    <value>-Xmx2048m -Xms2048m -Xmn256m</value>
+  </property>
+  <property>
+    <name>mapreduce.reduce.java.opts</name> <!-- synonym: mapred.reduce.java.opts -->
+    <value>-Xmx2048m -Xms2048m -Xmn256m</value>
+  </property>
+ 
+
+2) CP Memory Limitation:
+There exist size limitations for in-memory matrices. Dense in-memory matrices are limited to 16GB 
+independent of their dimensions. Sparse in-memory matrices are limited to 2G rows and 2G columns, 
+but the overall matrix can be larger. These limitations apply only to in-memory matrices, NOT to 
+matrices stored on HDFS or involved in MR computations. Setting HADOOP_CLIENT_OPTS below these 
+limitations prevents runtime errors.
+
+3) Transparent Huge Pages (on Red Hat Enterprise Linux 6):
+Hadoop workloads might show very high system CPU utilization if THP is enabled. In case of such 
+behavior, we recommend disabling THP with
+   
+   echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
+   
+4) JVM Reuse:
+Performance benefits from JVM reuse because data sets that fit into the mapper memory budget are 
+reused across tasks per slot. However, in Hadoop 1.0.3 JVM reuse is incompatible with security (when 
+using the LinuxTaskController); the workaround is to use the DefaultTaskController. SystemML provides 
+a configuration property in $BIGINSIGHTS_HOME/machine-learning/SystemML-config.xml to enable JVM reuse 
+on a per-job level without changing the global cluster configuration. 
+   
+   <jvmreuse>false</jvmreuse> 
+   
+5) Number of Reducers:
+The number of reducers can have significant impact on performance. SystemML provides a configuration
+property to set the default number of reducers per job without changing the global cluster configuration.
+In general, we recommend a setting of twice the number of nodes. Smaller numbers create fewer intermediate
+files; larger numbers increase the degree of parallelism for compute and parallel writes. In 
+$BIGINSIGHTS_HOME/machine-learning/SystemML-config.xml, set:
+   
+   <!-- default number of reduce tasks per MR job, default: 2 x number of nodes -->
+   <numreducers>12</numreducers> 
+
+6) SystemML temporary directories:
+SystemML uses temporary directories in two different locations: (1) on the local file system for temporary files of 
+the client process, and (2) on HDFS for intermediate results between different MR jobs and between MR jobs 
+and in-memory operations. Locations of these directories can be configured in 
+$BIGINSIGHTS_HOME/machine-learning/SystemML-config.xml with the following properties
+
+   <!-- local fs tmp working directory-->
+   <localtmpdir>/tmp/systemml</localtmpdir>
+
+   <!-- hdfs tmp working directory--> 
+   <scratch>scratch_space</scratch> 
+ 
\ No newline at end of file