You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@systemml.apache.org by du...@apache.org on 2015/12/02 02:04:59 UTC
[21/47] incubator-systemml git commit: [SYSML-53] - Add SystemML
Quick Start Guide
[SYSML-53] - Add SystemML Quick Start Guide
Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/f4b41e0d
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/f4b41e0d
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/f4b41e0d
Branch: refs/heads/gh-pages
Commit: f4b41e0db8e35debcddf5913c5cfca5e4f4e5ed9
Parents: 66eb331
Author: Christian Kadner <ck...@us.ibm.com>
Authored: Tue Sep 8 17:47:28 2015 -0700
Committer: Luciano Resende <lr...@apache.org>
Committed: Tue Sep 8 17:47:28 2015 -0700
----------------------------------------------------------------------
index.md | 3 +-
quick-start-guide.md | 398 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 399 insertions(+), 2 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/f4b41e0d/index.md
----------------------------------------------------------------------
diff --git a/index.md b/index.md
index 4dbaf86..49f69d8 100644
--- a/index.md
+++ b/index.md
@@ -16,8 +16,7 @@ and (3) automatic optimization.
For more information about SystemML, please consult the following references:
* [SystemML GitHub README](http://www.github.com/SparkTC/systemml)
-* Quick Start - **Coming Soon**
+* [Quick Start Guide](quick-start-guide.html)
* [Algorithms Reference](algorithms-reference.html)
* [DML (Declarative Machine Learning) Language Reference](dml-language-reference.html)
* PYDML (Python-Like Declarative Machine Learning) Language Reference - **Coming Soon**
-
http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/f4b41e0d/quick-start-guide.md
----------------------------------------------------------------------
diff --git a/quick-start-guide.md b/quick-start-guide.md
new file mode 100644
index 0000000..c8d5f15
--- /dev/null
+++ b/quick-start-guide.md
@@ -0,0 +1,398 @@
+---
+layout: global
+title: SystemML Quick Start Guide
+author: Christian Kadner
+description: SystemML Quick Start Guide
+displayTitle: SystemML Quick Start Guide
+---
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+<br/>
+
+This tutorial provides a quick introduction to using SystemML by
+running existing SystemML algorithms in standalone mode. More information
+about running SystemML in distributed execution mode (Hadoop, Spark) will
+be added soon.
+For more in depth information, please refer to the
+[Algorithms Reference](algorithms-reference.html) and the
+[SystemML Language Reference](dml-language-reference.html).
+
+
+<br/>
+
+# What is SystemML
+
+SystemML enables large-scale machine learning (ML) via a high-level declarative
+language with R-like syntax called [DML](dml-language-reference.html).
+This language allows Data Scientists to
+express their ML algorithms with full flexibility but without the need to fine-tune
+distributed runtime execution plans and system configurations.
+These ML programs are dynamically compiled and optimized based on data
+and cluster characteristics using rule and cost-based optimization techniques.
+The compiler automatically generates hybrid runtime execution plans ranging
+from in-memory, single node execution to distributed computation for Hadoop M/R
+or Spark Batch execution.
+SystemML features a suite of algorithms for Descriptive Statistics, Classification,
+Clustering, Regression, and Matrix Factorization. Detailed descriptions of these
+algorithms can be found in the [Algorithms Reference](algorithms-reference.html).
+
+<br/>
+
+# Distributed vs Standalone Execution Mode
+
+SystemML can operate in standalone mode, allowing data
+scientists to quikly develop algorithms on a single machine.
+For large-scale production environments, SystemML algorithm execution can be
+distributed across a multi-node cluster using [Apache Hadoop](https://hadoop.apache.org/)
+or [Apache Spark](http://spark.apache.org/).
+We will make use of standalone mode throughout this tutorial.
+
+The examples described in `docs/SystemML_Algorithms_Reference.pdf` are written
+primarily for the hadoop distributed environment. To run those examples in
+standalone mode, modify the commands by replacing "`hadoop jar SystemML.jar -f ...`"
+with "`./runStandaloneSystemML.sh ...`" on Mac/UNIX or
+with "`./runStandaloneSystemML.bat ...`" on Windows.
+
+<br/>
+
+# Contents of the SystemML Standalone Package
+
+To follow along with this guide, first build a standalone package of SystemML
+{% comment %} TODO: Where to download a packaged release of SystemML, ...standalone jar
+
+ $ wget http://spark.tc/system-ml/system-ml-LATEST-standalone.tar.gz -O - | tar -xz -C ~/systemml-tutorial
+
+
+or build it from source {% endcomment %} using [Apache Maven](http://maven.apache.org)
+and unpack it to your working directory, i.e. ```~/systemml-tutorial```.
+
+ $ git clone https://github.com/SparkTC/systemml.git
+ $ mvn clean package
+ $ tar -xzf `find . -name 'system-ml*standalone.tar.gz'` -C ~/systemml-tutorial
+
+The extracted package should have these contents:
+
+ $ ls -lF ~/systemml-tutorial
+ total 56
+ drwxr-xr-x 23 algorithms/
+ drwxr-xr-x 5 docs/
+ drwxr-xr-x 35 lib/
+ -rw-r--r-- 1 log4j.properties
+ -rw-r--r-- 1 readme.txt
+ -rwxr-xr-x 1 runStandaloneSystemML.bat*
+ -rwxr-xr-x 1 runStandaloneSystemML.sh*
+ -rw-r--r-- 1 SystemML-config.xml
+
+Refer to `docs/SystemML_Algorithms_Reference.pdf` for more information
+about each algorithm included in the `algorithms` folder.
+
+For the rest of the tutorial we switch our working directory to ```~/systemml-tutorial```.
+
+ $ cd ~/systemml-tutorial
+
+<br/>
+
+# Choosing Test Data
+
+In this tutorial we will use the [Haberman's Survival Data Set](http://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival)
+which can be downloaded in CSV format from the [Center for Machine Learning and Intelligent Systems](http://cml.ics.uci.edu/)
+
+ $ wget -P data/ http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.data
+
+The [Haberman Data Set](http://archive.ics.uci.edu/ml/machine-learning-databases/haberman/haberman.names)
+has 306 instances and 4 attributes (including the class attribute):
+
+ 1. Age of patient at time of operation (numerical)
+ 2. Patient's year of operation (year - 1900, numerical)
+ 3. Number of positive axillary nodes detected (numerical)
+ 4. Survival status (class attribute)
+ * `1` = the patient survived 5 years or longer
+ * `2` = the patient died within 5 year
+
+
+We will need to create a metadata file (MTD) which stores metadata information
+about the content of the data file. The name of the MTD file associated with the
+data file `<filename>` must be `<filename>.mtd`.
+
+ $ echo '{"rows": 306, "cols": 4, "format": "csv"}' > data/haberman.data.mtd
+
+<br/>
+
+# Example 1 - Univariate Statistics
+
+Let's start with a simple example, computing certain [univariate statistics](systemml-algorithms-descriptive-statistics.html#univariate-statistics)
+for each feature column using the algorithm `Univar-Stats.dml` which requires 3
+[arguments](systemml-algorithms-descriptive-statistics.html#arguments):
+
+* `X`: location of the input data file to analyze
+* `TYPES`: location of the file that contains the feature column types encoded by integer numbers: `1` = scale, `2` = nominal, `3` = ordinal
+* `STATS`: location of the output matrix of computed statistics will be stored
+
+We need to create a file `types.csv` that describes the type of each column in
+the data along with it's metadata file `types.csv.mtd`.
+
+ $ echo '1,1,1,2' > data/types.csv
+ $ echo '{"rows": 1, "cols": 4, "format": "csv"}' > data/types.csv.mtd
+
+
+To run the `Univar-Stats.dml` algorithm, issue the following command:
+
+ $ ./runStandaloneSystemML.sh algorithms/Univar-Stats.dml -nvargs X=data/haberman.data TYPES=data/types.csv STATS=data/univarOut.mtx
+
+The resulting matrix has one row per each univariate statistic and one column
+per input feature. The output file `univarOut.mtx` describes that
+matrix. The elements of the first column denote the number of the statistic,
+the elements of the second column refer to the number of the feature column in
+the input data, and the elements of the third column show the value of the
+univariate statistic.
+
+ 1 1 30.0
+ 1 2 58.0
+ 2 1 83.0
+ 2 2 69.0
+ 2 3 52.0
+ 3 1 53.0
+ 3 2 11.0
+ 3 3 52.0
+ 4 1 52.45751633986928
+ 4 2 62.85294117647059
+ 4 3 4.026143790849673
+ 5 1 116.71458266366658
+ 5 2 10.558630665380907
+ 5 3 51.691117539912135
+ 6 1 10.803452349303281
+ 6 2 3.2494046632238507
+ 6 3 7.189653506248555
+ 7 1 0.6175922641866753
+ 7 2 0.18575610076612029
+ 7 3 0.41100513466216837
+ 8 1 0.20594669940735139
+ 8 2 0.051698529971741194
+ 8 3 1.7857418611299172
+ 9 1 0.1450718616532357
+ 9 2 0.07798443581479181
+ 9 3 2.954633471088322
+ 10 1 -0.6150152487211726
+ 10 2 -1.1324380182967442
+ 10 3 11.425776549251449
+ 11 1 0.13934809593495995
+ 11 2 0.13934809593495995
+ 11 3 0.13934809593495995
+ 12 1 0.277810485320835
+ 12 2 0.277810485320835
+ 12 3 0.277810485320835
+ 13 1 52.0
+ 13 2 63.0
+ 13 3 1.0
+ 14 1 52.16013071895425
+ 14 2 62.80392156862745
+ 14 3 1.2483660130718954
+ 15 4 2.0
+ 16 4 1.0
+ 17 4 1.0
+
+The following table lists the number and name of each univariate statistic. The row
+numbers below correspond to the elements of the first column in the output
+matrix above. The signs “+” show applicability to scale or/and to categorical
+features.
+
+ | Row | Name of Statistic | Scale | Categ. |
+ | :-: |:-------------------------- |:-----:| :-----:|
+ | 1 | Minimum | + | |
+ | 2 | Maximum | + | |
+ | 3 | Range | + | |
+ | 4 | Mean | + | |
+ | 5 | Variance | + | |
+ | 6 | Standard deviation | + | |
+ | 7 | Standard error of mean | + | |
+ | 8 | Coefficient of variation | + | |
+ | 9 | Skewness | + | |
+ | 10 | Kurtosis | + | |
+ | 11 | Standard error of skewness | + | |
+ | 12 | Standard error of kurtosis | + | |
+ | 13 | Median | + | |
+ | 14 | Inter quartile mean | + | |
+ | 15 | Number of categories | | + |
+ | 16 | Mode | | + |
+ | 17 | Number of modes | | + |
+
+
+<br/>
+<br/>
+
+# Example 2 - Binary-class Support Vector Machines
+
+Let's take the same `haberman.data` to explore the
+[binary-class support vector machines](systemml-algorithms-classification.html#binary-class-support-vector-machines) algorithm `l2-svm.dml`.
+This example also illustrates how to use of the sampling algorithm `sample.dml`
+and the data split algorithm `spliXY.dml`.
+
+## Sampling the Test Data
+
+First we need to use the `sample.dml` algorithm to separate the input into one
+training data set and one data set for model prediction.
+
+Parameters:
+
+ * `X` : (input) input data set: filename of input data set
+ * `sv` : (input) sampling vector: filename of 1-column vector w/ percentages. sum(sv) must be 1.
+ * `O` : (output) folder name w/ samples generated
+ * `ofmt` : (output) format of O: "csv", "binary" (default)
+
+
+We will create the file `perc.csv` and `perc.csv.mtd` to define the sampling vector with a sampling rate of
+50% to generate 2 data sets:
+
+ $ printf "0.5\n0.5" > data/perc.csv
+ $ echo '{"rows": 2, "cols": 1, "format": "csv"}' > data/perc.csv.mtd
+
+Let's run the sampling algortihm to create the two data samples:
+
+ $ ./runStandaloneSystemML.sh algorithms/utils/sample.dml -nvargs X=data/haberman.data sv=data/perc.csv O=data/haberman.part ofmt="csv"
+
+
+
+## Splitting Labels from Features
+
+Next we use the `splitXY.dml` algortithm to separate the feature columns from
+the label column(s).
+
+Parameters:
+
+ * `X` : (input) filename of data matrix
+ * `y` : (input) colIndex: starting index is 1
+ * `OX` : (output) filename of output matrix with all columns except y
+ * `OY` : (output) filename of output matrix with y column
+ * `ofmt` : (output) format of OX and OY output matrix: "csv", "binary" (default)
+
+We specify `y=4` as the 4th column contains the labels to be predicted and run
+the `splitXY.dml` algorithm on our training and test data sets.
+
+ $ ./runStandaloneSystemML.sh algorithms/utils/splitXY.dml -nvargs X=data/haberman.part/1 y=4 OX=data/haberman.train.data.csv OY=data/haberman.train.labels.csv ofmt="csv"
+
+ $ ./runStandaloneSystemML.sh algorithms/utils/splitXY.dml -nvargs X=data/haberman.part/2 y=4 OX=data/haberman.test.data.csv OY=data/haberman.test.labels.csv ofmt="csv"
+
+## Training and Testing the Model
+
+Now we need to train our model using the `l2-svm.dml` algorithm.
+
+[Parameters](systemml-algorithms-classification.html#arguments-1):
+
+ * `X` : (input) filename of training data features
+ * `Y` : (input) filename of training data labels
+ * `model` : (output) filename of model that contains the learnt weights
+ * `fmt` : (output) format of model: "csv", "text" (sparse-matrix)
+ * `Log` : (output) log file for metrics and progress while training
+ * `confusion` : (output) filename of confusion matrix computed using a held-out test set (optional)
+
+The `l2-svm.dml` algorithm is used on our training data sample to train the model.
+
+ $ ./runStandaloneSystemML.sh algorithms/l2-svm.dml -nvargs X=data/haberman.train.data.csv Y=data/haberman.train.labels.csv model=data/l2-svm-model.csv fmt="csv" Log=data/l2-svm-log.csv
+
+The `l2-svm-predict.dml` algorithm is used on our test data sample to predict the labels based on the trained model.
+
+ $ ./runStandaloneSystemML.sh algorithms/l2-svm-predict.dml -nvargs X=data/haberman.test.data.csv Y=data/haberman.test.labels.csv model=data/l2-svm-model.csv fmt="csv" confusion=data/l2-svm-confusion.csv
+
+The console output should show the accuracy of the trained model in percent, i.e.:
+
+ 15/09/01 01:32:51 INFO api.DMLScript: BEGIN DML run 09/01/2015 01:32:51
+ 15/09/01 01:32:51 INFO conf.DMLConfig: Updating localtmpdir with value /tmp/systemml
+ 15/09/01 01:32:51 INFO conf.DMLConfig: Updating scratch with value scratch_space
+ 15/09/01 01:32:51 INFO conf.DMLConfig: Updating optlevel with value 2
+ 15/09/01 01:32:51 INFO conf.DMLConfig: Updating numreducers with value 10
+ 15/09/01 01:32:51 INFO conf.DMLConfig: Updating jvmreuse with value false
+ 15/09/01 01:32:51 INFO conf.DMLConfig: Updating defaultblocksize with value 1000
+ 15/09/01 01:32:51 INFO conf.DMLConfig: Updating dml.yarn.appmaster with value false
+ 15/09/01 01:32:51 INFO conf.DMLConfig: Updating dml.yarn.appmaster.mem with value 2048
+ 15/09/01 01:32:51 INFO conf.DMLConfig: Updating dml.yarn.mapreduce.mem with value 2048
+ 15/09/01 01:32:51 INFO conf.DMLConfig: Updating dml.yarn.app.queue with value default
+ 15/09/01 01:32:51 INFO conf.DMLConfig: Updating cp.parallel.matrixmult with value true
+ 15/09/01 01:32:51 INFO conf.DMLConfig: Updating cp.parallel.textio with value true
+ Accuracy (%): 74.14965986394557
+ 15/09/01 01:32:52 INFO api.DMLScript: SystemML Statistics:
+ Total execution time: 0.130 sec.
+ Number of executed MR Jobs: 0.
+
+
+The generated file `l2-svm-confusion.csv` should contain the following confusion matrix of this form:
+
+ |t1 t2|
+ |t3 t4|
+
+ * The model correctly predicted label 1 `t1` times
+ * The model incorrectly predicted label 1 as opposed to label 2 `t2` times
+ * The model incorrectly predicted label 2 as opposed to label 1 `t3` times
+ * The model correctly predicted label 2 `t4` times.
+
+If the confusion matrix looks like this ..
+
+ 107.0,38.0
+ 0.0,2.0
+
+... then the accuracy of the model is (t1+t4)/(t1+t2+t3+t4) = (107+2)/107+38+0+2) = 0.741496599
+
+<br/>
+
+Refer to the *SystemML Algorithms Reference* (`docs/SystemML_Algorithms_Reference.pdf`) for more details.
+
+<br/>
+
+# Troubleshooting
+
+If you encounter a `"java.lang.OutOfMemoryError"` you can edit the invocation
+script (`runStandaloneSystemML.sh` or `runStandaloneSystemML.bat`) to increase
+the memory available to the JVM, i.e:
+
+ java -Xmx16g -Xms4g -Xmn1g -cp ${CLASSPATH} com.ibm.bi.dml.api.DMLScript \
+ -f ${SCRIPT_FILE} -exec singlenode -config=SystemML-config.xml \
+ $@
+
+<br/>
+
+# Next Steps
+
+Check out the [SystemML Algorithm Reference](systemml-algorithms.html) and run
+some of the other pre-packaged algorithms.
+
+<style type="text/css">
+<!--
+ ol { list-style-type: none; counter-reset: item; margin: 1em; }
+ ol > li { display: table; counter-increment: item; margin-bottom: 0.1em; }
+ ol > li:before { content: counters(item, ".") ". "; display: table-cell; padding-right: 0.6em; }
+ li ol > li { margin: 0; }
+ li ol > li:before { content: counters(item, ".") " "; }
+-->
+</style>
+
+1. Descriptive Statistics
+ - [Univariate Statistics](systemml-algorithms-descriptive-statistics.html#univariate-statistics) (`Univar-Stats.dml`)
+ - [Bivariate Statistics](systemml-algorithms-descriptive-statistics.html#bivariate-statistics) (`bivar-stats.dml`)
+ - [Stratified Bivariate Statistics](systemml-algorithms-descriptive-statistics.html#stratified-bivariate-statistics) (`stratstats.dml`)
+2. Classification
+ - [Multinomial Logistic Regression](systemml-algorithms-classification.html#multinomial-logistic-regression) (`MultiLogReg.dml`)
+ - [Binary-Class Support Vector Machines](systemml-algorithms-classification.html#binary-class-support-vector-machines) (`l2-svm.dml`, `l2-svm-predict.dml`)
+ - [Multi-Class Support Vector Machines](systemml-algorithms-classification.html#multi-class-support-vector-machines) (`m-svm.dml`, `m-svm-predict.dml`)
+ - [Naive Bayes](systemml-algorithms-classification.html#naive-bayes) (`naive-bayes.dml`, `naive-bayes-predict.dml`)
+3. Clustering
+ - [K-Means Clustering](systemml-algorithms-clustering.html#k-means-clustering) (`Kmeans.dml`, `Kmeans-predict.dml`)
+4. Regression
+ - [Linear Regression](systemml-algorithms-regression.html#linear-regression) (`LinearRegDS.dml`, `LinearRegCG.dml`)
+ - [Generalized Linear Models](systemml-algorithms-regression.html#generalized-linear-models) (`GLM.dml`)
+ - [Regression Scoring and Prediction](systemml-algorithms-regression.html#regression-scoring-and-prediction) (`GLM-predict.dml`)
+5. Matrix Factorization
+ - [Principle Component Analysis](systemml-algorithms-matrix-factorization.html#principle-component-analysis) (`PCA.dml`)
+6. Interpolation
+ - [Cubic Spline Interpolation](systemml-algorithms-interpolations.html#cubic-spline)
+
+
+{% comment %} TODO: update the algorithms in the standalone package {% endcomment %}
+
+{% comment %} TODO: update the algorithms reference doc to include all algorithms, currently missing decision-tree.dml {% endcomment %}
+
+<br/>
+
+Follow the [Programming Guide](systemml-programming-guide.html) to implement
+your own SystemML algorithms.
+