Posted to commits@systemml.apache.org by de...@apache.org on 2016/01/16 23:12:50 UTC

incubator-systemml git commit: Update README.md to be more correct

Repository: incubator-systemml
Updated Branches:
  refs/heads/master cdc39636f -> 09895c30a


Update README.md to be more correct

Change algorithm categories from 5 to 6 (add Survival Analysis).
Remove whitespace from end of lines.
Minor text updates.
Replace triple back ticks with single back ticks.
Update perc.csv info to make more sense.
Add Conclusion header.

Closes #43.


Project: http://git-wip-us.apache.org/repos/asf/incubator-systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-systemml/commit/09895c30
Tree: http://git-wip-us.apache.org/repos/asf/incubator-systemml/tree/09895c30
Diff: http://git-wip-us.apache.org/repos/asf/incubator-systemml/diff/09895c30

Branch: refs/heads/master
Commit: 09895c30aa7d35ff4f57891711087c4c61ef5522
Parents: cdc3963
Author: Deron Eriksson <de...@us.ibm.com>
Authored: Sat Jan 16 14:09:20 2016 -0800
Committer: Deron Eriksson <de...@us.ibm.com>
Committed: Sat Jan 16 14:09:20 2016 -0800

----------------------------------------------------------------------
 README.md | 102 ++++++++++++++++++++++++++++-----------------------------
 1 file changed, 51 insertions(+), 51 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-systemml/blob/09895c30/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index 5e7cc34..7f89aa9 100644
--- a/README.md
+++ b/README.md
@@ -19,6 +19,10 @@ limitations under the License.
 
 # SystemML
 
+SystemML is now an **Apache Incubator** project! Please see the [**Apache SystemML (incubating)**](http://systemml.apache.org/)
+website for more information. The latest project documentation can be found at the
+[**SystemML Documentation**](http://apache.github.io/incubator-systemml/) website on GitHub.
+
 SystemML is a flexible, scalable machine learning system.
 SystemML's distinguishing characteristics are:
 
@@ -26,10 +30,6 @@ SystemML's distinguishing characteristics are:
   2. **Multiple execution modes**, including Standalone, Spark Batch, Spark MLContext, Hadoop Batch, and JMLC.
   3. **Automatic optimization** based on data and cluster characteristics to ensure both efficiency and scalability.
 
-[SystemML](http://systemml.apache.org/) is now an Apache Incubator project.
-The latest project documentation can be found at the 
-[SystemML Documentation](http://apache.github.io/incubator-systemml/) website on GitHub.
-
 
 ### Algorithm Customizability
 
@@ -48,9 +48,10 @@ physical data representations.
 SystemML computations can be executed in a variety of different modes. To begin with, SystemML
 can be operated in Standalone mode on a single machine, allowing data scientists to develop
 algorithms locally without need of a distributed cluster. In order to scale up, algorithms can also be distributed
-across Spark or Hadoop.
-This flexibility allows the utilization of an organization's existing resources and expertise. In addition, SystemML
-can be operated via Scala, Java, and Python. SystemML also features an embedded API for scoring models.
+across a cluster using Spark or Hadoop.
+This flexibility allows the utilization of an organization's existing resources and expertise.
+In addition, SystemML features a Spark MLContext API that allows for programmatic interaction via Scala and Java.
+SystemML also features an embedded API for scoring models.
 
 
 ### Automatic Optimization
@@ -83,14 +84,14 @@ To build the SystemML distributions (`.tar.gz`, `.zip`, etc.), run:
 
 SystemML features a comprehensive set of integration tests. To perform these tests, run:
 
-    mvn verify 
+    mvn verify
 
 Note: these tests require [R](https://www.r-project.org/) to be installed and available as part of the PATH variable on
-the machine on which you are running these tests. 
+the machine on which you are running these tests.
 
 If required, please install the following packages in R:
 
-    install.packages(c("batch", "bitops", "boot", "caTools", "data.table", "doMC", "doSNOW", "ggplot2", "glmnet", "lda", "Matrix", "matrixStats", "moments", "plotrix", "psych", "reshape", "topicmodels", "wordcloud"), dependencies=TRUE) 
+    install.packages(c("batch", "bitops", "boot", "caTools", "data.table", "doMC", "doSNOW", "ggplot2", "glmnet", "lda", "Matrix", "matrixStats", "moments", "plotrix", "psych", "reshape", "topicmodels", "wordcloud"), dependencies=TRUE)
 
 
 * * *
@@ -98,20 +99,20 @@ If required, please install the following packages in R:
 ## Running SystemML in Standalone Mode
 
 SystemML can run in distributed mode as well as in local standalone mode. We'll operate in standalone mode in this
-guide. 
-After you built SystemML from source (```mvn clean package```) the standalone mode can be executed either on Mac/Unix 
-using the ```./bin/systemml``` script or on Windows using the ```.\bin\systemml.bat``` batch file. 
+guide.
+After you build SystemML from source (`mvn clean package`), the standalone mode can be executed either on Linux or OS X
+using the `./bin/systemml` script, or on Windows using the `.\bin\systemml.bat` batch file.
 
-If you run from the script from the project root folder ```./``` or from the ```./bin``` folder, then the output files
-from running SystemML will be created inside the ```./temp``` folder to keep them separate from the SystemML source
-files managed by Git. The output files for all of the examples in this guide will be created under the ```./temp```
+If you run from the script from the project root folder `./` or from the `./bin` folder, then the output files
+from running SystemML will be created inside the `./temp` folder to keep them separate from the SystemML source
+files managed by Git. The output files for all of the examples in this guide will be created under the `./temp`
 folder.
 
-The runtime behavior and logging behavior of SystemML can be customized by editing the files 
-```./conf/SystemML-config.xml``` and ```./conf/log4j.properties```. Both files will be created from their corresponding
-```*.template``` files during the first execution of the SystemML executable script.
+The runtime behavior and logging behavior of SystemML can be customized by editing the files
+`./conf/SystemML-config.xml` and `./conf/log4j.properties`. Both files will be created from their corresponding
+`*.template` files during the first execution of the SystemML executable script.
 
-When invoking the ```./bin/systemml``` or ```.\bin\systemml.bat``` with any of the prepackaged DML scripts you can omit
+When invoking the `./bin/systemml` or `.\bin\systemml.bat` with any of the prepackaged DML scripts you can omit
 the relative path to the DML script file. The following two commands are equivalent:
 
     ./bin/systemml ./scripts/datagen/genLinearRegressionData.dml -nvargs numSamples=1000 numFeatures=50 maxFeatureValue=5 maxWeight=5 addNoise=FALSE b=0 sparsity=0.7 output=linRegData.csv format=csv perc=0.5
@@ -125,10 +126,9 @@ In this guide we invoke the command with the relative folder to make it easier t
 
 ## ML Algorithms
 
-SystemML features a suite of algorithms that can be grouped into five broad categories:
-Descriptive Statistics, Classification, Clustering, Regression, and Matrix Factorization. Detailed descriptions of
-these algorithms can be found in the Algorithm Reference packaged with SystemML.
-
+SystemML features a suite of algorithms that can be grouped into six broad categories:
+Descriptive Statistics, Classification, Clustering, Regression, Matrix Factorization, and Survival Analysis.
+Detailed descriptions of these algorithms can be found in the SystemML Algorithms Reference.
 
 * * *
 
@@ -136,9 +136,9 @@ these algorithms can be found in the Algorithm Reference packaged with SystemML.
 
 As an example of the capabilities and power of SystemML and DML, let's consider the Linear Regression algorithm.
 We require sets of data to train and test our model. To obtain this data, we can either use real data or
-generate data for our algorithm. The 
+generate data for our algorithm. The
 [UCI Machine Learning Repository Datasets](https://archive.ics.uci.edu/ml/datasets.html) is one location for real data.
-Use of real data typically involves some degree of data wrangling. In the following example, we will use SystemML to 
+Use of real data typically involves some degree of data wrangling. In the following example, we will use SystemML to
 generate random data to train and test our model.
 
 This example consists of the following parts:
@@ -150,7 +150,7 @@ This example consists of the following parts:
   * [Train Model on First Sample](#train-model-on-first-sample)
   * [Test Model on Second Sample](#test-model-on-second-sample)
 
-SystemML is distributed in several packages, including a standalone package. We'll operate in Standalone mode in this 
+SystemML is distributed in several packages, including a standalone package. We'll operate in Standalone mode in this
 example.
 
 <a name="run-dml-script-to-generate-random-data" />
@@ -159,36 +159,33 @@ example.
 
 We can execute the `genLinearRegressionData.dml` script in Standalone mode using either the `systemml` or `systemml.bat`
 file.
-In this example, we'll generate a matrix of 1000 rows of 50 columns of test data, with sparsity 0.7. In addition to 
+In this example, we'll generate a matrix of 1000 rows of 50 columns of test data, with sparsity 0.7. In addition to
 this, a 51<sup>st</sup> column consisting of labels will
 be appended to the matrix.
 
     ./bin/systemml ./scripts/datagen/genLinearRegressionData.dml -nvargs numSamples=1000 numFeatures=50 maxFeatureValue=5 maxWeight=5 addNoise=FALSE b=0 sparsity=0.7 output=linRegData.csv format=csv perc=0.5
 
-This generates the following files inside the ```./temp``` folder:
+This generates the following files inside the `./temp` folder:
 
     linRegData.csv      # 1000 rows of 51 columns of doubles (50 data columns and 1 label column), csv format
-    linRegData.csv.mtd  # metadata file
-
+    linRegData.csv.mtd  # Metadata file
+    perc.csv            # Used to generate two subsets of the data (for training and testing)
+    perc.csv.mtd        # Metadata file
+    scratch_space       # SystemML scratch_space directory
 
 <a name="divide-generated-data-into-two-sample-groups" />
 
 ### Divide Generated Data into Two Sample Groups
 
-Next, we'll create two subsets of the generated data, each of size ~50%. We can accomplish this using the `sample.dml` 
-script.
-This script will randomly sample rows from the `linRegData.csv` file and place them into 2 files.
-
-To do this, we need to create a csv file for the `sv` named argument (see `sample.dml` for more details),
-which I called `perc.csv`. This file was generated in previous step and looks like:
+Next, we'll create two subsets of the generated data, each of size ~50%. We can accomplish this using the `sample.dml`
+script with the `perc.csv` file created in the previous step:
 
     0.5
     0.5
 
 
-This will create two sample groups of roughly 50 percent each. 
-
-Now, the `sample.dml` script can be run.
+The `sample.dml` script will randomly sample rows from the `linRegData.csv` file and place them into 2 files based
+on the percentages specified in `perc.csv`. This will create two sample groups of roughly 50 percent each.
 
     ./bin/systemml ./scripts/utils/sample.dml -nvargs X=linRegData.csv sv=perc.csv O=linRegDataParts ofmt=csv
 
@@ -246,7 +243,7 @@ This splits column 51 off the data, resulting in the following files:
 ### Train Model on First Sample
 
 Now, we can train our model based on the first sample. To do this, we utilize the `LinearRegDS.dml` (Linear Regression
-Direct Solve) script. Note that SystemML also includes a `LinearRegCG.dml` (Linear Regression Conjugate Gradient) 
+Direct Solve) script. Note that SystemML also includes a `LinearRegCG.dml` (Linear Regression Conjugate Gradient)
 algorithm for situations where the number of features is large.
 
     ./bin/systemml ./scripts/algorithms/LinearRegDS.dml -nvargs X=linRegData.train.data.csv Y=linRegData.train.labels.csv B=betas.csv fmt=csv
@@ -285,11 +282,11 @@ Now that we have our `betas.csv`, we can test our model with our second set of d
 
 To test our model on the second sample, we can use the `GLM-predict.dml` script. This script can be used for both
 prediction and scoring. Here, we're using it for scoring since we include the `Y` named argument. Our `betas.csv`
-file is specified as the `B` named argument.  
+file is specified as the `B` named argument.
 
     ./bin/systemml ./scripts/algorithms/GLM-predict.dml -nvargs X=linRegData.test.data.csv Y=linRegData.test.labels.csv B=betas.csv fmt=csv
 
-This generates the following statistics to standard output.
+This generates statistics similar to the following to standard output.
 
 	LOGLHOOD_Z,,FALSE,NaN
 	LOGLHOOD_Z_PVAL,,FALSE,NaN
@@ -324,22 +321,25 @@ to the value obtained from the model training phase.
 For convenience, we can encapsulate our DML invocations in a single script:
 
 	#!/bin/bash
-	
+
 	./bin/systemml ./scripts/datagen/genLinearRegressionData.dml -nvargs numSamples=1000 numFeatures=50 maxFeatureValue=5 maxWeight=5 addNoise=FALSE b=0 sparsity=0.7 output=linRegData.csv format=csv perc=0.5
-	
+
 	./bin/systemml ./scripts/utils/sample.dml -nvargs X=linRegData.csv sv=perc.csv O=linRegDataParts ofmt=csv
-	
+
 	./bin/systemml ./scripts/utils/splitXY.dml -nvargs X=linRegDataParts/1 y=51 OX=linRegData.train.data.csv OY=linRegData.train.labels.csv ofmt=csv
-	
+
 	./bin/systemml ./scripts/utils/splitXY.dml -nvargs X=linRegDataParts/2 y=51 OX=linRegData.test.data.csv OY=linRegData.test.labels.csv ofmt=csv
-	
+
 	./bin/systemml ./scripts/algorithms/LinearRegDS.dml -nvargs X=linRegData.train.data.csv Y=linRegData.train.labels.csv B=betas.csv fmt=csv
-	
+
 	./bin/systemml ./scripts/algorithms/GLM-predict.dml -nvargs X=linRegData.test.data.csv Y=linRegData.test.labels.csv B=betas.csv fmt=csv
 
 
-In this example, we've seen a small part of the capabilities of SystemML. For more detailed information, please 
+* * *
+
+## Conclusion and Next Steps
+
+In this example, we've seen a small part of the capabilities of SystemML. For more detailed information, please
 consult the [Apache SystemML (incubating)](http://systemml.apache.org/) website and the
 [SystemML Documentation](http://apache.github.io/incubator-systemml/) website on GitHub.
 
-