Posted to commits@systemml.apache.org by ni...@apache.org on 2017/07/05 18:17:30 UTC

systemml git commit: [SYSTEMML-540] Extended Caffe2DML to support image segmentation problems

Repository: systemml
Updated Branches:
  refs/heads/gh-pages 05792e0e9 -> 0ff267404


[SYSTEMML-540] Extended Caffe2DML to support image segmentation problems

- This commit extends Caffe2DML to support image segmentation problems
  and depthwise convolution, and includes a couple of bugfixes regarding
  loading an existing Caffe model.
- Additionally, we have added a summary() method to Caffe2DML to print
  the network.

Closes #527.


Project: http://git-wip-us.apache.org/repos/asf/systemml/repo
Commit: http://git-wip-us.apache.org/repos/asf/systemml/commit/0ff26740
Tree: http://git-wip-us.apache.org/repos/asf/systemml/tree/0ff26740
Diff: http://git-wip-us.apache.org/repos/asf/systemml/diff/0ff26740

Branch: refs/heads/gh-pages
Commit: 0ff267404237640807915ffbf6b7a49791bda6c5
Parents: 05792e0
Author: Niketan Pansare <np...@us.ibm.com>
Authored: Wed Jul 5 11:02:57 2017 -0700
Committer: Niketan Pansare <np...@us.ibm.com>
Committed: Wed Jul 5 11:02:57 2017 -0700

----------------------------------------------------------------------
 beginners-guide-caffe2dml.md | 534 ++++++++++++++++++++++++++++++++++----
 python-reference.md          |   4 +
 2 files changed, 487 insertions(+), 51 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/systemml/blob/0ff26740/beginners-guide-caffe2dml.md
----------------------------------------------------------------------
diff --git a/beginners-guide-caffe2dml.md b/beginners-guide-caffe2dml.md
index f15e025..7671c32 100644
--- a/beginners-guide-caffe2dml.md
+++ b/beginners-guide-caffe2dml.md
@@ -32,24 +32,14 @@ limitations under the License.
 Caffe2DML is an **experimental API** that converts a Caffe specification to DML. 
 It is designed to fit well into the mllearn framework and hence supports NumPy, Pandas, as well as PySpark DataFrames.
 
-## Examples
+### Training Lenet 
 
-### Train Lenet on MNIST dataset
-
-#### MNIST dataset
-
-The MNIST dataset was constructed from two datasets of the US National Institute of Standards and Technology (NIST). The training set consists of handwritten digits from 250 different people, 50 percent high school students, and 50 percent employees from the Census Bureau. Note that the test set contains handwritten digits from different people following the same split.
-In the below example, we are using mlxtend package to load the mnist dataset into Python NumPy arrays, but you are free to download it directly from http://yann.lecun.com/exdb/mnist/.
-
-```bash
-pip install mlxtend
-```
-
-#### Lenet network
-
-Lenet is a simple convolutional neural network, proposed by Yann LeCun in 1998. It has 2 convolutions/pooling and fully connected layer. 
+To create a Caffe2DML object, one needs to create a solver file and a network file that conform 
+to the [Caffe specification](http://caffe.berkeleyvision.org/).
+In this example, we will train Lenet, a simple convolutional neural network proposed by Yann LeCun in 1998. 
+It has two convolution/pooling layers followed by fully connected layers. 
 Similar to Caffe, the network has been modified to add dropout. 
-For more detail, please see http://yann.lecun.com/exdb/lenet/
+For more detail, please see [http://yann.lecun.com/exdb/lenet/](http://yann.lecun.com/exdb/lenet/).
 
 The [solver specification](https://raw.githubusercontent.com/apache/systemml/master/scripts/nn/examples/caffe2dml/models/mnist_lenet/lenet_solver.proto)
 specifies to Caffe2DML to use the following configuration when generating the training DML script:  
@@ -58,67 +48,167 @@ specifies to Caffe2DML to use following configuration when generating the traini
 - `display: 100`: Display training loss after every 100 iterations.
 - `test_interval: 500`: Display validation loss after every 500 iterations.
 - `test_iter: 10`: Validation data size = 10 * BATCH_SIZE.
- 
+
+```python
+from systemml.mllearn import Caffe2DML
+import urllib
+
+# Download the Lenet network
+urllib.urlretrieve('https://raw.githubusercontent.com/apache/systemml/master/scripts/nn/examples/caffe2dml/models/mnist_lenet/lenet.proto', 'lenet.proto')
+urllib.urlretrieve('https://raw.githubusercontent.com/apache/systemml/master/scripts/nn/examples/caffe2dml/models/mnist_lenet/lenet_solver.proto', 'lenet_solver.proto')
+# Train Lenet on MNIST using a scikit-learn like API
+
+# The MNIST dataset contains 28 X 28 gray-scale images (number of channels = 1).
+lenet = Caffe2DML(spark, solver='lenet_solver.proto', input_shape=(1, 28, 28))
+lenet.summary()
+```
+
+Output:
+
+```
++-----+---------------+--------------+------------+---------+-----------+---------+
+| Name|           Type|        Output|      Weight|     Bias|        Top|   Bottom|
++-----+---------------+--------------+------------+---------+-----------+---------+
+|mnist|           Data| (, 1, 28, 28)|            |         |mnist,mnist|         |
+|conv1|    Convolution|(, 32, 28, 28)|   [32 X 25]| [32 X 1]|      conv1|    mnist|
+|relu1|           ReLU|(, 32, 28, 28)|            |         |      relu1|    conv1|
+|pool1|        Pooling|(, 32, 14, 14)|            |         |      pool1|    relu1|
+|conv2|    Convolution|(, 64, 14, 14)|  [64 X 800]| [64 X 1]|      conv2|    pool1|
+|relu2|           ReLU|(, 64, 14, 14)|            |         |      relu2|    conv2|
+|pool2|        Pooling|  (, 64, 7, 7)|            |         |      pool2|    relu2|
+|  ip1|   InnerProduct| (, 512, 1, 1)|[3136 X 512]|[1 X 512]|        ip1|    pool2|
+|relu3|           ReLU| (, 512, 1, 1)|            |         |      relu3|      ip1|
+|drop1|        Dropout| (, 512, 1, 1)|            |         |      drop1|    relu3|
+|  ip2|   InnerProduct|  (, 10, 1, 1)|  [512 X 10]| [1 X 10]|        ip2|    drop1|
+| loss|SoftmaxWithLoss|  (, 10, 1, 1)|            |         |       loss|ip2,mnist|
++-----+---------------+--------------+------------+---------+-----------+---------+
+``` 
+
+To train the above Lenet model, we use the MNIST dataset. 
+The MNIST dataset was constructed from two datasets of the US National Institute of Standards and Technology (NIST). 
+The training set consists of handwritten digits from 250 different people, 50 percent high school students, and 50 percent employees from the Census Bureau. Note that the test set contains handwritten digits from different people following the same split.
+In this example, we use the mlxtend package to load the MNIST dataset into Python NumPy arrays, but you are free to download it directly from http://yann.lecun.com/exdb/mnist/.
+
+```bash
+pip install mlxtend
+```
+
+We first split the MNIST dataset into training and test sets.
 
 ```python
 from mlxtend.data import mnist_data
 import numpy as np
 from sklearn.utils import shuffle
-import urllib
-from systemml.mllearn import Caffe2DML
-
 # Download the MNIST dataset
 X, y = mnist_data()
 X, y = shuffle(X, y)
-
 # Split the data into training and test
 n_samples = len(X)
 X_train = X[:int(.9 * n_samples)]
 y_train = y[:int(.9 * n_samples)]
 X_test = X[int(.9 * n_samples):]
 y_test = y[int(.9 * n_samples):]
+```
 
-# Download the Lenet network
-urllib.urlretrieve('https://raw.githubusercontent.com/apache/systemml/master/scripts/nn/examples/caffe2dml/models/mnist_lenet/lenet.proto', 'lenet.proto')
-urllib.urlretrieve('https://raw.githubusercontent.com/apache/systemml/master/scripts/nn/examples/caffe2dml/models/mnist_lenet/lenet_solver.proto', 'lenet_solver.proto')
+Finally, we use the training and test datasets to perform training and prediction using a scikit-learn like API.
 
-# Train Lenet On MNIST using scikit-learn like API
-# MNIST dataset contains 28 X 28 gray-scale (number of channel=1).
-lenet = Caffe2DML(sqlCtx, solver='lenet_solver.proto', input_shape=(1, 28, 28))
+```python
+# Since Caffe2DML is an mllearn API, it provides a scikit-learn like method for training.
+lenet.fit(X_train, y_train)
+# Either perform prediction: lenet.predict(X_test) or scoring:
+lenet.score(X_test, y_test)
+```
 
-# debug=True prints will print the generated DML script along with classification report. Please donot test this flag in production.
-lenet.set(debug=True)
+Output:
+```
+Iter:100, training loss:0.189008481420049, training accuracy:92.1875
+Iter:200, training loss:0.21657020576713149, training accuracy:96.875
+Iter:300, training loss:0.05780939180052287, training accuracy:98.4375
+Iter:400, training loss:0.03406193840071965, training accuracy:100.0
+Iter:500, training loss:0.02847187709112875, training accuracy:100.0
+Iter:500, validation loss:222.736109642486, validation accuracy:96.49077868852459
+Iter:600, training loss:0.04867848427394318, training accuracy:96.875
+Iter:700, training loss:0.043060905384304224, training accuracy:98.4375
+Iter:800, training loss:0.01861298388336358, training accuracy:100.0
+Iter:900, training loss:0.03495462005933769, training accuracy:100.0
+Iter:1000, training loss:0.04598737325942163, training accuracy:98.4375
+Iter:1000, validation loss:180.04232316810746, validation accuracy:97.28483606557377
+Iter:1100, training loss:0.05630274512793694, training accuracy:98.4375
+Iter:1200, training loss:0.027278141291535066, training accuracy:98.4375
+Iter:1300, training loss:0.04356275106270366, training accuracy:98.4375
+Iter:1400, training loss:0.00780793048139091, training accuracy:100.0
+Iter:1500, training loss:0.004135965492374173, training accuracy:100.0
+Iter:1500, validation loss:156.61636761709374, validation accuracy:97.48975409836065
+Iter:1600, training loss:0.007939063305475983, training accuracy:100.0
+Iter:1700, training loss:0.0025769653351162196, training accuracy:100.0
+Iter:1800, training loss:0.0023251742357435204, training accuracy:100.0
+Iter:1900, training loss:0.0016795711023936644, training accuracy:100.0
+Iter:2000, training loss:0.03676045262879483, training accuracy:98.4375
+Iter:2000, validation loss:173.66147359346, validation accuracy:97.48975409836065
+0.97399999999999998
+```
 
-# If you want to see the statistics as well as the plan
-lenet.setStatistics(True).setExplain(True)
+### Additional Configuration
 
-# If you want to force GPU execution. Please make sure the required dependency are available.  
-# lenet.setGPU(True).setForceGPU(True)
-# Example usage of train_algo, test_algo. Assume 2 gpus on driver
-# lenet.set(train_algo="allreduce_parallel_batches", test_algo="minibatch", parallel_batches=2)
+- Print the generated DML script along with the classification report: `lenet.set(debug=True)`
+- Print the heavy-hitter instructions and the execution plan (advanced users): `lenet.setStatistics(True).setExplain(True)`
+- (Optional but recommended) Enable [native BLAS](http://apache.github.io/systemml/native-backend): `lenet.setConfigProperty("native.blas", "auto")`
+- Enable experimental features such as codegen: `lenet.setConfigProperty("codegen.enabled", "true").setConfigProperty("codegen.plancache", "true")`
+- Force GPU execution (please make sure the required jcuda dependencies are included): `lenet.setGPU(True).setForceGPU(True)`
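+
+As a minimal sketch, the configuration calls listed above can be combined on the `lenet` object from the earlier example before calling `fit` (all calls are taken from the list above; enable only the ones that apply to your setup):
+
+```python
+# A sketch combining the configuration options above; pick what applies to your setup
+lenet.set(debug=True)                           # print the generated DML script
+lenet.setStatistics(True).setExplain(True)      # heavy hitters and execution plan
+lenet.setConfigProperty("native.blas", "auto")  # native BLAS, if available
+# lenet.setGPU(True).setForceGPU(True)          # only if the jcuda dependencies are available
+lenet.fit(X_train, y_train)
+```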
 
-# (Optional but recommended) Enable native BLAS. 
-lenet.setConfigProperty("native.blas", "auto")
+Unlike Caffe, where the default train and test algorithm is `minibatch`, you can specify the
+algorithm using the parameters `train_algo` and `test_algo` (valid values are `minibatch`, `allreduce_parallel_batches`, 
+and `allreduce`). Here are some common settings (see the sketch after the table for a usage example):
 
-# In case you want to enable experimental feature such as codegen
-# lenet.setConfigProperty("codegen.enabled", "true").setConfigProperty("codegen.plancache", "true")
+|                                                                          | PySpark script                                                                                                                           | Changes to Network/Solver                                              |
+|--------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------|
+| Single-node CPU execution (similar to Caffe with solver_mode: CPU)       | `lenet.set(train_algo="minibatch", test_algo="minibatch")`                                                                               | Ensure that `batch_size` is set to an appropriate value (for example: 64) |
+| Single-node single-GPU execution                                         | `lenet.set(train_algo="minibatch", test_algo="minibatch").setGPU(True).setForceGPU(True)`                                                | Ensure that `batch_size` is set to an appropriate value (for example: 64) |
+| Single-node multi-GPU execution (similar to Caffe with solver_mode: GPU) | `lenet.set(train_algo="allreduce_parallel_batches", test_algo="minibatch", parallel_batches=num_gpu).setGPU(True).setForceGPU(True)`     | Ensure that `batch_size` is set to an appropriate value (for example: 64) |
+| Distributed prediction                                                   | `lenet.set(test_algo="allreduce")`                                                                                                       |                                                                        |
+| Distributed synchronous training                                         | `lenet.set(train_algo="allreduce_parallel_batches", parallel_batches=num_cluster_cores)`                                                 | Ensure that `batch_size` is set to an appropriate value (for example: 64) |
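+
+For example, here is a hedged sketch of the distributed synchronous training setting from the table above (the value of `num_cluster_cores` is an assumption; set it to the number of cores available across your Spark executors, and ensure that `batch_size` in `lenet_solver.proto` is set to an appropriate value, for example 64):
+
+```python
+# Distributed synchronous training (sketch): one parallel batch per cluster core
+num_cluster_cores = 32  # assumption: adjust to your cluster
+lenet.set(train_algo="allreduce_parallel_batches", parallel_batches=num_cluster_cores)
+lenet.fit(X_train, y_train)
+```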
 
-# Since Caffe2DML is a mllearn API, it allows for scikit-learn like method for training.
+### Saving the trained model
+
+```python
 lenet.fit(X_train, y_train)
-lenet.predict(X_test)
+lenet.save('trained_weights')
+new_lenet = Caffe2DML(spark, solver='lenet_solver.proto', input_shape=(1, 28, 28))
+new_lenet.load('trained_weights')
+new_lenet.score(X_test, y_test)
 ```
 
-For more detail on enabling native BLAS, please see the documentation for the [native backend](http://apache.github.io/systemml/native-backend).
+### Loading a pretrained caffemodel
 
-Common settings for `train_algo` and `test_algo` parameters:
+We provide a converter utility to convert a `.caffemodel` file trained using Caffe into the SystemML format.
 
-|                                                                          | PySpark script                                                                                                                           | Changes to Network/Solver                                              |
-|--------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------|
-| Single-node CPU execution (similar to Caffe with solver_mode: CPU)       | `caffe2dml.set(train_algo="minibatch", test_algo="minibatch")`                                                                           | Ensure that `batch_size` is set to appropriate value (for example: 64) |
-| Single-node single-GPU execution                                         | `caffe2dml.set(train_algo="minibatch", test_algo="minibatch").setGPU(True).setForceGPU(True)`                                            | Ensure that `batch_size` is set to appropriate value (for example: 64) |
-| Single-node multi-GPU execution (similar to Caffe with solver_mode: GPU) | `caffe2dml.set(train_algo="allreduce_parallel_batches", test_algo="minibatch", parallel_batches=num_gpu).setGPU(True).setForceGPU(True)` | Ensure that `batch_size` is set to appropriate value (for example: 64) |
-| Distributed prediction                                                   | `caffe2dml.set(test_algo="allreduce")`                                                                                                   |                                                                        |
-| Distributed synchronous training                                         | `caffe2dml.set(train_algo="allreduce_parallel_batches", parallel_batches=num_cluster_cores)`                                             | Ensure that `batch_size` is set to appropriate value (for example: 64) |
+```python
+# First download deploy file and caffemodel
+import urllib
+urllib.urlretrieve('https://raw.githubusercontent.com/apache/systemml/master/scripts/nn/examples/caffe2dml/models/imagenet/vgg19/VGG_ILSVRC_19_layers_deploy.proto', 'VGG_ILSVRC_19_layers_deploy.proto')
+urllib.urlretrieve('http://www.robots.ox.ac.uk/~vgg/software/very_deep/caffe/VGG_ILSVRC_19_layers.caffemodel', 'VGG_ILSVRC_19_layers.caffemodel')
+# Save the weights into trained_vgg_weights directory
+import systemml as sml
+sml.convert_caffemodel(sc, 'VGG_ILSVRC_19_layers_deploy.proto', 'VGG_ILSVRC_19_layers.caffemodel',  'trained_vgg_weights')
+```
+
+We can then use the `trained_vgg_weights` directory for performing prediction or fine-tuning.
+
+```python
+import os
+# Download the VGG network
+urllib.urlretrieve('https://raw.githubusercontent.com/apache/systemml/master/scripts/nn/examples/caffe2dml/models/imagenet/vgg19/VGG_ILSVRC_19_layers_network.proto', 'VGG_ILSVRC_19_layers_network.proto')
+urllib.urlretrieve('https://raw.githubusercontent.com/apache/systemml/master/scripts/nn/examples/caffe2dml/models/imagenet/vgg19/VGG_ILSVRC_19_layers_solver.proto', 'VGG_ILSVRC_19_layers_solver.proto')
+# Storing the labels.txt in the weights directory allows predict to return a label (for example: 'cougar, puma, catamount, mountain lion, painter, panther, Felis concolor') rather than the column index of one-hot encoded vector (for example: 287).
+urllib.urlretrieve('https://raw.githubusercontent.com/apache/systemml/master/scripts/nn/examples/caffe2dml/models/imagenet/labels.txt', os.path.join('trained_vgg_weights', 'labels.txt'))
+from systemml.mllearn import Caffe2DML
+vgg = Caffe2DML(sqlCtx, solver='VGG_ILSVRC_19_layers_solver.proto', input_shape=(3, 224, 224))
+vgg.load('trained_vgg_weights')
+# We can then perform prediction:
+from PIL import Image
+X_test = sml.convertImageToNumPyArr(Image.open('test.jpg'), img_shape=(3, 224, 224))
+vgg.predict(X_test)
+# OR Fine-Tuning: vgg.fit(X_train, y_train)
+```
 
 ## Frequently asked questions
 
@@ -291,16 +381,358 @@ train_df = sc.parallelize(list_jpeg_files, int(len(list_jpeg_files)/10)).map(lam
 train_df.write.parquet('kaggle-cats-dogs.parquet')
 ```
 
+An alternative way to load images into a PySpark DataFrame for prediction is to use MLlib's LabeledPoint class:
+
+```python
+from pyspark.mllib.regression import LabeledPoint
+from pyspark.mllib.util import MLUtils
+list_jpeg_files = os.listdir(train_dir)
+train_df = sc.parallelize(list_jpeg_files, int(len(list_jpeg_files)/10)).map(lambda filename : LabeledPoint(0, sml.convertImageToNumPyArr(Image.open(os.path.join(train_dir, filename)), img_shape=img_shape)[0,:])).toDF().select('features')
+# Note: convertVectorColumnsToML has an additional serialization cost
+train_df = MLUtils.convertVectorColumnsToML(train_df)
+```
+ 
+
 #### Can I use Caffe2DML via Scala ?
 
 Though we recommend using Caffe2DML via its Python interfaces, it is possible to use it by creating an object of the class
 `org.apache.sysml.api.dl.Caffe2DML`. It is important to note that Caffe2DML's Scala API is packaged in `systemml-*-extra.jar`.
 
+#### How can I get summary information of my network ?
+ 
+
+```python
+lenet.summary()
+```
+
+Output:
+
+```
++-----+---------------+--------------+------------+---------+-----------+---------+
+| Name|           Type|        Output|      Weight|     Bias|        Top|   Bottom|
++-----+---------------+--------------+------------+---------+-----------+---------+
+|mnist|           Data| (, 1, 28, 28)|            |         |mnist,mnist|         |
+|conv1|    Convolution|(, 32, 28, 28)|   [32 X 25]| [32 X 1]|      conv1|    mnist|
+|relu1|           ReLU|(, 32, 28, 28)|            |         |      relu1|    conv1|
+|pool1|        Pooling|(, 32, 14, 14)|            |         |      pool1|    relu1|
+|conv2|    Convolution|(, 64, 14, 14)|  [64 X 800]| [64 X 1]|      conv2|    pool1|
+|relu2|           ReLU|(, 64, 14, 14)|            |         |      relu2|    conv2|
+|pool2|        Pooling|  (, 64, 7, 7)|            |         |      pool2|    relu2|
+|  ip1|   InnerProduct| (, 512, 1, 1)|[3136 X 512]|[1 X 512]|        ip1|    pool2|
+|relu3|           ReLU| (, 512, 1, 1)|            |         |      relu3|      ip1|
+|drop1|        Dropout| (, 512, 1, 1)|            |         |      drop1|    relu3|
+|  ip2|   InnerProduct|  (, 10, 1, 1)|  [512 X 10]| [1 X 10]|        ip2|    drop1|
+| loss|SoftmaxWithLoss|  (, 10, 1, 1)|            |         |       loss|ip2,mnist|
++-----+---------------+--------------+------------+---------+-----------+---------+
+``` 
 
 #### How can I view the script generated by Caffe2DML ?
 
 To view the generated DML script (and additional debugging information), please set the `debug` parameter to True.
 
 ```python
-caffe2dmlObject.set(debug=True)
+lenet.set(debug=True)
+```
+
+Output:
+```
+001|debug = TRUE
+002|source("nn/layers/softmax.dml") as softmax
+003|source("nn/layers/cross_entropy_loss.dml") as cross_entropy_loss
+004|source("nn/layers/conv2d_builtin.dml") as conv2d_builtin
+005|source("nn/layers/relu.dml") as relu
+006|source("nn/layers/max_pool2d_builtin.dml") as max_pool2d_builtin
+007|source("nn/layers/affine.dml") as affine
+008|source("nn/layers/dropout.dml") as dropout
+009|source("nn/optim/sgd_momentum.dml") as sgd_momentum
+010|source("nn/layers/l2_reg.dml") as l2_reg
+011|X_full_path = ifdef($X, " ")
+012|X_full = read(X_full_path)
+013|y_full_path = ifdef($y, " ")
+014|y_full = read(y_full_path)
+015|num_images = nrow(y_full)
+016|# Convert to one-hot encoding (Assumption: 1-based labels)
+017|y_full = table(seq(1,num_images,1), y_full, num_images, 10)
+018|weights = ifdef($weights, " ")
+019|# Initialize the layers and solvers
+020|X_full = X_full * 0.00390625
+021|BATCH_SIZE = 64
+022|[conv1_weight,conv1_bias] = conv2d_builtin::init(32,1,5,5)
+023|[conv2_weight,conv2_bias] = conv2d_builtin::init(64,32,5,5)
+024|[ip1_weight,ip1_bias] = affine::init(3136,512)
+025|[ip2_weight,ip2_bias] = affine::init(512,10)
+026|conv1_weight_v = sgd_momentum::init(conv1_weight)
+027|conv1_bias_v = sgd_momentum::init(conv1_bias)
+028|conv2_weight_v = sgd_momentum::init(conv2_weight)
+029|conv2_bias_v = sgd_momentum::init(conv2_bias)
+030|ip1_weight_v = sgd_momentum::init(ip1_weight)
+031|ip1_bias_v = sgd_momentum::init(ip1_bias)
+032|ip2_weight_v = sgd_momentum::init(ip2_weight)
+033|ip2_bias_v = sgd_momentum::init(ip2_bias)
+034|num_validation = 10 * BATCH_SIZE
+035|# Sanity check to ensure that validation set is not too large
+036|if(num_validation > ceil(0.3 * num_images)) {
+037|    max_test_iter = floor(ceil(0.3 * num_images) / BATCH_SIZE)
+038|    stop("Too large validation size. Please reduce test_iter to " + max_test_iter)
+039|}
+040|X = X_full[(num_validation+1):num_images,]; y = y_full[(num_validation+1):num_images,]; X_val = X_full[1:num_validation,]; y_val = y_full[1:num_validation,]; num_images = nrow(y)
+041|num_iters_per_epoch = ceil(num_images / BATCH_SIZE)
+042|max_epochs = ceil(2000 / num_iters_per_epoch)
+043|iter = 0
+044|lr = 0.01
+045|for(e in 1:max_epochs) {
+046|    for(i in 1:num_iters_per_epoch) {
+047|            beg = ((i-1) * BATCH_SIZE) %% num_images + 1; end = min(beg + BATCH_SIZE - 1, num_images); Xb = X[beg:end,]; yb = y[beg:end,];
+048|            iter = iter + 1
+049|            # Perform forward pass
+050|            [out3,ignoreHout_3,ignoreWout_3] = conv2d_builtin::forward(Xb,conv1_weight,conv1_bias,1,28,28,5,5,1,1,2,2)
+051|            out4 = relu::forward(out3)
+052|            [out5,ignoreHout_5,ignoreWout_5] = max_pool2d_builtin::forward(out4,32,28,28,2,2,2,2,0,0)
+053|            [out6,ignoreHout_6,ignoreWout_6] = conv2d_builtin::forward(out5,conv2_weight,conv2_bias,32,14,14,5,5,1,1,2,2)
+054|            out7 = relu::forward(out6)
+055|            [out8,ignoreHout_8,ignoreWout_8] = max_pool2d_builtin::forward(out7,64,14,14,2,2,2,2,0,0)
+056|            out9 = affine::forward(out8,ip1_weight,ip1_bias)
+057|            out10 = relu::forward(out9)
+058|            [out11,mask11] = dropout::forward(out10,0.5,-1)
+059|            out12 = affine::forward(out11,ip2_weight,ip2_bias)
+060|            out13 = softmax::forward(out12)
+061|            # Perform backward pass
+062|            dProbs = cross_entropy_loss::backward(out13,yb); dOut13 = softmax::backward(dProbs,out12); dOut13_12 = dOut13; dOut13_2 = dOut13;
+063|            [dOut12,ip2_dWeight,ip2_dBias] = affine::backward(dOut13_12,out11,ip2_weight,ip2_bias); dOut12_11 = dOut12;
+064|            dOut11 = dropout::backward(dOut12_11,out10,0.5,mask11); dOut11_10 = dOut11;
+065|            dOut10 = relu::backward(dOut11_10,out9); dOut10_9 = dOut10;
+066|            [dOut9,ip1_dWeight,ip1_dBias] = affine::backward(dOut10_9,out8,ip1_weight,ip1_bias); dOut9_8 = dOut9;
+067|            dOut8 = max_pool2d_builtin::backward(dOut9_8,7,7,out7,64,14,14,2,2,2,2,0,0); dOut8_7 = dOut8;
+068|            dOut7 = relu::backward(dOut8_7,out6); dOut7_6 = dOut7;
+069|            [dOut6,conv2_dWeight,conv2_dBias] = conv2d_builtin::backward(dOut7_6,14,14,out5,conv2_weight,conv2_bias,32,14,14,5,5,1,1,2,2); dOut6_5 = dOut6;
+070|            dOut5 = max_pool2d_builtin::backward(dOut6_5,14,14,out4,32,28,28,2,2,2,2,0,0); dOut5_4 = dOut5;
+071|            dOut4 = relu::backward(dOut5_4,out3); dOut4_3 = dOut4;
+072|            [dOut3,conv1_dWeight,conv1_dBias] = conv2d_builtin::backward(dOut4_3,28,28,Xb,conv1_weight,conv1_bias,1,28,28,5,5,1,1,2,2); dOut3_2 = dOut3;
+073|            # Update the parameters
+074|            conv1_dWeight_reg = l2_reg::backward(conv1_weight, 5.000000237487257E-4)
+075|            conv1_dWeight = conv1_dWeight + conv1_dWeight_reg
+076|            [conv1_weight,conv1_weight_v] = sgd_momentum::update(conv1_weight,conv1_dWeight,(lr * 1.0),0.8999999761581421,conv1_weight_v)
+077|            [conv1_bias,conv1_bias_v] = sgd_momentum::update(conv1_bias,conv1_dBias,(lr * 2.0),0.8999999761581421,conv1_bias_v)
+078|            conv2_dWeight_reg = l2_reg::backward(conv2_weight, 5.000000237487257E-4)
+079|            conv2_dWeight = conv2_dWeight + conv2_dWeight_reg
+080|            [conv2_weight,conv2_weight_v] = sgd_momentum::update(conv2_weight,conv2_dWeight,(lr * 1.0),0.8999999761581421,conv2_weight_v)
+081|            [conv2_bias,conv2_bias_v] = sgd_momentum::update(conv2_bias,conv2_dBias,(lr * 2.0),0.8999999761581421,conv2_bias_v)
+082|            ip1_dWeight_reg = l2_reg::backward(ip1_weight, 5.000000237487257E-4)
+083|            ip1_dWeight = ip1_dWeight + ip1_dWeight_reg
+084|            [ip1_weight,ip1_weight_v] = sgd_momentum::update(ip1_weight,ip1_dWeight,(lr * 1.0),0.8999999761581421,ip1_weight_v)
+085|            [ip1_bias,ip1_bias_v] = sgd_momentum::update(ip1_bias,ip1_dBias,(lr * 2.0),0.8999999761581421,ip1_bias_v)
+086|            ip2_dWeight_reg = l2_reg::backward(ip2_weight, 5.000000237487257E-4)
+087|            ip2_dWeight = ip2_dWeight + ip2_dWeight_reg
+088|            [ip2_weight,ip2_weight_v] = sgd_momentum::update(ip2_weight,ip2_dWeight,(lr * 1.0),0.8999999761581421,ip2_weight_v)
+089|            [ip2_bias,ip2_bias_v] = sgd_momentum::update(ip2_bias,ip2_dBias,(lr * 2.0),0.8999999761581421,ip2_bias_v)
+090|            # Compute training loss & accuracy
+091|            if(iter  %% 100 == 0) {
+092|                    loss = 0
+093|                    accuracy = 0
+094|                    tmp_loss = cross_entropy_loss::forward(out13,yb)
+095|                    loss = loss + tmp_loss
+096|                    true_yb = rowIndexMax(yb)
+097|                    predicted_yb = rowIndexMax(out13)
+098|                    accuracy = mean(predicted_yb == true_yb)*100
+099|                    training_loss = loss
+100|                    training_accuracy = accuracy
+101|                    print("Iter:" + iter + ", training loss:" + training_loss + ", training accuracy:" + training_accuracy)
+102|                    if(debug) {
+103|                            num_rows_error_measures = min(10, ncol(yb))
+104|                            error_measures = matrix(0, rows=num_rows_error_measures, cols=5)
+105|                            for(class_i in 1:num_rows_error_measures) {
+106|                                    tp = sum( (true_yb == predicted_yb) * (true_yb == class_i) )
+107|                                    tp_plus_fp = sum( (predicted_yb == class_i) )
+108|                                    tp_plus_fn = sum( (true_yb == class_i) )
+109|                                    precision = tp / tp_plus_fp
+110|                                    recall = tp / tp_plus_fn
+111|                                    f1Score = 2*precision*recall / (precision+recall)
+112|                                    error_measures[class_i,1] = class_i
+113|                                    error_measures[class_i,2] = precision
+114|                                    error_measures[class_i,3] = recall
+115|                                    error_measures[class_i,4] = f1Score
+116|                                    error_measures[class_i,5] = tp_plus_fn
+117|                            }
+118|                            print("class    \tprecision\trecall  \tf1-score\tnum_true_labels\n" + toString(error_measures, decimal=7, sep="\t"))
+119|                    }
+120|            }
+121|            # Compute validation loss & accuracy
+122|            if(iter  %% 500 == 0) {
+123|                    loss = 0
+124|                    accuracy = 0
+125|                    validation_loss = 0
+126|                    validation_accuracy = 0
+127|                    for(iVal in 1:num_iters_per_epoch) {
+128|                            beg = ((iVal-1) * BATCH_SIZE) %% num_validation + 1; end = min(beg + BATCH_SIZE - 1, num_validation); Xb = X_val[beg:end,]; yb = y_val[beg:end,];
+129|                            # Perform forward pass
+130|                            [out3,ignoreHout_3,ignoreWout_3] = conv2d_builtin::forward(Xb,conv1_weight,conv1_bias,1,28,28,5,5,1,1,2,2)
+131|                            out4 = relu::forward(out3)
+132|                            [out5,ignoreHout_5,ignoreWout_5] = max_pool2d_builtin::forward(out4,32,28,28,2,2,2,2,0,0)
+133|                            [out6,ignoreHout_6,ignoreWout_6] = conv2d_builtin::forward(out5,conv2_weight,conv2_bias,32,14,14,5,5,1,1,2,2)
+134|                            out7 = relu::forward(out6)
+135|                            [out8,ignoreHout_8,ignoreWout_8] = max_pool2d_builtin::forward(out7,64,14,14,2,2,2,2,0,0)
+136|                            out9 = affine::forward(out8,ip1_weight,ip1_bias)
+137|                            out10 = relu::forward(out9)
+138|                            [out11,mask11] = dropout::forward(out10,0.5,-1)
+139|                            out12 = affine::forward(out11,ip2_weight,ip2_bias)
+140|                            out13 = softmax::forward(out12)
+141|                            tmp_loss = cross_entropy_loss::forward(out13,yb)
+142|                            loss = loss + tmp_loss
+143|                            true_yb = rowIndexMax(yb)
+144|                            predicted_yb = rowIndexMax(out13)
+145|                            accuracy = mean(predicted_yb == true_yb)*100
+146|                            validation_loss = validation_loss + loss
+147|                            validation_accuracy = validation_accuracy + accuracy
+148|                    }
+149|                    validation_accuracy = validation_accuracy / num_iters_per_epoch
+150|                    print("Iter:" + iter + ", validation loss:" + validation_loss + ", validation accuracy:" + validation_accuracy)
+151|            }
+152|    }
+153|    # Learning rate
+154|    lr = (0.009999999776482582 * 0.949999988079071^e)
+155|}
+
+Iter:100, training loss:0.24014199350958168, training accuracy:87.5
+class           precision       recall          f1-score        num_true_labels
+1.0000000       1.0000000       1.0000000       1.0000000       3.0000000
+2.0000000       1.0000000       1.0000000       1.0000000       8.0000000
+3.0000000       0.8888889       0.8888889       0.8888889       9.0000000
+4.0000000       0.7500000       0.7500000       0.7500000       4.0000000
+5.0000000       0.7500000       1.0000000       0.8571429       3.0000000
+6.0000000       0.8333333       1.0000000       0.9090909       5.0000000
+7.0000000       1.0000000       1.0000000       1.0000000       8.0000000
+8.0000000       0.8571429       0.7500000       0.8000000       8.0000000
+9.0000000       1.0000000       0.5714286       0.7272727       7.0000000
+10.0000000      0.7272727       0.8888889       0.8000000       9.0000000
+
+Iter:200, training loss:0.09555593867171894, training accuracy:98.4375
+class           precision       recall          f1-score        num_true_labels
+1.0000000       1.0000000       1.0000000       1.0000000       10.0000000
+2.0000000       1.0000000       1.0000000       1.0000000       3.0000000
+3.0000000       1.0000000       1.0000000       1.0000000       9.0000000
+4.0000000       1.0000000       1.0000000       1.0000000       6.0000000
+5.0000000       1.0000000       1.0000000       1.0000000       7.0000000
+6.0000000       1.0000000       1.0000000       1.0000000       8.0000000
+7.0000000       1.0000000       0.6666667       0.8000000       3.0000000
+8.0000000       1.0000000       1.0000000       1.0000000       9.0000000
+9.0000000       0.8571429       1.0000000       0.9230769       6.0000000
+10.0000000      1.0000000       1.0000000       1.0000000       3.0000000
+
+Iter:300, training loss:0.058686794512570216, training accuracy:98.4375
+class           precision       recall          f1-score        num_true_labels
+1.0000000       1.0000000       1.0000000       1.0000000       6.0000000
+2.0000000       1.0000000       1.0000000       1.0000000       9.0000000
+3.0000000       1.0000000       1.0000000       1.0000000       4.0000000
+4.0000000       1.0000000       1.0000000       1.0000000       8.0000000
+5.0000000       1.0000000       1.0000000       1.0000000       6.0000000
+6.0000000       1.0000000       0.8750000       0.9333333       8.0000000
+7.0000000       1.0000000       1.0000000       1.0000000       5.0000000
+8.0000000       1.0000000       1.0000000       1.0000000       2.0000000
+9.0000000       0.8888889       1.0000000       0.9411765       8.0000000
+10.0000000      1.0000000       1.0000000       1.0000000       8.0000000
+
+Iter:400, training loss:0.08742103541529415, training accuracy:96.875
+class           precision       recall          f1-score        num_true_labels
+1.0000000       1.0000000       1.0000000       1.0000000       6.0000000
+2.0000000       0.8000000       1.0000000       0.8888889       8.0000000
+3.0000000       1.0000000       0.8333333       0.9090909       6.0000000
+4.0000000       1.0000000       1.0000000       1.0000000       4.0000000
+5.0000000       1.0000000       1.0000000       1.0000000       4.0000000
+6.0000000       1.0000000       1.0000000       1.0000000       6.0000000
+7.0000000       1.0000000       1.0000000       1.0000000       7.0000000
+8.0000000       1.0000000       1.0000000       1.0000000       6.0000000
+9.0000000       1.0000000       1.0000000       1.0000000       4.0000000
+10.0000000      1.0000000       0.9230769       0.9600000       13.0000000
+
+Iter:500, training loss:0.05873836245880005, training accuracy:98.4375
+class           precision       recall          f1-score        num_true_labels
+1.0000000       1.0000000       1.0000000       1.0000000       3.0000000
+2.0000000       1.0000000       1.0000000       1.0000000       5.0000000
+3.0000000       1.0000000       1.0000000       1.0000000       6.0000000
+4.0000000       1.0000000       1.0000000       1.0000000       9.0000000
+5.0000000       1.0000000       1.0000000       1.0000000       4.0000000
+6.0000000       1.0000000       0.8571429       0.9230769       7.0000000
+7.0000000       0.8571429       1.0000000       0.9230769       6.0000000
+8.0000000       1.0000000       1.0000000       1.0000000       9.0000000
+9.0000000       1.0000000       1.0000000       1.0000000       10.0000000
+10.0000000      1.0000000       1.0000000       1.0000000       5.0000000
+
+Iter:500, validation loss:260.1580978627665, validation accuracy:96.43954918032787
+Iter:600, training loss:0.07584116043829209, training accuracy:98.4375
+class           precision       recall          f1-score        num_true_labels
+1.0000000       1.0000000       1.0000000       1.0000000       8.0000000
+2.0000000       1.0000000       1.0000000       1.0000000       4.0000000
+3.0000000       1.0000000       1.0000000       1.0000000       4.0000000
+4.0000000       1.0000000       1.0000000       1.0000000       4.0000000
+5.0000000       1.0000000       1.0000000       1.0000000       5.0000000
+6.0000000       1.0000000       1.0000000       1.0000000       8.0000000
+7.0000000       1.0000000       1.0000000       1.0000000       8.0000000
+8.0000000       1.0000000       0.9230769       0.9600000       13.0000000
+9.0000000       1.0000000       1.0000000       1.0000000       5.0000000
+10.0000000      0.8333333       1.0000000       0.9090909       5.0000000
+
+Iter:700, training loss:0.07973166944626336, training accuracy:98.4375
+class           precision       recall          f1-score        num_true_labels
+1.0000000       1.0000000       1.0000000       1.0000000       5.0000000
+2.0000000       1.0000000       1.0000000       1.0000000       4.0000000
+3.0000000       1.0000000       1.0000000       1.0000000       6.0000000
+4.0000000       1.0000000       1.0000000       1.0000000       4.0000000
+5.0000000       1.0000000       1.0000000       1.0000000       5.0000000
+6.0000000       1.0000000       1.0000000       1.0000000       6.0000000
+7.0000000       1.0000000       1.0000000       1.0000000       10.0000000
+8.0000000       0.8000000       1.0000000       0.8888889       4.0000000
+9.0000000       1.0000000       1.0000000       1.0000000       8.0000000
+10.0000000      1.0000000       0.9166667       0.9565217       12.0000000
+
+Iter:800, training loss:0.0063778595034221855, training accuracy:100.0
+class           precision       recall          f1-score        num_true_labels
+1.0000000       1.0000000       1.0000000       1.0000000       9.0000000
+2.0000000       1.0000000       1.0000000       1.0000000       6.0000000
+3.0000000       1.0000000       1.0000000       1.0000000       7.0000000
+4.0000000       1.0000000       1.0000000       1.0000000       7.0000000
+5.0000000       1.0000000       1.0000000       1.0000000       4.0000000
+6.0000000       1.0000000       1.0000000       1.0000000       9.0000000
+7.0000000       1.0000000       1.0000000       1.0000000       6.0000000
+8.0000000       1.0000000       1.0000000       1.0000000       8.0000000
+9.0000000       1.0000000       1.0000000       1.0000000       2.0000000
+10.0000000      1.0000000       1.0000000       1.0000000       6.0000000
+
+Iter:900, training loss:0.019673112167879484, training accuracy:100.0
+class           precision       recall          f1-score        num_true_labels
+1.0000000       1.0000000       1.0000000       1.0000000       3.0000000
+2.0000000       1.0000000       1.0000000       1.0000000       4.0000000
+3.0000000       1.0000000       1.0000000       1.0000000       3.0000000
+4.0000000       1.0000000       1.0000000       1.0000000       5.0000000
+5.0000000       1.0000000       1.0000000       1.0000000       6.0000000
+6.0000000       1.0000000       1.0000000       1.0000000       10.0000000
+7.0000000       1.0000000       1.0000000       1.0000000       7.0000000
+8.0000000       1.0000000       1.0000000       1.0000000       7.0000000
+9.0000000       1.0000000       1.0000000       1.0000000       12.0000000
+10.0000000      1.0000000       1.0000000       1.0000000       7.0000000
+
+Iter:1000, training loss:0.06137978002508307, training accuracy:96.875
+class           precision       recall          f1-score        num_true_labels
+1.0000000       1.0000000       1.0000000       1.0000000       5.0000000
+2.0000000       1.0000000       1.0000000       1.0000000       7.0000000
+3.0000000       1.0000000       1.0000000       1.0000000       8.0000000
+4.0000000       0.8333333       0.8333333       0.8333333       6.0000000
+5.0000000       1.0000000       1.0000000       1.0000000       5.0000000
+6.0000000       1.0000000       1.0000000       1.0000000       10.0000000
+7.0000000       1.0000000       1.0000000       1.0000000       3.0000000
+8.0000000       0.8888889       0.8888889       0.8888889       9.0000000
+9.0000000       1.0000000       1.0000000       1.0000000       7.0000000
+10.0000000      1.0000000       1.0000000       1.0000000       4.0000000
+
+Iter:1000, validation loss:238.62301345198944, validation accuracy:97.02868852459017
+Iter:1100, training loss:0.023325103696013115, training accuracy:100.0
+class           precision       recall          f1-score        num_true_labels
+1.0000000       1.0000000       1.0000000       1.0000000       4.0000000
+2.0000000       1.0000000       1.0000000       1.0000000       10.0000000
+3.0000000       1.0000000       1.0000000       1.0000000       6.0000000
+4.0000000       1.0000000       1.0000000       1.0000000       4.0000000
+5.0000000       1.0000000       1.0000000       1.0000000       2.0000000
+6.0000000       1.0000000       1.0000000       1.0000000       10.0000000
+7.0000000       1.0000000       1.0000000       1.0000000       7.0000000
+8.0000000       1.0000000       1.0000000       1.0000000       6.0000000
+9.0000000       1.0000000       1.0000000       1.0000000       9.0000000
+10.0000000      1.0000000       1.0000000       1.0000000       6.0000000
+...
 ```

http://git-wip-us.apache.org/repos/asf/systemml/blob/0ff26740/python-reference.md
----------------------------------------------------------------------
diff --git a/python-reference.md b/python-reference.md
index 7de3fb0..119c1d0 100644
--- a/python-reference.md
+++ b/python-reference.md
@@ -406,6 +406,10 @@ model.transform(df_test)
 </div>
 </div>
 
+Please note that when training using the mllearn API (i.e. `model.fit(X_df)`), SystemML 
+expects that the labels have already been converted to 1-based values.
+This avoids unnecessary decoding overhead for large datasets if the label column has already been decoded.
+For the scikit-learn API, there is no such requirement.
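+
+For example, here is a minimal sketch (assuming a PySpark DataFrame `df_train` whose `label` column holds 0-based integer classes; both names are illustrative assumptions) that shifts the labels to 1-based values before training:
+
+```python
+from pyspark.sql.functions import col
+
+# Shift 0-based labels to 1-based values before calling fit (sketch; column name is an assumption)
+df_train_1based = df_train.withColumn("label", col("label") + 1)
+model.fit(df_train_1based)
+```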
 
 The table below describes the parameters available for the mllearn algorithms: