Posted to commits@systemml.apache.org by ni...@apache.org on 2019/03/27 15:57:15 UTC

[systemml] branch master updated: [SYSTEMML-540] Updated Keras2DML to match Keras API and improved rmvar performance

This is an automated email from the ASF dual-hosted git repository.

niketanpansare pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/systemml.git


The following commit(s) were added to refs/heads/master by this push:
     new 8857b89  [SYSTEMML-540] Updated Keras2DML to match Keras API and improved rmvar performance
8857b89 is described below

commit 8857b8925004c369c39d969c4312472a0119d425
Author: Niketan Pansare <np...@us.ibm.com>
AuthorDate: Wed Mar 27 08:57:02 2019 -0700

    [SYSTEMML-540] Updated Keras2DML to match Keras API and improved rmvar performance
    
    - Improved performance of rmvar by refactoring LocalVariableMap. With this change, the end-to-end performance of a sample run of ResNet-200 improved from 55 seconds to 31 seconds.
    - The parameters batch_size, max_iter, test_iter, test_interval and display of the Keras2DML constructor are removed. Instead, batch_size, epochs, validation_split and validation_data are now passed as parameters of the fit() method.
    - Updated the Caffe2DML generator to include the above parameters.
    - Updated the documentation.
    
    Closes #859.
---
 docs/beginners-guide-keras2dml.md                  | 165 ++-----
 docs/gpu.md                                        |  12 +-
 docs/index.md                                      |   4 +-
 docs/reference-guide-caffe2dml.md                  |  68 +++
 ...e-keras2dml.md => reference-guide-keras2dml.md} | 142 +-----
 .../runtime/controlprogram/LocalVariableMap.java   |  75 ++-
 src/main/python/systemml/mllearn/estimators.py     |  41 +-
 src/main/python/tests/test_nn_numpy.py             |   4 +-
 .../scala/org/apache/sysml/api/dl/Caffe2DML.scala  | 523 ++++++++++++++-------
 .../scala/org/apache/sysml/api/dl/CaffeLayer.scala |  11 +-
 .../org/apache/sysml/api/dl/CaffeSolver.scala      |   1 +
 .../sysml/api/ml/BaseSystemMLClassifier.scala      |  15 +-
 12 files changed, 610 insertions(+), 451 deletions(-)

diff --git a/docs/beginners-guide-keras2dml.md b/docs/beginners-guide-keras2dml.md
index 2259397..788a489 100644
--- a/docs/beginners-guide-keras2dml.md
+++ b/docs/beginners-guide-keras2dml.md
@@ -27,34 +27,23 @@ limitations under the License.
 
 <br/>
 
-## Introduction
+# Introduction
 
-Keras2DML is an **experimental API** that converts a Keras specification to DML through the intermediate Caffe2DML module. 
+Keras2DML converts a Keras specification to DML through the intermediate Caffe2DML module. 
 It is designed to fit well into the mllearn framework and hence supports NumPy, Pandas as well as PySpark DataFrame.
 
-### Getting Started 
-
-To create a Keras2DML object, one needs to create a Keras model through the Funcitonal API. please see the [Functional API.](https://keras.io/models/model/)
-This module utilizes the existing [Caffe2DML](beginners-guide-caffe2dml) backend to convert Keras models into DML. Keras models are 
-parsed and translated into Caffe prototext and caffemodel files which are then piped into Caffe2DML. Thus one can follow the Caffe2DML
-documentation for further information.
-
-### Model Conversion
-
-Keras models are parsed based on their layer structure and corresponding weights and translated into the relative Caffe layer and weight
-configuration. Be aware that currently this is a translation into Caffe and there will be loss of information from keras models such as 
-intializer information, and other layers which do not exist in Caffe. 
-
 First, install SystemML and other dependencies for the below demo:
 
 ```
-pip install systemml keras tensorflow mlxtend
+pip install systemml keras tensorflow
 ``` 
 
 To create a Keras2DML object, simply pass the keras object to the Keras2DML constructor. It's also important to note that your models
 should be compiled so that the loss can be accessed for Caffe2DML.
 
+# Training Lenet on the MNIST dataset
 
+Download the MNIST dataset using [mlxtend package](https://pypi.python.org/pypi/mlxtend).
 
 ```python
 # pyspark --driver-memory 20g
@@ -115,130 +104,34 @@ sysml_model.fit(X_train, y_train)
 sysml_model.score(X_test, y_test)
 ```
 
-# Frequently asked questions
-
-#### How can I get the training and prediction DML script for the Keras model?
-
-The training and prediction DML scripts can be generated using `get_training_script()` and `get_prediction_script()` methods.
+# Prediction using a pretrained ResNet-50
 
 ```python
-from systemml.mllearn import Keras2DML
-sysml_model = Keras2DML(spark, keras_model, input_shape=(3,224,224))
-print(sysml_model.get_training_script())
-```
-
-#### What is the mapping between Keras' parameters and Caffe's solver specification ? 
-
-|                                                        | Specified via the given parameter in the Keras2DML constructor | From input Keras' model                                                                 | Corresponding parameter in the Caffe solver file |
-|--------------------------------------------------------|----------------------------------------------------------------|-----------------------------------------------------------------------------------------|--------------------------------------------------|
-| Solver type                                            |                                                                | `type(keras_model.optimizer)`. Supported types: `keras.optimizers.{SGD, Adagrad, Adam}` | `type`                                           |
-| Maximum number of iterations                           | `max_iter`                                                     | The `epoch` parameter in the `fit` method is not supported.                             | `max_iter`                                       |
-| Validation dataset                                     | `test_iter` (explained in the below section)                   | The `validation_data` parameter in the `fit` method is not supported.                   | `test_iter`                                      |
-| Monitoring the loss                                    | `display, test_interval` (explained in the below section)      | The `LossHistory` callback in the `fit` method is not supported.                        | `display, test_interval`                         |
-| Learning rate schedule                                 | `lr_policy`                                                    | The `LearningRateScheduler` callback in the `fit` method is not supported.              | `lr_policy` (default: step)                      |
-| Base learning rate                                     |                                                                | `keras_model.optimizer.lr`                                                              | `base_lr`                                        |
-| Learning rate decay over each update                   |                                                                | `keras_model.optimizer.decay`                                                           | `gamma`                                          |
-| Global regularizer to use for all layers               | `regularization_type,weight_decay`                             | The current version of Keras2DML doesnot support custom regularizers per layer.         | `regularization_type,weight_decay`               |
-| If type of the optimizer is `keras.optimizers.SGD`     |                                                                | `momentum, nesterov`                                                                    | `momentum, type`                                 |
-| If type of the optimizer is `keras.optimizers.Adam`    |                                                                | `beta_1, beta_2, epsilon`. The parameter `amsgrad` is not supported.                    | `momentum, momentum2, delta`                     |
-| If type of the optimizer is `keras.optimizers.Adagrad` |                                                                | `epsilon`                                                                               | `delta`                                          |
-
-#### How do I specify the batch size and the number of epochs ?
+# pyspark --driver-memory 20g
+# Disable Tensorflow from using GPU to avoid unnecessary evictions by SystemML runtime
+import os
+os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
+os.environ['CUDA_VISIBLE_DEVICES'] = ''
 
-Since Keras2DML is a mllearn API, it doesnot accept the batch size and number of epochs as the parameter in the `fit` method.
-Instead, these parameters are passed via `batch_size` and `max_iter` parameters in the Keras2DML constructor.
-For example, the equivalent Python code for `keras_model.fit(features, labels, epochs=10, batch_size=64)` is as follows:
+# Set channel first layer
+from keras import backend as K
+K.set_image_data_format('channels_first')
 
-```python
 from systemml.mllearn import Keras2DML
-epochs = 10
-batch_size = 64
-num_samples = features.shape[0]
-max_iter = int(epochs*math.ceil(num_samples/batch_size))
-sysml_model = Keras2DML(spark, keras_model, batch_size=batch_size, max_iter=max_iter, ...)
-sysml_model.fit(features, labels)
-``` 
-
-#### What optimizer and loss does Keras2DML use by default if `keras_model` is not compiled ?
-
-If the user does not `compile` the keras model, then we throw an error.
-
-For classification applications, you can consider using cross entropy loss and SGD optimizer with nesterov momentum:
-
-```python 
-keras_model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.SGD(lr=0.01, momentum=0.95, decay=5e-4, nesterov=True))
+import systemml as sml
+import keras, urllib
+from PIL import Image
+from keras.applications.resnet50 import preprocess_input, decode_predictions, ResNet50
+
+keras_model = ResNet50(weights='imagenet', include_top=True, pooling=None, input_shape=(3,224,224))
+keras_model.compile(optimizer='sgd', loss='categorical_crossentropy')
+
+sysml_model = Keras2DML(spark,keras_model,input_shape=(3,224,224), weights='weights_dir', labels='https://raw.githubusercontent.com/apache/systemml/master/scripts/nn/examples/caffe2dml/models/imagenet/labels.txt')
+sysml_model.summary()
+urllib.urlretrieve('https://upload.wikimedia.org/wikipedia/commons/f/f4/Cougar_sitting.jpg', 'test.jpg')
+img_shape = (3, 224, 224)
+input_image = sml.convertImageToNumPyArr(Image.open('test.jpg'), img_shape=img_shape)
+sysml_model.predict(input_image)
 ```
 
-Please refer to [Keras's documentation](https://keras.io/losses/) for more detail.
-
-#### What is the learning rate schedule used ?
-
-Keras2DML does not support the `LearningRateScheduler` callback. 
-Instead one can set the custom learning rate schedule to one of the following schedules by using the `lr_policy` parameter of the constructor:
-- `step`: return `base_lr * gamma ^ (floor(iter / step))` (default schedule)
-- `fixed`: always return `base_lr`.
-- `exp`: return `base_lr * gamma ^ iter`
-- `inv`: return `base_lr * (1 + gamma * iter) ^ (- power)`
-- `poly`: the effective learning rate follows a polynomial decay, to be zero by the max_iter. return `base_lr (1 - iter/max_iter) ^ (power)`
-- `sigmoid`: the effective learning rate follows a sigmod decay return b`ase_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))`
-
-#### How to set the size of the validation dataset ?
-
-The size of the validation dataset is determined by the parameters `test_iter` and the batch size. For example: If the batch size is 64 and 
-`test_iter` is set to 10 in the `Keras2DML`'s constructor, then the validation size is 640. This setting generates following DML code internally:
-
-```python
-num_images = nrow(y_full)
-BATCH_SIZE = 64
-num_validation = 10 * BATCH_SIZE
-X = X_full[(num_validation+1):num_images,]; y = y_full[(num_validation+1):num_images,]
-X_val = X_full[1:num_validation,]; y_val = y_full[1:num_validation,]
-num_images = nrow(y)
-``` 
-
-#### How to monitor loss via command-line ?
-
-To monitor loss, please set the parameters `display`, `test_iter` and `test_interval` in the `Keras2DML`'s constructor.  
-For example: for the expression `Keras2DML(..., display=100, test_iter=10, test_interval=500)`, we
-- display the training loss and accuracy every 100 iterations and
-- carry out validation every 500 training iterations and display validation loss and accuracy.
-
-#### How do you ensure that Keras2DML produce same results as other Keras' backend?
-
-To verify that Keras2DML produce same results as other Keras' backend, we have [Python unit tests](https://github.com/apache/systemml/blob/master/src/main/python/tests/test_nn_numpy.py)
-that compare the results of Keras2DML with that of TensorFlow. We assume that Keras team ensure that all their backends are consistent with their TensorFlow backend.
-
-#### How can I train very deep models on GPU?
-
-Unlike Keras where default train and test algorithm is `minibatch`, you can specify the
-algorithm using the parameters `train_algo` and `test_algo` (valid values are: `minibatch`, `allreduce_parallel_batches`, 
-`looped_minibatch`, and `allreduce`). Here are some common settings:
-
-|                                                                          | PySpark script                                                                                                                           | Changes to Network/Solver                                              |
-|--------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------|
-| Single-node CPU execution (similar to Caffe with solver_mode: CPU)       | `lenet.set(train_algo="minibatch", test_algo="minibatch")`                                                                               | Ensure that `batch_size` is set to appropriate value (for example: 64) |
-| Single-node single-GPU execution                                         | `lenet.set(train_algo="minibatch", test_algo="minibatch").setGPU(True).setForceGPU(True)`                                                | Ensure that `batch_size` is set to appropriate value (for example: 64) |
-| Single-node multi-GPU execution (similar to Caffe with solver_mode: GPU) | `lenet.set(train_algo="allreduce_parallel_batches", test_algo="minibatch", parallel_batches=num_gpu).setGPU(True).setForceGPU(True)`     | Ensure that `batch_size` is set to appropriate value (for example: 64) |
-| Distributed prediction                                                   | `lenet.set(test_algo="allreduce")`                                                                                                       |                                                                        |
-| Distributed synchronous training                                         | `lenet.set(train_algo="allreduce_parallel_batches", parallel_batches=num_cluster_cores)`                                                 | Ensure that `batch_size` is set to appropriate value (for example: 64) |
-
-Here are high-level guidelines to train very deep models on GPU with Keras2DML (and Caffe2DML):
-
-1. If there exists at least one layer/operator that does not fit on the device, please allow SystemML's optimizer to perform operator placement based on the memory estimates `sysml_model.setGPU(True)`.
-2. If each individual layer/operator fits on the device but not the entire network with a batch size of 1, then 
-- Rely on SystemML's GPU Memory Manager to perform automatic eviction (recommended): `sysml_model.setGPU(True) # Optional: .setForceGPU(True)`
-- Or enable Nvidia's Unified Memory:  `sysml_model.setConfigProperty('sysml.gpu.memory.allocator', 'unified_memory')`
-3. If the entire neural network does not fit in the GPU memory with the user-specified `batch_size`, but fits in the GPU memory with `local_batch_size` such that `1 << local_batch_size < batch_size`, then
-- Use either of the above two options.
-- Or enable `train_algo` that performs multiple forward-backward pass with batch size `local_batch_size`, aggregate gradients and finally updates the model: 
-```python
-sysml_model = Keras2DML(spark, keras_model, batch_size=local_batch_size)
-sysml_model.set(train_algo="looped_minibatch", parallel_batches=int(batch_size/local_batch_size))
-sysml_model.setGPU(True).setForceGPU(True)
-```
-- Or add `int(batch_size/local_batch_size)` GPUs and perform single-node multi-GPU training with batch size `local_batch_size`:
-```python
-sysml_model = Keras2DML(spark, keras_model, batch_size=local_batch_size)
-sysml_model.set(train_algo="allreduce_parallel_batches", parallel_batches=int(batch_size/local_batch_size))
-sysml_model.setGPU(True).setForceGPU(True)
-```
+Please see [Keras2DML's reference guide](http://apache.github.io/systemml/reference-guide-keras2dml) for more details.
diff --git a/docs/gpu.md b/docs/gpu.md
index f334b47..c5cdb56 100644
--- a/docs/gpu.md
+++ b/docs/gpu.md
@@ -185,6 +185,16 @@ $ ./bin/x86_64/linux/release/deviceQuery
 $ ./bin/x86_64/linux/release/bandwidthTest 
 $ ./bin/x86_64/linux/release/matrixMulCUBLAS 
 ```
+- Test CUDA and cuDNN with SystemML
+```
+$ git clone https://github.com/apache/systemml.git
+$ cd systemml
+$ mvn -Dit.test=org.apache.sysml.test.gpu.AggregateTernaryTests verify -PgpuTests
+$ mvn -Dit.test=org.apache.sysml.test.gpu.NeuralNetworkOpTests verify -PgpuTests
+```
+
+If you get a `java.lang.UnsatisfiedLinkError: libcusparse.so.9.0: cannot open shared object file: No such file or directory` error, then
+the CUDA toolkit is not installed correctly or its libraries are not included in the `LD_LIBRARY_PATH`.
 
 ### How to install CUDA 9 on Centos 7 with yum?
 
@@ -211,4 +221,4 @@ cd gcc-5.3.0
 num_cores=`grep -c ^processor /proc/cpuinfo`
 make -j $num_cores
 sudo make install
-```
+```
\ No newline at end of file
diff --git a/docs/index.md b/docs/index.md
index 4ceaee6..3169b15 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -76,8 +76,8 @@ for running SystemML from Spark via Scala, Python, or Java.
 machine in R-like and Python-like declarative languages.
 * [JMLC](jmlc) - Java Machine Learning Connector.
 * [Deep Learning with SystemML](deep-learning)
-  * *Experimental* Caffe2DML API for Deep Learning ([beginner's guide](beginners-guide-caffe2dml), [reference guide](reference-guide-caffe2dml)) - Converts a Caffe specification to DML.
-  * *Experimental* [Keras2DML API](beginners-guide-keras2dml) for Deep Learning.
+  * Keras2DML API for Deep Learning ([beginner's guide](beginners-guide-keras2dml), [reference guide](reference-guide-keras2dml)) - Converts a Keras model to DML.
+  * Caffe2DML API for Deep Learning ([beginner's guide](beginners-guide-caffe2dml), [reference guide](reference-guide-caffe2dml)) - Converts a Caffe specification to DML.
 
 ## Language Guides
 
diff --git a/docs/reference-guide-caffe2dml.md b/docs/reference-guide-caffe2dml.md
index 6242e03..1a3d154 100644
--- a/docs/reference-guide-caffe2dml.md
+++ b/docs/reference-guide-caffe2dml.md
@@ -1135,3 +1135,71 @@ class           precision       recall          f1-score        num_true_labels
 ...
 ```
 
+#### Design document of Caffe2DML
+
+1. Caffe2DML is designed to fit well into the mllearn framework. Hence, the key methods to implement are:
+- `getTrainingScript` for the `Estimator` class.
+- `getPredictionScript` for the `Model` class.
+
+These methods should be the starting point for any developer who wants to understand the DML generated for training and prediction, respectively.
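+
+For example, a developer can inspect the generated DML from Python before running anything. This is a minimal sketch: it assumes the mllearn wrapper's `get_training_script()`/`get_prediction_script()` methods and placeholder `lenet.proto`/`lenet_solver.proto` files.
+
+```python
+from systemml.mllearn import Caffe2DML
+# Construct the estimator from a Caffe network/solver specification (placeholder files)
+lenet = Caffe2DML(spark, solver='lenet_solver.proto', input_shape=(1, 28, 28))
+# Print the DML produced by getTrainingScript and getPredictionScript
+print(lenet.get_training_script())
+print(lenet.get_prediction_script())
+```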
+
+2. To simplify the DML generation in the `getTrainingScript` and `getPredictionScript` methods, we use the DMLGenerator interface.
+This interface generates DML strings for common operations such as control structures (if, for, while) and built-in functions (read, write).
+It also makes the Caffe2DML class easier to read.
+
+3. Here is an analogy to help SystemML developers understand the various moving components of Caffe2DML:
+- Like `Dml.g4` in the `org.apache.sysml.parser.dml` package, `caffe.proto` in the `src/main/proto/caffe` directory
+is used to generate classes to parse the input files.
+
+```
+Dml.g4      ---> antlr  ---> DmlLexer.java, DmlListener.java, DmlParser.java
+caffe.proto ---> protoc ---> target/generated-sources/caffe/Caffe.java
+```
+
+- Just like the classes generated from Dml.g4 are used to parse the input DML file,
+the `target/generated-sources/caffe/Caffe.java` class is used to parse the input Caffe network/deploy prototxt and solver files.
+
+- You can think of a `.caffemodel` file as a DML file with matrix values encoded in it (see the example below).
+So it is possible to read a `.caffemodel` file with the `Caffe.java` class. This is done in Utils.scala's `readCaffeNet` method.
+
+```
+X = matrix("1.2 3.5 0.999 7.123", rows=2, cols=2)
+...
+```
+
+- Just like we convert the AST generated by antlr into our DMLProgram representation, we convert
+Caffe's abstractions into the mapping classes given below for layers, solvers and learning rate policies.
+These mapping classes map the corresponding Caffe abstraction to the SystemML-NN library.
+This greatly simplifies adding new layers to Caffe2DML:
+```
+trait CaffeLayer {
+  // Any layer that wants to reuse SystemML-NN has to override the following methods, which help in generating the DML for the given layer:
+  def sourceFileName:String;
+  def init(dmlScript:StringBuilder):Unit;
+  def forward(dmlScript:StringBuilder, isPrediction:Boolean):Unit;
+  def backward(dmlScript:StringBuilder, outSuffix:String):Unit;
+  ...
+}
+trait CaffeSolver {
+  def sourceFileName:String;
+  def update(dmlScript:StringBuilder, layer:CaffeLayer):Unit;
+  def init(dmlScript:StringBuilder, layer:CaffeLayer):Unit;
+}
+```
+
+4. To simplify the traversal of the network, we created a Network interface:
+```
+trait Network {
+  def getLayers(): List[String]
+  def getCaffeLayer(layerName:String):CaffeLayer
+  def getBottomLayers(layerName:String): Set[String]
+  def getTopLayers(layerName:String): Set[String]
+  def getLayerID(layerName:String): Int
+}
+```
+
+5. One of the key design restrictions of Caffe2DML is that every layer is identified uniquely by its name.
+This restriction simplifies the code significantly.
+To shield against network files that violate this restriction, Caffe2DML performs rewrites in the CaffeNetwork class (search for conditions 1-5 in the Caffe2DML class).
+
+6. Like Caffe, Caffe2DML also expects the layers to be in sorted order.
diff --git a/docs/beginners-guide-keras2dml.md b/docs/reference-guide-keras2dml.md
similarity index 67%
copy from docs/beginners-guide-keras2dml.md
copy to docs/reference-guide-keras2dml.md
index 2259397..a576ee7 100644
--- a/docs/beginners-guide-keras2dml.md
+++ b/docs/reference-guide-keras2dml.md
@@ -27,93 +27,10 @@ limitations under the License.
 
 <br/>
 
-## Introduction
 
-Keras2DML is an **experimental API** that converts a Keras specification to DML through the intermediate Caffe2DML module. 
-It is designed to fit well into the mllearn framework and hence supports NumPy, Pandas as well as PySpark DataFrame.
+# Layers supported in Keras2DML
 
-### Getting Started 
-
-To create a Keras2DML object, one needs to create a Keras model through the Funcitonal API. please see the [Functional API.](https://keras.io/models/model/)
-This module utilizes the existing [Caffe2DML](beginners-guide-caffe2dml) backend to convert Keras models into DML. Keras models are 
-parsed and translated into Caffe prototext and caffemodel files which are then piped into Caffe2DML. Thus one can follow the Caffe2DML
-documentation for further information.
-
-### Model Conversion
-
-Keras models are parsed based on their layer structure and corresponding weights and translated into the relative Caffe layer and weight
-configuration. Be aware that currently this is a translation into Caffe and there will be loss of information from keras models such as 
-intializer information, and other layers which do not exist in Caffe. 
-
-First, install SystemML and other dependencies for the below demo:
-
-```
-pip install systemml keras tensorflow mlxtend
-``` 
-
-To create a Keras2DML object, simply pass the keras object to the Keras2DML constructor. It's also important to note that your models
-should be compiled so that the loss can be accessed for Caffe2DML.
-
-
-
-```python
-# pyspark --driver-memory 20g
-
-# Disable Tensorflow from using GPU to avoid unnecessary evictions by SystemML runtime
-import os
-os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
-os.environ['CUDA_VISIBLE_DEVICES'] = ''
-
-# Import dependencies
-from mlxtend.data import mnist_data
-import numpy as np
-from sklearn.utils import shuffle
-from keras.models import Sequential
-from keras.layers import Input, Dense, Conv2D, MaxPooling2D, Dropout,Flatten
-from keras import backend as K
-from keras.models import Model
-from keras.optimizers import SGD
-
-# Set channel first layer
-K.set_image_data_format('channels_first')
-
-# Download the MNIST dataset
-X, y = mnist_data()
-X, y = shuffle(X, y)
-
-# Split the data into training and test
-n_samples = len(X)
-X_train = X[:int(.9 * n_samples)]
-y_train = y[:int(.9 * n_samples)]
-X_test = X[int(.9 * n_samples):]
-y_test = y[int(.9 * n_samples):]
-
-# Define Lenet in Keras
-keras_model = Sequential()
-keras_model.add(Conv2D(32, kernel_size=(5, 5), activation='relu', input_shape=(1,28,28), padding='same'))
-keras_model.add(MaxPooling2D(pool_size=(2, 2)))
-keras_model.add(Conv2D(64, (5, 5), activation='relu', padding='same'))
-keras_model.add(MaxPooling2D(pool_size=(2, 2)))
-keras_model.add(Flatten())
-keras_model.add(Dense(512, activation='relu'))
-keras_model.add(Dropout(0.5))
-keras_model.add(Dense(10, activation='softmax'))
-keras_model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True))
-keras_model.summary()
-
-# Scale the input features
-scale = 0.00390625
-X_train = X_train*scale
-X_test = X_test*scale
-
-# Train Lenet using SystemML
-from systemml.mllearn import Keras2DML
-sysml_model = Keras2DML(spark, keras_model, weights='weights_dir')
-# sysml_model.setConfigProperty("sysml.native.blas", "auto")
-# sysml_model.setGPU(True).setForceGPU(True)
-sysml_model.fit(X_train, y_train)
-sysml_model.score(X_test, y_test)
-```
+TODO:
 
 # Frequently asked questions
 
@@ -132,7 +49,6 @@ print(sysml_model.get_training_script())
 |                                                        | Specified via the given parameter in the Keras2DML constructor | From input Keras' model                                                                 | Corresponding parameter in the Caffe solver file |
 |--------------------------------------------------------|----------------------------------------------------------------|-----------------------------------------------------------------------------------------|--------------------------------------------------|
 | Solver type                                            |                                                                | `type(keras_model.optimizer)`. Supported types: `keras.optimizers.{SGD, Adagrad, Adam}` | `type`                                           |
-| Maximum number of iterations                           | `max_iter`                                                     | The `epoch` parameter in the `fit` method is not supported.                             | `max_iter`                                       |
 | Validation dataset                                     | `test_iter` (explained in the below section)                   | The `validation_data` parameter in the `fit` method is not supported.                   | `test_iter`                                      |
 | Monitoring the loss                                    | `display, test_interval` (explained in the below section)      | The `LossHistory` callback in the `fit` method is not supported.                        | `display, test_interval`                         |
 | Learning rate schedule                                 | `lr_policy`                                                    | The `LearningRateScheduler` callback in the `fit` method is not supported.              | `lr_policy` (default: step)                      |
@@ -143,21 +59,11 @@ print(sysml_model.get_training_script())
 | If type of the optimizer is `keras.optimizers.Adam`    |                                                                | `beta_1, beta_2, epsilon`. The parameter `amsgrad` is not supported.                    | `momentum, momentum2, delta`                     |
 | If type of the optimizer is `keras.optimizers.Adagrad` |                                                                | `epsilon`                                                                               | `delta`                                          |
 
-#### How do I specify the batch size and the number of epochs ?
+#### How do I specify the batch size and the number of epochs?
 
-Since Keras2DML is a mllearn API, it doesnot accept the batch size and number of epochs as the parameter in the `fit` method.
-Instead, these parameters are passed via `batch_size` and `max_iter` parameters in the Keras2DML constructor.
-For example, the equivalent Python code for `keras_model.fit(features, labels, epochs=10, batch_size=64)` is as follows:
+Like Keras, the user can provide `batch_size` and `epochs` via the `fit` method. For example: `sysml_model.fit(features, labels, epochs=10, batch_size=64)`.
 
-```python
-from systemml.mllearn import Keras2DML
-epochs = 10
-batch_size = 64
-num_samples = features.shape[0]
-max_iter = int(epochs*math.ceil(num_samples/batch_size))
-sysml_model = Keras2DML(spark, keras_model, batch_size=batch_size, max_iter=max_iter, ...)
-sysml_model.fit(features, labels)
-``` 
+Note that we do not support the `verbose` and `callbacks` parameters in our `fit` method. Please use SparkContext's `setLogLevel` method to control the verbosity.
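+
+For example (a minimal sketch; `spark` is assumed to be an existing `SparkSession` and `sysml_model` an already-constructed Keras2DML estimator):
+
+```python
+# Reduce Spark's logging noise instead of using Keras' verbose flag
+spark.sparkContext.setLogLevel("ERROR")
+sysml_model.fit(features, labels, epochs=10, batch_size=64)
+```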
 
 #### What optimizer and loss does Keras2DML use by default if `keras_model` is not compiled ?
 
@@ -184,24 +90,9 @@ Instead one can set the custom learning rate schedule to one of the following sc
 
 #### How to set the size of the validation dataset ?
 
-The size of the validation dataset is determined by the parameters `test_iter` and the batch size. For example: If the batch size is 64 and 
-`test_iter` is set to 10 in the `Keras2DML`'s constructor, then the validation size is 640. This setting generates following DML code internally:
-
-```python
-num_images = nrow(y_full)
-BATCH_SIZE = 64
-num_validation = 10 * BATCH_SIZE
-X = X_full[(num_validation+1):num_images,]; y = y_full[(num_validation+1):num_images,]
-X_val = X_full[1:num_validation,]; y_val = y_full[1:num_validation,]
-num_images = nrow(y)
-``` 
-
-#### How to monitor loss via command-line ?
-
-To monitor loss, please set the parameters `display`, `test_iter` and `test_interval` in the `Keras2DML`'s constructor.  
-For example: for the expression `Keras2DML(..., display=100, test_iter=10, test_interval=500)`, we
-- display the training loss and accuracy every 100 iterations and
-- carry out validation every 500 training iterations and display validation loss and accuracy.
+Like Keras, the validation dataset can be set in two ways:
+1. `validation_split` parameter (of type `float` between 0 and 1) in the `fit` method: It is the fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch.
+2. `validation_data` parameter (a tuple `(x_val, y_val)` where `x_val` and `y_val` are NumPy arrays) in the `fit` method: data on which to evaluate the loss at the end of each epoch. The model will not be trained on this data. `validation_data` overrides `validation_split`.
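+
+For example (a minimal sketch; `X`, `y`, `X_val` and `y_val` are assumed to be NumPy arrays):
+
+```python
+# Hold out 10% of the training data for validation
+sysml_model.fit(X, y, epochs=10, batch_size=64, validation_split=0.1)
+# Or pass an explicit validation set (this overrides validation_split)
+sysml_model.fit(X, y, epochs=10, batch_size=64, validation_data=(X_val, y_val))
+```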
 
 #### How do you ensure that Keras2DML produce same results as other Keras' backend?
 
@@ -232,13 +123,26 @@ Here are high-level guidelines to train very deep models on GPU with Keras2DML (
 - Use either of the above two options.
 - Or enable `train_algo` that performs multiple forward-backward pass with batch size `local_batch_size`, aggregate gradients and finally updates the model: 
 ```python
-sysml_model = Keras2DML(spark, keras_model, batch_size=local_batch_size)
+sysml_model = Keras2DML(spark, keras_model)
 sysml_model.set(train_algo="looped_minibatch", parallel_batches=int(batch_size/local_batch_size))
 sysml_model.setGPU(True).setForceGPU(True)
+sysml_model.fit(X, y, batch_size=local_batch_size)
 ```
 - Or add `int(batch_size/local_batch_size)` GPUs and perform single-node multi-GPU training with batch size `local_batch_size`:
 ```python
-sysml_model = Keras2DML(spark, keras_model, batch_size=local_batch_size)
+sysml_model = Keras2DML(spark, keras_model)
 sysml_model.set(train_algo="allreduce_parallel_batches", parallel_batches=int(batch_size/local_batch_size))
 sysml_model.setGPU(True).setForceGPU(True)
+sysml_model.fit(X, y, batch_size=local_batch_size)
 ```
+
+#### Design document of Keras2DML
+
+Keras2DML internally utilizes the existing [Caffe2DML](beginners-guide-caffe2dml) backend to convert Keras models into DML. Keras models are
+parsed and translated into Caffe prototxt and caffemodel files, which are then piped into Caffe2DML.
+
+Keras models are parsed based on their layer structure and corresponding weights and translated into the equivalent Caffe layer and weight
+configuration. Be aware that currently this is a translation into Caffe, so there will be loss of information from Keras models, such as
+initializer information or layers that do not exist in Caffe.
+
+Read [Caffe2DML's reference guide](http://apache.github.io/systemml/reference-guide-caffe2dml) for the design documentation.
\ No newline at end of file
diff --git a/src/main/java/org/apache/sysml/runtime/controlprogram/LocalVariableMap.java b/src/main/java/org/apache/sysml/runtime/controlprogram/LocalVariableMap.java
index 58a8694..cf9e79a 100644
--- a/src/main/java/org/apache/sysml/runtime/controlprogram/LocalVariableMap.java
+++ b/src/main/java/org/apache/sysml/runtime/controlprogram/LocalVariableMap.java
@@ -19,6 +19,7 @@
 
 package org.apache.sysml.runtime.controlprogram;
 
+import java.util.ArrayList;
 import java.util.HashMap;
 import java.util.HashSet;
 import java.util.Map;
@@ -49,16 +50,47 @@ public class LocalVariableMap implements Cloneable
 	private final HashMap<String, Data> localMap;
 	private final long localID;
 	
+	// ------------------------------------------------------------------------------------
+	// Data structures that improve the performance of rmvar by maintaining a partitioned view of localMap's values.
+	// With this change, the end-to-end runtime of a sample ResNet-200 run improved from 55 seconds to 31 seconds.
+	private final ArrayList<Data> listValues;
+	private final ArrayList<Data> nonListValues;
+	// ------------------------------------------------------------------------------------
+	
 	//optional set of registered outputs
 	private HashSet<String> outputs = null;
 	
 	public LocalVariableMap() {
 		localMap = new HashMap<>();
+		listValues = new ArrayList<>();
+		nonListValues = new ArrayList<>();
 		localID = _seq.getNextID();
 	}
 	
+	private void addValue(Data val) {
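+		// Maintain the partitioned value views (listValues/nonListValues) in sync with localMap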
+		if(val != null) {
+			if(val instanceof ListObject)
+				listValues.add(val);
+			else
+				nonListValues.add(val);
+		}
+	}
+	private void removeValue(Data val) {
+		if(val != null) {
+			if(val instanceof ListObject)
+				listValues.remove(val);
+			else
+				nonListValues.remove(val);
+		}
+	}
+	
 	public LocalVariableMap(LocalVariableMap vars) {
 		localMap = new HashMap<>(vars.localMap);
+		listValues = new ArrayList<>();
+		nonListValues = new ArrayList<>();
+		for(Data val : localMap.values()) {
+			addValue(val);
+		}
 		localID = _seq.getNextID();
 	}
 
@@ -88,34 +120,61 @@ public class LocalVariableMap implements Cloneable
 	 * @param val the data value object (such as envelope)
 	 */
 	public void put(String name, Data val) {
+		if(localMap.containsKey(name))
+			removeValue(localMap.get(name));
 		localMap.put( name, val );
+		addValue(val);
 	}
 	
 	public void putAll(Map<String, Data> vals) {
-		localMap.putAll(vals);
+		for(Entry<String, Data> kv : vals.entrySet()) {
+			put(kv.getKey(), kv.getValue());
+		}
 	}
 
 	public Data remove( String name ) {
-		return localMap.remove( name );
+		Data ret = localMap.remove( name );
+		removeValue(ret);
+		return ret;
 	}
 
 	public void removeAll() {
 		localMap.clear();
+		listValues.clear();
+		nonListValues.clear();
 	}
 	
 	public void removeAllIn(Set<String> blacklist) {
-		localMap.entrySet().removeIf(
-			e -> blacklist.contains(e.getKey()));
+		for(String e : blacklist) {
+			if(localMap.containsKey(e)) {
+				remove(e);
+			}
+		}
 	}
 	
 	public void removeAllNotIn(Set<String> blacklist) {
-		localMap.entrySet().removeIf(
-			e -> !blacklist.contains(e.getKey()));
+		HashSet<String> removeKeys = new HashSet<>();
+		for(String e : localMap.keySet()) {
+			if(!blacklist.contains(e)) {
+				removeKeys.add(e);
+			}
+		}
+		for(String e : removeKeys) {
+			remove(e);
+		}
 	}
 
 	public boolean hasReferences( Data d ) {
-		return localMap.values().stream().anyMatch(e -> (e instanceof ListObject) ?
-			((ListObject)e).getData().contains(d) : e == d);
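+		// Fast path: plain (non-list) values are tracked in nonListValues; only list objects need to be scanned for nested references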
+		if(nonListValues.contains(d)) {
+			return true;
+		}
+		else if(listValues.size() > 0) {
+			return listValues.stream().anyMatch(e -> (e instanceof ListObject) ?
+					((ListObject)e).getData().contains(d) : e == d);
+		}
+		else {
+			return false;
+		}
 	}
 	
 	public void setRegisteredOutputs(HashSet<String> outputs) {
diff --git a/src/main/python/systemml/mllearn/estimators.py b/src/main/python/systemml/mllearn/estimators.py
index 0b47d8c..2c32731 100644
--- a/src/main/python/systemml/mllearn/estimators.py
+++ b/src/main/python/systemml/mllearn/estimators.py
@@ -153,6 +153,16 @@ class BaseSystemMLEstimator(Estimator):
         self.estimator.setConfigProperty(propertyName, propertyValue)
         return self
 
+    def getConfigProperty(self, propertyName):
+        """
+        Get configuration property, such as getConfigProperty("sysml.localtmpdir").
+
+        Parameters
+        ----------
+        propertyName: String
+        """
+        return self.estimator.getConfigProperty(propertyName)
+
     def _fit_df(self):
         global default_jvm_stdout, default_jvm_stdout_parallel_flush
         try:
@@ -1018,7 +1028,7 @@ class Keras2DML(Caffe2DML):
     """
 
     def __init__(self, sparkSession, keras_model, input_shape=None, transferUsingDF=False, load_keras_weights=True, weights=None, labels=None,
-                 batch_size=64, max_iter=2000, test_iter=0, test_interval=500, display=100, lr_policy="step", weight_decay=0, regularization_type="L2"):
+                 lr_policy="step", weight_decay=0, regularization_type="L2"):
         """
         Performs training/prediction for a given keras model.
 
@@ -1031,11 +1041,6 @@ class Keras2DML(Caffe2DML):
         load_keras_weights: whether to load weights from the keras_model. If False, the weights will be initialized to random value using NN libraries' init method  (default: True)
        weights: directory where learned weights are stored (default: None)
         labels: file containing mapping between index and string labels (default: None)
-        batch_size: size of the input batch (default: 64)
-        max_iter: maximum number of iterations (default: 2000)
-        test_iter: test_iter for caffe solver (default: 0)
-        test_interval: test_interval for caffe solver (default: 500)
-        display: display for caffe solver (default: 100)
         lr_policy: learning rate policy for caffe solver (default: "step")
        weight_decay: regularization strength (default: 0, recommended: 5e-4)
         regularization_type: regularization type (default: "L2")
@@ -1063,6 +1068,8 @@ class Keras2DML(Caffe2DML):
         createJavaObject(sparkSession._sc, 'dummy')
         if not hasattr(keras_model, 'optimizer'):
             raise Exception('Please compile the model before passing it to Keras2DML')
+        # Default values for Caffe solver. These will be overriden by the fit method.
+        # Default values for the Caffe solver. These will be overridden by the fit method.
         convertKerasToCaffeNetwork(
             keras_model,
             self.name + ".proto",
@@ -1079,6 +1086,7 @@ class Keras2DML(Caffe2DML):
             weight_decay,
             regularization_type)
         self.weights = tempfile.mkdtemp() if weights is None else weights
+        self.keras_model = keras_model
         if load_keras_weights:
             convertKerasToSystemMLModel(
                 sparkSession, keras_model, self.weights)
@@ -1103,3 +1111,24 @@ class Keras2DML(Caffe2DML):
     def close(self):
         import shutil
         shutil.rmtree(weights)
+
+    def fit(self, X, y=None, batch_size=32, epochs=1, validation_split=0.0, validation_data=None):
+        # verbose, callbacks flags are not supported
+        self.estimator.modifySolverParam("batch_size", str(batch_size))
+        self.estimator.modifySolverParam("epochs", str(epochs))
+        self.estimator.modifySolverParam("validation_split", str(validation_split))
+        if batch_size <= 0 or epochs <= 0 or validation_split < 0:
+            raise ValueError('Incorrect parameters to fit method: batch_size=' +  str(batch_size) + ', epochs=' +
+                             str(epochs) + ', validation_split=' + str(validation_split))
+        if validation_data is not None:
+            X_val, y_val = validation_data[0], validation_data[1]
+            if isinstance(y_val, np.ndarray) and len(y_val.shape) == 1:
+                # mllearn always expects the labels as a column vector
+                y_val = np.matrix(y_val).T
+            y_val = convertToMatrixBlock(self.sc, y_val)
+            self.estimator.setValidationData(convertToMatrixBlock(self.sc, X_val), y_val)
+        if y is not None:
+            super(Keras2DML, self).fit(X, y)
+        else:
+            super(Keras2DML, self).fit(X)
+
diff --git a/src/main/python/tests/test_nn_numpy.py b/src/main/python/tests/test_nn_numpy.py
index d30c692..8ffe1b1 100644
--- a/src/main/python/tests/test_nn_numpy.py
+++ b/src/main/python/tests/test_nn_numpy.py
@@ -95,7 +95,7 @@ def get_one_hot_encoded_labels(output_shape):
     return one_hot_labels
 
 def get_sysml_model(keras_model):
-    sysml_model = Keras2DML(spark, keras_model, weights=tmp_dir, max_iter=1, batch_size=batch_size)
+    sysml_model = Keras2DML(spark, keras_model, weights=tmp_dir)
     # For apples-to-apples comparison of output probabilities:
     # By performing one-hot encoding outside, we ensure that the ordering of the TF columns
     # matches that of SystemML
@@ -131,7 +131,7 @@ def base_test(layers, add_dense=False, test_backward=True):
     sysml_preds = sysml_model.predict_proba(sysml_matrix)
     if test_backward:
         one_hot_labels = get_one_hot_encoded_labels(keras_model.layers[-1].output_shape)
-        sysml_model.fit(sysml_matrix, one_hot_labels)
+        sysml_model.fit(sysml_matrix, one_hot_labels, batch_size=batch_size)
         sysml_preds = sysml_model.predict_proba(sysml_matrix)
     keras_preds = keras_model.predict(keras_tensor)
     if test_backward:
diff --git a/src/main/scala/org/apache/sysml/api/dl/Caffe2DML.scala b/src/main/scala/org/apache/sysml/api/dl/Caffe2DML.scala
index 9950d69..b1381ae 100644
--- a/src/main/scala/org/apache/sysml/api/dl/Caffe2DML.scala
+++ b/src/main/scala/org/apache/sysml/api/dl/Caffe2DML.scala
@@ -31,7 +31,7 @@ import scala.collection.JavaConversions._
 import java.util.ArrayList
 import caffe.Caffe.Phase
 import caffe.Caffe
-import java.util.HashSet
+import java.util.{ HashSet, HashMap }
 import org.apache.sysml.api.DMLScript
 import java.io.File
 import org.apache.spark.SparkContext
@@ -110,9 +110,9 @@ trait Network {
 
 5. One of the key design restriction of Caffe2DML is that every layer is identified uniquely by its name.
 This restriction simplifies the code significantly.
-To shield from network files that violates this restriction, Caffe2DML performs rewrites in CaffeNetwork class (search for condition 1-5).
+To shield against network files that violate this restriction, Caffe2DML performs rewrites in the CaffeNetwork class (search for conditions 1-5 in the Caffe2DML class).
 
-6. Caffe2DML also expects the layers to be in sorted order.
+6. Like Caffe, Caffe2DML also expects the layers to be in sorted order.
 
 ***************************************************************************************/
 object Caffe2DML {
@@ -209,16 +209,20 @@ class Caffe2DML(val sc: SparkContext,
     val that = new Caffe2DML(sc, solverParam, solver, net, lrPolicy, numChannels, height, width)
     copyValues(that, extra)
   }
+  var numInputRows = -1
   def fit(X_file: String, y_file: String): Caffe2DMLModel = {
+    numInputRows = -1
     mloutput = baseFit(X_file, y_file, sc)
     new Caffe2DMLModel(this)
   }
   // Note: will update the y_mb as this will be called by Python mllearn
   def fit(X_mb: MatrixBlock, y_mb: MatrixBlock): Caffe2DMLModel = {
+    numInputRows = X_mb.getNumRows
     mloutput = baseFit(X_mb, y_mb, sc)
     new Caffe2DMLModel(this)
   }
   def fit(df: ScriptsUtils.SparkDataType): Caffe2DMLModel = {
+    numInputRows = -1
     mloutput = baseFit(df, sc)
     new Caffe2DMLModel(this)
   }
@@ -256,18 +260,22 @@ class Caffe2DML(val sc: SparkContext,
   def getTrainAlgo(): String = if (inputs.containsKey("$train_algo")) inputs.get("$train_algo") else Caffe2DML.MINIBATCH_ALGORITHM
   def getTestAlgo(): String  = if (inputs.containsKey("$test_algo")) inputs.get("$test_algo") else Caffe2DML.MINIBATCH_ALGORITHM
 
-  // Prints the summary of network
-  def summary(sparkSession: org.apache.spark.sql.SparkSession): Unit = {
-    val layers = net.getLayers .map(l => (l, net.getCaffeLayer(l)))
-    val numDataLayers = layers.filter(l => l._2.isInstanceOf[Data]).length
-    val batchSizes = layers.filter(l => l._2.isInstanceOf[Data]).map(l => l._2.param.getDataParam.getBatchSize).distinct
+  def getBatchSize():Int = {
+    val layers = net.getLayers.map(l => (l, net.getCaffeLayer(l)))
+    val batchSizes = layers.filter(l => l._2.isInstanceOf[Data]).map(l => l._2.asInstanceOf[Data].getBatchSize).distinct
     if(batchSizes.size > 1) {
       Caffe2DML.LOG.warn("Multiple data layers with different batch sizes:" + batchSizes.mkString(",") + ". Using the batch size:" + batchSizes.get(0))
     }
     else if(batchSizes.size == 0) {
       Caffe2DML.LOG.warn("No data layers found and hence ignoring the memory computation.")
     }
-    val batchSize = if(batchSizes.size > 0) batchSizes.get(0) else -1 
+    if(batchSizes.size > 0) batchSizes.get(0) else -1
+  }
+  // Prints the summary of network
+  def summary(sparkSession: org.apache.spark.sql.SparkSession): Unit = {
+    val layers = net.getLayers.map(l => (l, net.getCaffeLayer(l)))
+    val numDataLayers = layers.filter(l => l._2.isInstanceOf[Data]).length
+    val batchSize = getBatchSize() 
     val header = Seq("Name", "Type", "Output", "Weight", "Bias", "Top", "Bottom", "Memory* (train/test)")
     val entries = layers
       .map(l => {
@@ -311,6 +319,38 @@ class Caffe2DML(val sc: SparkContext,
   
   // Comma is included
   def getParforParameters():String = if (inputs.containsKey("$parfor_parameters")) inputs.get("$parfor_parameters") else ""
+  
+  // Solver parameters (e.g., batch_size, epochs, validation_split) that are overridden via Keras2DML's fit() method
+  val modifiedSolverParam = new HashMap[String, String]
+  def modifySolverParam(key:String, value:String): Unit = {
+    modifiedSolverParam.put(key, value)
+  }
+  def getModifiedSolverParamInt(key:String, defaultVal:Int):Int = {
+    if(modifiedSolverParam.containsKey(key)) {
+      modifiedSolverParam.get(key).toInt
+    }
+    else {
+      defaultVal
+    }
+  }
+  def getModifiedSolverParamDouble(key:String, defaultVal:Double):Double = {
+    if(modifiedSolverParam.containsKey(key)) {
+      modifiedSolverParam.get(key).toDouble
+    }
+    else {
+      defaultVal
+    }
+  }
+  def getMaxIter():String = {
+    val isEpochSet = getModifiedSolverParamInt("epochs", -1) >= 0
+    if(isEpochSet) {
+      // Set by Keras2DML
+      nrow(Caffe2DML.X) + "*" + getModifiedSolverParamInt("epochs", -1)
+    }
+    else {
+      solverParam.getMaxIter.toString
+    }
+  }
 
   def getInputBooleanValue(key:String):Boolean = {
     if(inputs.containsKey(key))
@@ -327,6 +367,14 @@ class Caffe2DML(val sc: SparkContext,
       }
     } 
   }
+  
+  var _X_val_mb: MatrixBlock = null
+  var _y_val_mb: MatrixBlock = null
+  def setValidationData(X_mb: MatrixBlock, y_mb: MatrixBlock): Unit = {
+    _X_val_mb = X_mb
+    _y_val_mb = y_mb
+  }
+  
   // ================================================================================================
   // The below method parses the provided network and solver file and generates DML script.
   def getTrainingScript(isSingleNode: Boolean): (Script, String, String) = {
@@ -346,8 +394,6 @@ class Caffe2DML(val sc: SparkContext,
     // Initialize the layers and solvers. Reads weights and bias if $weights is set.
     initWeights(net, solver, inputs.containsKey("$weights"), layersToIgnore)
     
-    val performFusedBackwardUpdate = getInputBooleanValue("$perform_fused_backward_update")
-
     // Split into training and validation set
     // Initializes Caffe2DML.X, Caffe2DML.y, Caffe2DML.XVal, Caffe2DML.yVal and Caffe2DML.numImages
     val shouldValidate = solverParam.getTestInterval > 0 && solverParam.getTestIterCount > 0 && solverParam.getTestIter(0) > 0
@@ -365,74 +411,97 @@ class Caffe2DML(val sc: SparkContext,
       assign(tabDMLScript, "parallel_batches", "$parallel_batches")
     }
     // ----------------------------------------------------------------------------
-    // Main logic
+    val epochs = getModifiedSolverParamInt("epochs", -1)
+    val epochPrefix = "\"Epoch:\" + e + \"/\" + " + epochs
+    val iterPrefix = "\"Iter:\" + iter"
+    val iterCaffeDisplayCondition = "iter  %% " + solverParam.getDisplay + " == 0"
+    val iterCaffeValidationCondition = "iter  %% " + solverParam.getTestInterval + " == 0"
+    val noCondition = null
+    val isKeras2DML = (epochs > 0)
+    // Main script generation logic based on train_algo parameter
+    // The below code relies on control structure generation helper utilities (such as forBlock and ifBlock) and
+    // helper methods that encapsulate blocks of code (such as forward, backward, update, getBatchSize, etc).
+    // The latter helper methods are used instead of operator overloading to simplify debugging and avoid IDE-related issues. We can revisit this if necessary.
     getTrainAlgo.toLowerCase match {
       case Caffe2DML.MINIBATCH_ALGORITHM => {
-        assign(tabDMLScript, "e", "0")
-        assign(tabDMLScript, "max_iter", ifdef("$max_iter", solverParam.getMaxIter.toString))
-        forBlock("iter", "1", "max_iter") {
-          getTrainingBatch(tabDMLScript)
-          // -------------------------------------------------------
-          // Perform forward, backward and update on minibatch
-          forward;
-          if(performFusedBackwardUpdate) {
-            backwardUpdate
-          }
-          else {
-            backward; update
-          }
-          // -------------------------------------------------------
-          if(solverParam.getDisplay > 0) {
-            ifBlock("iter  %% " + solverParam.getDisplay + " == 0") {
-              displayTrainingLoss(lossLayers(0), performOneHotEncoding)
+        if(isKeras2DML) {
+          // Keras2DML script generation for minibatch algorithm
+          assign(tabDMLScript, "iter", "1")
+          assign(tabDMLScript, "epochs", epochs.toString)
+          assign(tabDMLScript, "steps_per_epochs", ceil(nrow(Caffe2DML.X) + "/" + Caffe2DML.batchSize))
+          // Keras2DML generates two loops: an outer epoch loop and an inner minibatch training loop that iterates within an epoch
+          forBlock("e", "1", "epochs") {
+            assign(tabDMLScript, "start_index", "1")
+            val batchSize = getBatchSize()
+            if(batchSize > 0 && numInputRows > 0 && numInputRows % batchSize == 0) {
+              // Avoids dynamic recompilation + reduces the script size
+              innerMinibatchTrainingLoop(if(numInputRows == batchSize) "1" else "steps_per_epochs")
             }
-            if(shouldValidate) {
-              ifBlock("iter  %% " + solverParam.getTestInterval + " == 0") {
-                displayValidationLoss(lossLayers(0), performOneHotEncoding)
-              }
+            else {
+              // Avoids dynamic recompilation
+              innerMinibatchTrainingLoop("(steps_per_epochs-1)")
+              getTrainingBatch(tabDMLScript)
+              forwardBackwardUpdate
+              increment("iter")
             }
-          }
-          performSnapshot
-          ifBlock("iter %% num_batches_per_epoch == 0") {
-            // After every epoch, update the learning rate
-            assign(tabDMLScript, "e", "e + 1")
-            tabDMLScript.append("# Learning rate\n")
+            displayKerasValidationLoss(lossLayers(0), performOneHotEncoding, noCondition, epochPrefix)
             lrPolicy.updateLearningRate(tabDMLScript)
           }
         }
+        else {
+          // Caffe2DML script generation
+          assign(tabDMLScript, "e", "0")
+          assign(tabDMLScript, "max_iter", ifdef("$max_iter", getMaxIter))
+          // Caffe2DML generates only one loop that iterates for specified number of iterations
+          forBlock("iter", "1", "max_iter") {
+            getTrainingBatch(tabDMLScript)
+            forwardBackwardUpdate
+            if(solverParam.getDisplay > 0) {
+              displayTrainingLoss(lossLayers(0), performOneHotEncoding, iterCaffeDisplayCondition)
+              if(shouldValidate) {
+                displayValidationLoss(lossLayers(0), performOneHotEncoding, iterCaffeValidationCondition)
+              }
+            }
+            performSnapshot
+            ifBlock("iter %% num_batches_per_epoch == 0") { 
+              // After every epoch, update the learning rate
+              increment("e"); lrPolicy.updateLearningRate(tabDMLScript)
+            }
+          }
+        }
       }
       case Caffe2DML.BATCH_ALGORITHM => {
-        assign(tabDMLScript, "max_iter", ifdef("$max_iter", solverParam.getMaxIter.toString))
-        assign(tabDMLScript, "max_epochs", ceil("(max_iter*" + Caffe2DML.batchSize + ")/" + Caffe2DML.numImages))
-        forBlock("e", "1", "max_epochs") {
+        if(isKeras2DML) {
+          assign(tabDMLScript, "epochs", epochs.toString)
+        }
+        else {
+          assign(tabDMLScript, "max_iter", ifdef("$max_iter", getMaxIter))
+          assign(tabDMLScript, "epochs", ceil("(max_iter*" + Caffe2DML.batchSize + ")/" + Caffe2DML.numImages))
+        }
+        forBlock("e", "1", "epochs") {
           assign(tabDMLScript, "iter", "num_batches_per_epoch*e")
           assign(tabDMLScript, "Xb", Caffe2DML.X)
           assign(tabDMLScript, "yb", Caffe2DML.y)
-          // -------------------------------------------------------
-          // Perform forward, backward and update on entire dataset
-          forward
-          if(performFusedBackwardUpdate) {
-            backwardUpdate
+          forwardBackwardUpdate
+          if(isKeras2DML) {
+            displayKerasValidationLoss(lossLayers(0), performOneHotEncoding, noCondition, epochPrefix)
           }
           else {
-            backward; update
-          }
-          // -------------------------------------------------------
-          if(solverParam.getDisplay > 0) {
-            // Show training/validation loss every epoch
-            displayTrainingLoss(lossLayers(0), performOneHotEncoding)
-            if(shouldValidate)
-              displayValidationLoss(lossLayers(0), performOneHotEncoding)
+            if(solverParam.getDisplay > 0) {
+              // Show training/validation loss every epoch
+              displayTrainingLoss(lossLayers(0), performOneHotEncoding)
+              if(shouldValidate)
+                displayValidationLoss(lossLayers(0), performOneHotEncoding, noCondition)
+            }
+            performSnapshot
           }
-          performSnapshot
           // After every epoch, update the learning rate
-          tabDMLScript.append("# Learning rate\n")
           lrPolicy.updateLearningRate(tabDMLScript)
         }
       }
       case Caffe2DML.LOOPED_MINIBATCH_ALGORITHM | Caffe2DML.ALLREDUCE_PARALLEL_BATCHES_ALGORITHM => {
         assign(tabDMLScript, "e", "0")
-        assign(tabDMLScript, "max_iter", ifdef("$max_iter", solverParam.getMaxIter.toString))
+        assign(tabDMLScript, "max_iter", ifdef("$max_iter", getMaxIter))
         forBlock("iter", "1", "max_iter", "parallel_batches") {  
           assign(tabDMLScript, "allreduce_start_index", "((iter-1) * " + Caffe2DML.batchSize + ") %% " + Caffe2DML.numImages + " + 1; ")
           ifBlock("(allreduce_start_index + parallel_batches*" + Caffe2DML.batchSize + " - 1) > " + Caffe2DML.numImages) {
@@ -448,7 +517,6 @@ class Caffe2DML(val sc: SparkContext,
             Caffe2DML.USE_PLUS_EQ = true
             forBlock("j", "1", "parallel_batches", "1") _
           }
-          
           iterBlock {
             // Get a mini-batch in this group
             assign(tabDMLScript, "beg", "allreduce_start_index + (j-1)*" + Caffe2DML.batchSize)
@@ -456,26 +524,24 @@ class Caffe2DML(val sc: SparkContext,
             rightIndexing(tabDMLScript, "Xb", Caffe2DML.X, "beg", "end")
             rightIndexing(tabDMLScript, "yb", Caffe2DML.y, "beg", "end")
             forward; backward
-            if(performFusedBackwardUpdate && inputs.containsKey("$perform_fused_backward_update")) {
-              // Warn user only if the user explicitly ask for it
-              Caffe2DML.LOG.warn("Fused backward update is not supported for allreduce_parallel_batches")
-            }
             flattenGradients
-            if(solverParam.getDisplay > 0) {
-              ifBlock("(iter + j - 1)  %% " + solverParam.getDisplay + " == 0") {
-                displayTrainingLoss(lossLayers(0), performOneHotEncoding)
-              }
+            if(!isKeras2DML && solverParam.getDisplay > 0) {
+              displayTrainingLoss(lossLayers(0), performOneHotEncoding, "(iter + j - 1)  %% " + solverParam.getDisplay + " == 0")
             }
           }
           aggregateAggGradients
           update
-          if(solverParam.getDisplay > 0 && shouldValidate) {
-            val iterMatrix = matrix("seq(iter, iter+parallel_batches-1)", "parallel_batches", "1")
-            ifBlock(sum(iterMatrix + " %% " + solverParam.getTestInterval + " == 0") + " > 0") {
-              displayValidationLoss(lossLayers(0), performOneHotEncoding)
+          val iterMatrix = matrix("seq(iter, iter+parallel_batches-1)", "parallel_batches", "1")
+          val validationCondition = sum(iterMatrix + " %% " + solverParam.getTestInterval + " == 0") + " > 0"
+          if(isKeras2DML) {
+            displayKerasValidationLoss(lossLayers(0), performOneHotEncoding, validationCondition, iterPrefix)
+          }
+          else {
+            if(solverParam.getDisplay > 0 && shouldValidate) {
+              displayValidationLoss(lossLayers(0), performOneHotEncoding, validationCondition)
             }
+            performSnapshot
           }
-          performSnapshot
           Caffe2DML.USE_PLUS_EQ = old_USE_PLUS_EQ
         }
       }
@@ -492,28 +558,22 @@ class Caffe2DML(val sc: SparkContext,
             assign(tabDMLScript, "Xb", Caffe2DML.X + "[j,]")
             assign(tabDMLScript, "yb", Caffe2DML.y + "[j,]")
             forward; backward
-            if(performFusedBackwardUpdate && inputs.containsKey("$perform_fused_backward_update")) {
-              // Warn user only if the user explicitly ask for it
-              Caffe2DML.LOG.warn("Fused backward update is not supported for allreduce_parallel_batches")
-            }
             flattenGradients
           }
           aggregateAggGradients
           update
           // -------------------------------------------------------
-          if(solverParam.getDisplay > 0) {
-            ifBlock("iter  %% " + solverParam.getDisplay + " == 0") {
-              assign(tabDMLScript, "Xb", Caffe2DML.X + "[beg:end,]")
-              assign(tabDMLScript, "yb", Caffe2DML.y + "[beg:end,]")
-              displayTrainingLoss(lossLayers(0), performOneHotEncoding)
-            }
-            if(shouldValidate) {
-              ifBlock("iter  %% " + solverParam.getTestInterval + " == 0") {
-                displayValidationLoss(lossLayers(0), performOneHotEncoding)
-              }
+          if(isKeras2DML) {
+            displayKerasValidationLoss(lossLayers(0), performOneHotEncoding, iterCaffeValidationCondition, iterPrefix)
+          }
+          else {
+            if(solverParam.getDisplay > 0) {
+              displayTrainingLoss(lossLayers(0), performOneHotEncoding, iterCaffeDisplayCondition, true)
+              if(shouldValidate) 
+                displayValidationLoss(lossLayers(0), performOneHotEncoding, iterCaffeValidationCondition)
             }
+            performSnapshot
           }
-          performSnapshot
         }
       }
       case _ => throw new DMLRuntimeException("Unsupported train algo:" + getTrainAlgo)
@@ -527,6 +587,9 @@ class Caffe2DML(val sc: SparkContext,
 
     // Set input/output variables and execute the script
     val script = dml(trainingScript).in(inputs)
+    if(getModifiedSolverParamDouble("validation_split", -1) != -1 && _X_val_mb != null) {
+      script.in(Caffe2DML.XVal, _X_val_mb).in(Caffe2DML.yVal, _y_val_mb)
+    }
     net.getLayers.map(net.getCaffeLayer(_)).filter(_.weight != null).map(l => script.out(l.weight))
     net.getLayers.map(net.getCaffeLayer(_)).filter(_.extraWeight != null).map(l => script.out(l.extraWeight))
     net.getLayers.map(net.getCaffeLayer(_)).filter(_.bias != null).map(l => script.out(l.bias))
@@ -538,110 +601,232 @@ class Caffe2DML(val sc: SparkContext,
   // ================================================================================================
   // -------------------------------------------------------------------------------------------
   // Helper functions to generate DML
+  private def innerMinibatchTrainingLoop(numSteps:String):Unit = {
+    def innerMinibatchTrainingLoopBody():Unit = {
+      rightIndexing(tabDMLScript, "Xb", Caffe2DML.X, "start_index", "(start_index+" + Caffe2DML.batchSize + "-1)")
+      rightIndexing(tabDMLScript, "yb", Caffe2DML.y, "start_index", "(start_index+" + Caffe2DML.batchSize + "-1)")
+      forwardBackwardUpdate
+      increment("iter")
+      assign(tabDMLScript, "start_index", "start_index + " + Caffe2DML.batchSize)
+    }
+    if(numSteps.equals("1")) {
+      innerMinibatchTrainingLoopBody
+    }
+    else {
+      forBlock("j", "1", numSteps) {
+        innerMinibatchTrainingLoopBody
+      }
+    }
+  }
+  private def increment(variable:String) = {
+    assign(tabDMLScript, variable, variable + " + 1")
+  }
+  private def forwardBackwardUpdate() {
+    // -------------------------------------------------------
+    // Perform forward, backward and update on minibatch
+    val performFusedBackwardUpdate = getInputBooleanValue("$perform_fused_backward_update")
+    forward;
+    if(performFusedBackwardUpdate) {
+      backwardUpdate
+    }
+    else {
+      backward; update
+    }
+    // -------------------------------------------------------
+  }
+  
+  // Helper code to generate DML that creates training and validation data. 
   // Initializes Caffe2DML.X, Caffe2DML.y, Caffe2DML.XVal, Caffe2DML.yVal and Caffe2DML.numImages
   private def trainTestSplit(numValidationBatches: Int): Unit = {
-    if (numValidationBatches > 0) {
-      if (solverParam.getDisplay <= 0)
-        throw new DMLRuntimeException("Since test_iter and test_interval is greater than zero, you should set display to be greater than zero")
-      assign(tabDMLScript, Caffe2DML.numValidationImages, numValidationBatches + " * " + Caffe2DML.batchSize)
-      tabDMLScript.append("# Sanity check to ensure that validation set is not too large\n")
-      val maxValidationSize = ceil("0.3 * " + Caffe2DML.numImages)
-      ifBlock(Caffe2DML.numValidationImages + " > " + maxValidationSize) {
-        stop(tabDMLScript, dmlConcat(
-            asDMLString("Too large validation size. Please reduce test_iter to "), floor(maxValidationSize + " / " + Caffe2DML.batchSize)))
+    val keras_validation_split = getModifiedSolverParamDouble("validation_split", -1)
+    if(keras_validation_split >= 1) {
+      throw new DMLRuntimeException("validation_split should be in the range [0, 1), but got " + keras_validation_split)
+    }
+    if(keras_validation_split != -1) {
+      // Invoked via Keras2DML
+      if(_X_val_mb != null) {
+        // Since validation data is provided via Keras2DML, use that.
+        assign(tabDMLScript, Caffe2DML.XVal, "read(\" \")")
+        assign(tabDMLScript, Caffe2DML.yVal, "read(\" \")")
+      }
+      else if(keras_validation_split > 0) {
+        // Since validation data is not provided via Keras2DML but validation_split is provided, split the training dataset
+        assign(tabDMLScript, Caffe2DML.numValidationImages, Caffe2DML.numImages + " * " + keras_validation_split)
+        rightIndexing(tabDMLScript, Caffe2DML.X, "X_full", int_add(Caffe2DML.numValidationImages, "1"), Caffe2DML.numImages)
+        rightIndexing(tabDMLScript, Caffe2DML.y, "y_full", int_add(Caffe2DML.numValidationImages, "1"), Caffe2DML.numImages)
+        rightIndexing(tabDMLScript, Caffe2DML.XVal, "X_full", "1", Caffe2DML.numValidationImages)
+        rightIndexing(tabDMLScript, Caffe2DML.yVal, "y_full", "1", Caffe2DML.numValidationImages)
+      }
+      else {
+        // Since keras_validation_split <= 0 and no validation data is provided via Keras2DML, do not split the training dataset
+        assign(tabDMLScript, Caffe2DML.X, "X_full")
+        assign(tabDMLScript, Caffe2DML.y, "y_full")
+      }
+    }
+    else {
+      // Invoked via Caffe2DML
+      if (numValidationBatches > 0) {
+        if (solverParam.getDisplay <= 0)
+          throw new DMLRuntimeException("Since test_iter and test_interval is greater than zero, you should set display to be greater than zero")
+        assign(tabDMLScript, Caffe2DML.numValidationImages, numValidationBatches + " * " + Caffe2DML.batchSize)
+        tabDMLScript.append("# Sanity check to ensure that validation set is not too large\n")
+        val maxValidationSize = ceil("0.3 * " + Caffe2DML.numImages)
+        ifBlock(Caffe2DML.numValidationImages + " > " + maxValidationSize) {
+          stop(tabDMLScript, dmlConcat(
+              asDMLString("Too large validation size. Please reduce test_iter to "), floor(maxValidationSize + " / " + Caffe2DML.batchSize)))
+        }
+        rightIndexing(tabDMLScript, Caffe2DML.X, "X_full", int_add(Caffe2DML.numValidationImages, "1"), Caffe2DML.numImages)
+        rightIndexing(tabDMLScript, Caffe2DML.y, "y_full", int_add(Caffe2DML.numValidationImages, "1"), Caffe2DML.numImages)
+        rightIndexing(tabDMLScript, Caffe2DML.XVal, "X_full", "1", Caffe2DML.numValidationImages)
+        rightIndexing(tabDMLScript, Caffe2DML.yVal, "y_full", "1", Caffe2DML.numValidationImages)
+      } else {
+        assign(tabDMLScript, Caffe2DML.X, "X_full")
+        assign(tabDMLScript, Caffe2DML.y, "y_full")
       }
-      rightIndexing(tabDMLScript, Caffe2DML.X, "X_full", int_add(Caffe2DML.numValidationImages, "1"), Caffe2DML.numImages)
-      rightIndexing(tabDMLScript, Caffe2DML.y, "y_full", int_add(Caffe2DML.numValidationImages, "1"), Caffe2DML.numImages)
-      rightIndexing(tabDMLScript, Caffe2DML.XVal, "X_full", "1", Caffe2DML.numValidationImages)
-      rightIndexing(tabDMLScript, Caffe2DML.yVal, "y_full", "1", Caffe2DML.numValidationImages)
-    } else {
-      assign(tabDMLScript, Caffe2DML.X, "X_full")
-      assign(tabDMLScript, Caffe2DML.y, "y_full")
     }
     assign(tabDMLScript, Caffe2DML.numImages, nrow(Caffe2DML.y))
   }
   
-  private def displayTrainingLoss(lossLayer: IsLossLayer, performOneHotEncoding:Boolean): Unit = {
-    val DEBUG_TRAINING = getInputBooleanValue("$debug")
-    tabDMLScript.append("# Compute training loss & accuracy\n")
-    assign(tabDMLScript, "loss", "0"); assign(tabDMLScript, "accuracy", "0")
-    lossLayer.computeLoss(dmlScript, numTabs)
-    assign(tabDMLScript, "training_loss", "loss"); assign(tabDMLScript, "training_accuracy", "accuracy")
-    tabDMLScript.append(
-      print(dmlConcat(asDMLString("Iter:"), "iter", asDMLString(", training loss:"), "training_loss", asDMLString(", training accuracy:"), "training_accuracy"))
-    )
-    if(performOneHotEncoding && DEBUG_TRAINING && !trainAlgoContainsParfor) {
-      printClassificationReport
+  private def displayTrainingLoss(lossLayer: IsLossLayer, performOneHotEncoding:Boolean, cond:String=null, shouldExtractBatch:Boolean=false): Unit = {
+    def innerTrainingLossCode():Unit = {
+      val DEBUG_TRAINING = getInputBooleanValue("$debug")
+      tabDMLScript.append("# Compute training loss & accuracy\n")
+      assign(tabDMLScript, "loss", "0"); assign(tabDMLScript, "accuracy", "0")
+      lossLayer.computeLoss(dmlScript, numTabs)
+      assign(tabDMLScript, "training_loss", "loss"); assign(tabDMLScript, "training_accuracy", "accuracy")
+      tabDMLScript.append(
+        print(dmlConcat(asDMLString("Iter:"), "iter", asDMLString(", training loss:"), "training_loss", asDMLString(", training accuracy:"), "training_accuracy"))
+      )
+      if(performOneHotEncoding && DEBUG_TRAINING && !trainAlgoContainsParfor) {
+        printClassificationReport
+      }
+    }
+    if(shouldExtractBatch) {
+      assign(tabDMLScript, "Xb", Caffe2DML.X + "[beg:end,]")
+      assign(tabDMLScript, "yb", Caffe2DML.y + "[beg:end,]")
+    }
+    if(cond == null) {
+      innerTrainingLossCode
+    }
+    else {
+      ifBlock(cond) {
+        innerTrainingLossCode
+      }
     }
   }
   
-  private def displayValidationLoss(lossLayer: IsLossLayer, performOneHotEncoding:Boolean): Unit = {
+  // Helper utility to generate DML script that displays Keras2DML's validation loss
+  private def displayKerasValidationLoss(lossLayer: IsLossLayer, performOneHotEncoding:Boolean, cond:String, 
+      iterString:String): Unit = {
+    if(getModifiedSolverParamDouble("validation_split", -1) != 0 || _X_val_mb != null) {
+      // Only display loss if validation data is provided
+      displayValidationLoss(lossLayer, performOneHotEncoding, cond, iterString)
+    }
+    else {
+      // Else just print the epoch or iter
+      tabDMLScript.append(print(iterString))
+    }
+  }
+  
+  // Helper utility to generate DML script that displays the validation loss based on test_algo
+  private def displayValidationLoss(lossLayer: IsLossLayer, performOneHotEncoding:Boolean, cond:String, 
+      iterString:String="\"Iter:\" + iter"): Unit = {
     if (trainAlgoContainsParfor && testAlgoContainsParfor) {
       Caffe2DML.LOG.warn("The setting: train_algo=" + getTrainAlgo + " and test_algo=" + getTestAlgo + " is not recommended. Consider changing test_algo=minibatch")
     }
     // Append the DML to compute validation loss
     val numValidationBatches = if (solverParam.getTestIterCount > 0) solverParam.getTestIter(0) else 0
-    tabDMLScript.append("# Compute validation loss & accuracy\n")
-    assign(tabDMLScript, "loss", "0"); assign(tabDMLScript, "accuracy", "0")
-    getTestAlgo.toLowerCase match {
-      case Caffe2DML.MINIBATCH_ALGORITHM | Caffe2DML.LOOPED_MINIBATCH_ALGORITHM => {
-        assign(tabDMLScript, "validation_loss", "0")
-        assign(tabDMLScript, "validation_accuracy", "0")
-        forBlock("iVal", "1", "num_batches_per_epoch") {
-          getValidationBatch(tabDMLScript)
-          forward; lossLayer.computeLoss(dmlScript, numTabs)
-          tabDMLScript.append("validation_loss = validation_loss + loss\n")
-          tabDMLScript.append("validation_accuracy = validation_accuracy + accuracy\n")
+    
+    def innerValidationLossCode():Unit = {
+      tabDMLScript.append("# Compute validation loss & accuracy\n")
+      assign(tabDMLScript, "loss", "0"); assign(tabDMLScript, "accuracy", "0")
+      // ------------------------------------------------------------------------------------------
+      // Preloop code:
+      getTestAlgo.toLowerCase match {
+        case Caffe2DML.MINIBATCH_ALGORITHM | Caffe2DML.LOOPED_MINIBATCH_ALGORITHM => {
+          assign(tabDMLScript, "validation_loss", "0")
+          assign(tabDMLScript, "validation_accuracy", "0")
         }
-        tabDMLScript.append("validation_accuracy = validation_accuracy / num_batches_per_epoch\n")
-      }
-      case Caffe2DML.BATCH_ALGORITHM => {
-        assign(tabDMLScript, "Xb", Caffe2DML.XVal); assign(tabDMLScript, "yb", Caffe2DML.yVal)
-        net.getLayers.map(layer => net.getCaffeLayer(layer).forward(tabDMLScript, false))
-        lossLayer.computeLoss(dmlScript, numTabs)
-        assign(tabDMLScript, "validation_loss", "loss"); assign(tabDMLScript, "validation_accuracy", "accuracy")
-
+        case Caffe2DML.BATCH_ALGORITHM => { }
+        case Caffe2DML.ALLREDUCE_PARALLEL_BATCHES_ALGORITHM => {
+          assign(tabDMLScript, "max_validation_iter", "as.integer(ceil(" + Caffe2DML.numValidationImages + "/" + Caffe2DML.batchSize + "))")
+          assign(tabDMLScript, "group_validation_loss", matrix("0", "max_validation_iter", "1"))
+          assign(tabDMLScript, "group_validation_accuracy", matrix("0", "max_validation_iter", "1"))
+        }
+        case Caffe2DML.ALLREDUCE_ALGORITHM => {
+          assign(tabDMLScript, "group_validation_loss", matrix("0", Caffe2DML.numValidationImages, "1"))
+          assign(tabDMLScript, "group_validation_accuracy", matrix("0", Caffe2DML.numValidationImages, "1"))
+        }
+        case _ => throw new DMLRuntimeException("Unsupported test algo:" + getTestAlgo)
       }
-      case Caffe2DML.ALLREDUCE_PARALLEL_BATCHES_ALGORITHM => {
-        // This setting uses the batch size provided by the user
-        assign(tabDMLScript, "max_validation_iter", "as.integer(ceil(" + Caffe2DML.numValidationImages + "/" + Caffe2DML.batchSize + "))")
-        assign(tabDMLScript, "group_validation_loss", matrix("0", "max_validation_iter", "1"))
-        assign(tabDMLScript, "group_validation_accuracy", matrix("0", "max_validation_iter", "1"))
-        parForBlock("iVal", "1", "max_validation_iter", "1", getParforParameters()) {
-          assign(tabDMLScript, "validation_beg", "(iVal-1) * " + Caffe2DML.batchSize + " + 1")
-          assign(tabDMLScript, "validation_end", min(Caffe2DML.numValidationImages, "validation_beg + " + Caffe2DML.batchSize + " - 1"))
-          assign(tabDMLScript, "Xb", Caffe2DML.XVal + "[validation_beg:validation_end,]")
-          assign(tabDMLScript, "yb", Caffe2DML.yVal + "[validation_beg:validation_end,]")
-          net.getLayers.map(layer => net.getCaffeLayer(layer).forward(tabDMLScript, false))
-          lossLayer.computeLoss(dmlScript, numTabs)
-          assign(tabDMLScript, "group_validation_loss[iVal,1]", "loss")
-          assign(tabDMLScript, "group_validation_accuracy[iVal,1]", "accuracy")
+      // ------------------------------------------------------------------------------------------
+      getTestAlgo.toLowerCase match {
+        case Caffe2DML.MINIBATCH_ALGORITHM | Caffe2DML.LOOPED_MINIBATCH_ALGORITHM => {
+          forBlock("iVal", "1", "num_batches_per_epoch") {
+            getValidationBatch(tabDMLScript)
+            forward; lossLayer.computeLoss(dmlScript, numTabs)
+            tabDMLScript.append("validation_loss = validation_loss + loss\n")
+            tabDMLScript.append("validation_accuracy = validation_accuracy + accuracy\n")
+          }
         }
-        assign(tabDMLScript, "validation_loss", "sum(group_validation_loss)")
-        assign(tabDMLScript, "validation_accuracy", "mean(group_validation_accuracy)")
+        case Caffe2DML.BATCH_ALGORITHM => {
+          assign(tabDMLScript, "Xb", Caffe2DML.XVal); assign(tabDMLScript, "yb", Caffe2DML.yVal)
+          forward; lossLayer.computeLoss(dmlScript, numTabs)
+          assign(tabDMLScript, "validation_loss", "loss"); assign(tabDMLScript, "validation_accuracy", "accuracy")
+        }
+        case Caffe2DML.ALLREDUCE_PARALLEL_BATCHES_ALGORITHM => {
+          // This setting uses the batch size provided by the user
+          parForBlock("iVal", "1", "max_validation_iter", "1", getParforParameters()) {
+            assign(tabDMLScript, "validation_beg", "(iVal-1) * " + Caffe2DML.batchSize + " + 1")
+            assign(tabDMLScript, "validation_end", min(Caffe2DML.numValidationImages, "validation_beg + " + Caffe2DML.batchSize + " - 1"))
+            rightIndexing(tabDMLScript, "Xb", Caffe2DML.XVal, "validation_beg", "validation_end")
+            rightIndexing(tabDMLScript, "yb", Caffe2DML.yVal, "validation_beg", "validation_end")
+            forward; lossLayer.computeLoss(dmlScript, numTabs)
+            assign(tabDMLScript, "group_validation_loss[iVal,1]", "loss")
+            assign(tabDMLScript, "group_validation_accuracy[iVal,1]", "accuracy")
+          }
+        }
+        case Caffe2DML.ALLREDUCE_ALGORITHM => {
+          // This setting does not use the batch size for validation and allows the parfor optimizer to select a plan
+          // by minimizing the memory requirement (i.e. batch size = 1)
+          parForBlock("iVal", "1", Caffe2DML.numValidationImages, "1", getParforParameters()) {
+            rightIndexing(tabDMLScript, "Xb", Caffe2DML.XVal, "iVal", "")
+            rightIndexing(tabDMLScript, "yb", Caffe2DML.yVal, "iVal", "")
+            forward; lossLayer.computeLoss(dmlScript, numTabs)
+            assign(tabDMLScript, "group_validation_loss[iVal,1]", "loss")
+            assign(tabDMLScript, "group_validation_accuracy[iVal,1]", "accuracy")
+          }
+        }
+        case _ => throw new DMLRuntimeException("Unsupported test algo:" + getTestAlgo)
       }
-      case Caffe2DML.ALLREDUCE_ALGORITHM => {
-        // This setting doesnot use the batch size for validation and allows the parfor optimizer to select plan
-        // by minimizing the memory requirement (i.e. batch size = 1)
-        assign(tabDMLScript, "group_validation_loss", matrix("0", Caffe2DML.numValidationImages, "1"))
-        assign(tabDMLScript, "group_validation_accuracy", matrix("0", Caffe2DML.numValidationImages, "1"))
-        parForBlock("iVal", "1", Caffe2DML.numValidationImages, "1", getParforParameters()) {
-          assign(tabDMLScript, "Xb", Caffe2DML.XVal + "[iVal,]")
-          assign(tabDMLScript, "yb", Caffe2DML.yVal + "[iVal,]")
-          net.getLayers.map(layer => net.getCaffeLayer(layer).forward(tabDMLScript, false))
-          lossLayer.computeLoss(dmlScript, numTabs)
-          assign(tabDMLScript, "group_validation_loss[iVal,1]", "loss")
-          assign(tabDMLScript, "group_validation_accuracy[iVal,1]", "accuracy")
+      // ------------------------------------------------------------------------------------------
+      // Post-loop code:
+      getTestAlgo.toLowerCase match {
+        case Caffe2DML.MINIBATCH_ALGORITHM | Caffe2DML.LOOPED_MINIBATCH_ALGORITHM => {
+          tabDMLScript.append("validation_accuracy = validation_accuracy / num_batches_per_epoch\n")
+        }
+        case Caffe2DML.BATCH_ALGORITHM => { }
+        case Caffe2DML.ALLREDUCE_PARALLEL_BATCHES_ALGORITHM | Caffe2DML.ALLREDUCE_ALGORITHM => {
+          assign(tabDMLScript, "validation_loss", "sum(group_validation_loss)")
+          assign(tabDMLScript, "validation_accuracy", "mean(group_validation_accuracy)")
         }
-        assign(tabDMLScript, "validation_loss", "sum(group_validation_loss)")
-        assign(tabDMLScript, "validation_accuracy", "mean(group_validation_accuracy)")
+        case _ => throw new DMLRuntimeException("Unsupported test algo:" + getTestAlgo)
+      }
+      // ------------------------------------------------------------------------------------------
+      tabDMLScript.append( 
+        print(dmlConcat(iterString, asDMLString(", validation loss:"), "validation_loss", 
+            asDMLString(", validation accuracy:"), "validation_accuracy"))
+      )
+    }
+    
+    if(cond == null) {
+      innerValidationLossCode
+    }
+    else {
+      ifBlock(cond) {
+        innerValidationLossCode
       }
-
-      case _ => throw new DMLRuntimeException("Unsupported test algo:" + getTestAlgo)
     }
-    tabDMLScript.append(
-      print(dmlConcat(asDMLString("Iter:"), "iter", asDMLString(", validation loss:"), "validation_loss", asDMLString(", validation accuracy:"), "validation_accuracy"))
-    )
   }
   private def appendSnapshotWrite(varName: String, fileName: String): Unit =
     tabDMLScript.append(write(varName, "snapshot_dir + \"" + fileName + "\"", "binary"))
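
For readers tracing the new trainTestSplit logic above: when Keras2DML supplies a validation_split, the generated DML uses the first num_images * validation_split rows of X_full/y_full as the validation set and trains on the remainder. A minimal NumPy sketch of that partitioning, for illustration only (the function and variable names below are hypothetical and not part of this commit):

    import numpy as np

    def split_for_validation(X_full, y_full, validation_split):
        # The first n_val rows become the validation set and the rest is used
        # for training, mirroring the rightIndexing calls generated for
        # Caffe2DML.XVal/yVal and Caffe2DML.X/y above. int() here approximates
        # the rounding applied in the generated DML.
        n_val = int(X_full.shape[0] * validation_split)
        X_val, y_val = X_full[:n_val], y_full[:n_val]
        X_train, y_train = X_full[n_val:], y_full[n_val:]
        return X_train, y_train, X_val, y_val
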
diff --git a/src/main/scala/org/apache/sysml/api/dl/CaffeLayer.scala b/src/main/scala/org/apache/sysml/api/dl/CaffeLayer.scala
index bf86f38..8fd37d9 100644
--- a/src/main/scala/org/apache/sysml/api/dl/CaffeLayer.scala
+++ b/src/main/scala/org/apache/sysml/api/dl/CaffeLayer.scala
@@ -193,12 +193,11 @@ class Data(val param: LayerParameter, val id: Int, val net: CaffeNetwork, val nu
     if (param.hasTransformParam && param.getTransformParam.hasScale) {
       dmlScript.append("X_full = X_full * " + param.getTransformParam.getScale + "\n")
     }
-    if (param.hasDataParam && param.getDataParam.hasBatchSize) {
-      dmlScript.append("BATCH_SIZE = " + param.getDataParam.getBatchSize + "\n")
-    } else {
-      Caffe2DML.LOG.debug("Using default batch size of 64 as batch size is not set with DataParam")
-      dmlScript.append("BATCH_SIZE = 64\n")
-    }
+    dmlScript.append(Caffe2DML.batchSize + " = " + getBatchSize + "\n" )
+  }
+  def getBatchSize():Int = {
+    val defaultBatchSize = if(param.hasDataParam && param.getDataParam.hasBatchSize) param.getDataParam.getBatchSize else 64
+    caffe2dmlObj.getModifiedSolverParamInt("batch_size", defaultBatchSize)
   }
   var dataOutputShape                                                   = ("$num_channels", "$height", "$width")
   override def forward(dmlScript: StringBuilder, isPrediction: Boolean) = {}
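
The getBatchSize change above resolves the effective batch size in this order: a batch_size forwarded from Keras2DML's fit() (through the modified solver parameters), then the Caffe DataParam batch size from the prototxt, then a default of 64. A rough Python equivalent of that lookup, for illustration only (names are hypothetical):

    def effective_batch_size(modified_solver_params, data_param_batch_size=None):
        # A batch_size passed through fit() wins; otherwise fall back to the
        # prototxt DataParam value, and finally to the default of 64.
        default = data_param_batch_size if data_param_batch_size is not None else 64
        return modified_solver_params.get("batch_size", default)
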
diff --git a/src/main/scala/org/apache/sysml/api/dl/CaffeSolver.scala b/src/main/scala/org/apache/sysml/api/dl/CaffeSolver.scala
index da963b9..36e2ecc 100644
--- a/src/main/scala/org/apache/sysml/api/dl/CaffeSolver.scala
+++ b/src/main/scala/org/apache/sysml/api/dl/CaffeSolver.scala
@@ -93,6 +93,7 @@ class LearningRatePolicy(lr_policy: String = "exp", base_lr: Double = 0.01) {
   def setStepsize(step1: Double): Unit = step = step1
   def setPower(power1: Double): Unit   = power = power1
   def updateLearningRate(dmlScript: StringBuilder): Unit = {
+    dmlScript.append("# Learning rate\n")
     val new_lr = lr_policy.toLowerCase match {
       case "fixed"   => base_lr.toString
       case "step"    => "(" + base_lr + " * " + gamma + " ^ " + " floor(e/" + step + "))"
diff --git a/src/main/scala/org/apache/sysml/api/ml/BaseSystemMLClassifier.scala b/src/main/scala/org/apache/sysml/api/ml/BaseSystemMLClassifier.scala
index c46310d..1e0caef 100644
--- a/src/main/scala/org/apache/sysml/api/ml/BaseSystemMLClassifier.scala
+++ b/src/main/scala/org/apache/sysml/api/ml/BaseSystemMLClassifier.scala
@@ -39,6 +39,7 @@ import org.apache.sysml.api.mlcontext.ScriptFactory._
 import org.apache.spark.sql._
 import org.apache.sysml.api.mlcontext.MLContext.ExplainLevel
 import org.apache.sysml.hops.OptimizerUtils;
+import org.apache.sysml.conf.{ConfigurationManager, DMLConfig}
 
 import java.util.HashMap
 
@@ -130,9 +131,19 @@ trait BaseSystemMLEstimatorOrModel {
   def setStatistics(statistics1: Boolean): BaseSystemMLEstimatorOrModel                           = { statistics = statistics1; this }
   def setStatisticsMaxHeavyHitters(statisticsMaxHeavyHitters1: Int): BaseSystemMLEstimatorOrModel = { statisticsMaxHeavyHitters = statisticsMaxHeavyHitters1; this }
   def setConfigProperty(key: String, value: String): BaseSystemMLEstimatorOrModel                 = { config.put(key, value); this }
+  var localDmlConfig:DMLConfig = null
+  def getConfigProperty(key: String): String = {
+    if(config.containsKey(key))
+      return config.get(key)
+    if(localDmlConfig == null) {
+      localDmlConfig = ConfigurationManager.getDMLConfig()
+    }
+    return localDmlConfig.getTextValue(key)
+  }
+  
   def updateML(ml: MLContext): Unit = {
-	System.gc();
-	ml.setGPU(enableGPU); ml.setForceGPU(forceGPU);
+	  System.gc();
+	  ml.setGPU(enableGPU); ml.setForceGPU(forceGPU);
     ml.setExplain(explain); ml.setExplainLevel(explainLevel);
     ml.setStatistics(statistics); ml.setStatisticsMaxHeavyHitters(statisticsMaxHeavyHitters);
     config.map(x => ml.setConfigProperty(x._1, x._2))