Posted to dev@submarine.apache.org by pi...@apache.org on 2021/08/27 05:11:49 UTC
[submarine] branch master updated: SUBMARINE-901. Tensorflow distributed training example with model management API

This is an automated email from the ASF dual-hosted git repository.

pingsutw pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/submarine.git


The following commit(s) were added to refs/heads/master by this push:
     new acdd01f  SUBMARINE-901. Tensorflow distributed training example with model management API
acdd01f is described below

commit acdd01f4a037a438fa3bbb0aded9094f5d863505
Author: featherchen <ga...@gmail.com>
AuthorDate: Thu Aug 26 23:18:50 2021 +0800

    SUBMARINE-901. Tensorflow distributed training example with model management API
    
    ### What is this PR for?
    <!-- A few sentences describing the overall goals of the pull request's commits.
    First time? Check out the contributing guide - https://submarine.apache.org/contribution/contributions.html
    -->
    Add TensorFlow distributed training examples that use the Submarine SDK
    ### What type of PR is it?
    Improvement
    
    ### Todos
    * [ ] - ParameterServer Strategy
    
    ### What is the Jira issue?
    <!-- * Open an issue on Jira https://issues.apache.org/jira/browse/SUBMARINE/
    * Put link here, and add [SUBMARINE-*Jira number*] in PR title, eg. `SUBMARINE-23. PR title`
    -->
    https://issues.apache.org/jira/browse/SUBMARINE-901?filter=-1
    ### How should this be tested?
    <!--
    * First time? Setup Travis CI as described on https://submarine.apache.org/contribution/contributions.html#continuous-integration
    * Strongly recommended: add automated unit tests for any new or changed behavior
    * Outline any manual steps to test the PR here.
    -->
    ### Screenshots (if appropriate)
    Mirrored
    ![Screenshot from 2021-08-24 01-24-13](https://user-images.githubusercontent.com/57944334/130491783-6e71a093-b843-4ab7-b95b-f4b5410937cf.png)
    ![Screenshot from 2021-08-24 01-23-26](https://user-images.githubusercontent.com/57944334/130491797-f95c1c28-9298-4aa4-8a6f-e130b0c36f9c.png)
    
    MultiWorkerMirrored
    ![Screenshot from 2021-08-24 01-27-54](https://user-images.githubusercontent.com/57944334/130491857-edb2b78c-ea21-4837-9789-8c924a3ab5c2.png)
    ![Screenshot from 2021-08-24 01-27-35](https://user-images.githubusercontent.com/57944334/130491869-1876d312-26c4-44c5-90c1-8445d8932317.png)
    
    ### Questions:
    * Do the license files need updating? No
    * Are there breaking changes for older versions? No
    * Does this need new documentation? No
    
    Author: featherchen <ga...@gmail.com>
    
    Signed-off-by: Kevin <pi...@apache.org>
    
    Closes #700 from featherchen/SUBMARINE-901 and squashes the following commits:
    
    73aeb38e [featherchen] SUBMARINE-901. delete redundant comments
    6cbece13 [featherchen] SUBMARINE-901. fix coding style
    419002a0 [featherchen] SUBMARINE-901. Add MultiWorker and ps example
    2e3969d4 [featherchen] SUBMARINE-901. add multiworkerMirrored
    8ce4f730 [featherchen] SUBMARINE-901. README
    2f04d12a [featherchen] SUBMARINE-901. tf example
---
 .../mnist-tensorflow/MirroredStrategy/Dockerfile   |  25 +++++
 .../mnist-tensorflow/MirroredStrategy/build.sh     |  44 ++++++++
 .../MirroredStrategy/mnist_keras_distributed.py    | 111 +++++++++++++++++++++
 .../mnist-tensorflow/MirroredStrategy/post.sh      |  38 +++++++
 .../mnist-tensorflow/MirroredStrategy/readme.md    |  25 +++++
 .../MultiWorkerMirroredStrategy/Dockerfile         |  25 +++++
 .../MultiWorkerMirroredStrategy/build.sh           |  44 ++++++++
 .../mnist_keras_distributed.py                     |  95 ++++++++++++++++++
 .../MultiWorkerMirroredStrategy/post.sh            |  38 +++++++
 .../MultiWorkerMirroredStrategy/readme.md          |  25 +++++
 .../ParameterServerStrategy/Dockerfile             |  25 +++++
 .../ParameterServerStrategy/build.sh               |  44 ++++++++
 .../mnist_keras_distributed.py                     |  88 ++++++++++++++++
 .../ParameterServerStrategy/post.sh                |  42 ++++++++
 .../ParameterServerStrategy/readme.md              |  25 +++++
 dev-support/examples/nn-pytorch/model.py           |   2 +-
 dev-support/examples/nn-pytorch/readme.md          |   7 +-
 .../pysubmarine/submarine/models/client.py         |  10 +-
 18 files changed, 710 insertions(+), 3 deletions(-)

diff --git a/dev-support/examples/mnist-tensorflow/MirroredStrategy/Dockerfile b/dev-support/examples/mnist-tensorflow/MirroredStrategy/Dockerfile
new file mode 100644
index 0000000..e0f3fd5
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MirroredStrategy/Dockerfile
@@ -0,0 +1,25 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+FROM python:3.7
+MAINTAINER Apache Software Foundation <de...@submarine.apache.org>
+
+ADD ./tmp/submarine-sdk /opt/
+RUN pip install /opt/pysubmarine
+RUN pip install tensorflow==2.3.0
+RUN pip install tensorboard
+RUN pip install tensorflow_datasets==2.1.0
+
+ADD ./mnist_keras_distributed.py /opt/
\ No newline at end of file
diff --git a/dev-support/examples/mnist-tensorflow/MirroredStrategy/build.sh b/dev-support/examples/mnist-tensorflow/MirroredStrategy/build.sh
new file mode 100755
index 0000000..ab72b5b
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MirroredStrategy/build.sh
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+set -euxo pipefail
+
+SUBMARINE_VERSION=0.6.0-SNAPSHOT
+SUBMARINE_IMAGE_NAME="mirrored:${SUBMARINE_VERSION}"
+
+if [ -L ${BASH_SOURCE-$0} ]; then
+  PWD=$(dirname $(readlink "${BASH_SOURCE-$0}"))
+else
+  PWD=$(dirname ${BASH_SOURCE-$0})
+fi
+export CURRENT_PATH=$(cd "${PWD}">/dev/null; pwd)
+export SUBMARINE_HOME=${CURRENT_PATH}/../../../..
+
+if [ -d "${CURRENT_PATH}/tmp" ] # if old tmp folder is still there, delete it.
+then
+  rm -rf "${CURRENT_PATH}/tmp"
+fi
+
+mkdir -p "${CURRENT_PATH}/tmp"
+cp -r "${SUBMARINE_HOME}/submarine-sdk" "${CURRENT_PATH}/tmp"
+
+# build image
+cd ${CURRENT_PATH}
+echo "Start building the ${SUBMARINE_IMAGE_NAME} docker image ..."
+docker build -t ${SUBMARINE_IMAGE_NAME} .
+
+# clean temp file
+rm -rf "${CURRENT_PATH}/tmp"
diff --git a/dev-support/examples/mnist-tensorflow/MirroredStrategy/mnist_keras_distributed.py b/dev-support/examples/mnist-tensorflow/MirroredStrategy/mnist_keras_distributed.py
new file mode 100644
index 0000000..7f4ae6a
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MirroredStrategy/mnist_keras_distributed.py
@@ -0,0 +1,111 @@
+"""
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+"""
+
+from submarine import ModelsClient
+import tensorflow_datasets as tfds
+import tensorflow as tf
+import os
+import tensorboard
+
+datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
+mnist_train, mnist_test = datasets['train'], datasets['test']
+
+strategy = tf.distribute.MirroredStrategy()
+
+print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
+
+# You can also do info.splits.total_num_examples to get the total
+# number of examples in the dataset.
+
+num_train_examples = info.splits['train'].num_examples
+num_test_examples = info.splits['test'].num_examples
+
+BUFFER_SIZE = 10000
+
+BATCH_SIZE_PER_REPLICA = 64
+BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
+
+def scale(image, label):
+  image = tf.cast(image, tf.float32)
+  image /= 255
+
+  return image, label
+
+train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
+eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)
+
+with strategy.scope():
+  model = tf.keras.Sequential([
+      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
+      tf.keras.layers.MaxPooling2D(),
+      tf.keras.layers.Flatten(),
+      tf.keras.layers.Dense(64, activation='relu'),
+      tf.keras.layers.Dense(10)
+  ])
+
+  model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+                optimizer=tf.keras.optimizers.Adam(),
+                metrics=['accuracy'])
+
+# Define the checkpoint directory to store the checkpoints.
+checkpoint_dir = './training_checkpoints'
+# Define the name of the checkpoint files.
+checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
+
+# Define a function for decaying the learning rate.
+# You can define any decay function you need.
+def decay(epoch):
+  if epoch < 3:
+    return 1e-3
+  elif epoch >= 3 and epoch < 7:
+    return 1e-4
+  else:
+    return 1e-5
+
+# Define a callback for printing the learning rate at the end of each epoch.
+class PrintLR(tf.keras.callbacks.Callback):
+  def on_epoch_end(self, epoch, logs=None):
+    print('\nLearning rate for epoch {} is {}'.format(epoch + 1,
+                                                      model.optimizer.lr.numpy()))
+    modelClient.log_metric("lr", model.optimizer.lr.numpy())
+
+# Put all the callbacks together.
+callbacks = [
+    tf.keras.callbacks.TensorBoard(log_dir='./logs'),
+    tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
+                                       save_weights_only=True),
+    tf.keras.callbacks.LearningRateScheduler(decay),
+    PrintLR()
+]
+
+if __name__ == "__main__":
+  modelClient = ModelsClient()
+  with modelClient.start() as run:
+    EPOCHS = 5
+    hist = model.fit(train_dataset, epochs=EPOCHS, callbacks=callbacks)
+    for i in range(EPOCHS):
+      modelClient.log_metric("loss", hist.history['loss'][i])
+      modelClient.log_metric("accuracy", hist.history['accuracy'][i])
+    model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
+    eval_loss, eval_acc = model.evaluate(eval_dataset)
+    print('Eval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))
+    modelClient.log_param("loss", eval_loss)
+    modelClient.log_param("acc", eval_acc)
+
+"""Reference:
+https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy
+"""
\ No newline at end of file
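
The batch-size arithmetic and the learning-rate schedule in the script above can be checked without TensorFlow. The following is a minimal sketch, assuming a hypothetical replica count of 2 (in the script the value comes from `strategy.num_replicas_in_sync`); `global_batch_size` is an illustrative helper and is not part of the commit.

```python
BATCH_SIZE_PER_REPLICA = 64  # same value as in the example script


def global_batch_size(num_replicas, per_replica=BATCH_SIZE_PER_REPLICA):
    """Global batch size passed to Dataset.batch() under MirroredStrategy."""
    return per_replica * num_replicas


def decay(epoch):
    """Step schedule from the example: 1e-3, 1e-4 from epoch 3, 1e-5 from epoch 7."""
    if epoch < 3:
        return 1e-3
    if epoch < 7:
        return 1e-4
    return 1e-5


if __name__ == "__main__":
    # Assumed replica count of 2; real runs read it from the strategy object.
    print(global_batch_size(2))
    # Rates for the 5 epochs the example trains for.
    print([decay(epoch) for epoch in range(5)])
```

Because the example trains for 5 epochs, only the first two steps of the schedule (1e-3 and 1e-4) are ever reached.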
diff --git a/dev-support/examples/mnist-tensorflow/MirroredStrategy/post.sh b/dev-support/examples/mnist-tensorflow/MirroredStrategy/post.sh
new file mode 100755
index 0000000..5926843
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MirroredStrategy/post.sh
@@ -0,0 +1,38 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+curl -X POST -H "Content-Type: application/json" -d '
+{
+  "meta": {
+    "name": "mirrored-example",
+    "namespace": "default",
+    "framework": "TensorFlow",
+    "cmd": "python /opt/mnist_keras_distributed.py",
+    "envVars": {
+      "ENV_1": "ENV1"
+    }
+  },
+  "environment": {
+    "image": "mirrored:0.6.0-SNAPSHOT"
+  },
+  "spec": {
+    "Worker": {
+      "replicas": 1,
+      "resources": "cpu=1,memory=512M"
+    }
+  }
+}
+' http://127.0.0.1:32080/api/v1/experiment
\ No newline at end of file
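
The same request can also be assembled from Python. The sketch below only builds the request object and does not send it; the URL and payload fields are copied from post.sh, while `build_request` and its parameters are hypothetical names introduced for illustration. It assumes the Submarine server is reachable at the address used in post.sh.

```python
import json
import urllib.request

# URL copied from post.sh; adjust it to your own deployment.
EXPERIMENT_URL = "http://127.0.0.1:32080/api/v1/experiment"


def build_request(name="mirrored-example",
                  image="mirrored:0.6.0-SNAPSHOT",
                  replicas=1):
    """Build (but do not send) the experiment POST request from post.sh."""
    payload = {
        "meta": {
            "name": name,
            "namespace": "default",
            "framework": "TensorFlow",
            "cmd": "python /opt/mnist_keras_distributed.py",
            "envVars": {"ENV_1": "ENV1"},
        },
        "environment": {"image": image},
        "spec": {
            "Worker": {
                "replicas": replicas,
                "resources": "cpu=1,memory=512M",
            }
        },
    }
    return urllib.request.Request(
        EXPERIMENT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_request()
    print(request.get_method(), request.full_url)
    # To submit for real (requires a running server):
    # urllib.request.urlopen(request)
```

Keeping the payload in a function makes it easy to reuse for the other examples by changing only the name, image and replica count.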
diff --git a/dev-support/examples/mnist-tensorflow/MirroredStrategy/readme.md b/dev-support/examples/mnist-tensorflow/MirroredStrategy/readme.md
new file mode 100644
index 0000000..96544a0
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MirroredStrategy/readme.md
@@ -0,0 +1,25 @@
+# TF MirroredStrategy Example
+
+## Usage
+
+This is a simple MNIST example of how to train a distributed TensorFlow model using MirroredStrategy and track metrics and parameters with submarine-sdk.
+
+## How to execute
+
+0. Set up (only needed once per terminal session)
+
+```bash
+eval $(minikube -p minikube docker-env)
+```
+
+1. Build the docker image
+
+```bash
+./dev-support/examples/mnist-tensorflow/MirroredStrategy/build.sh
+```
+
+2. Submit a POST request
+
+```bash
+./dev-support/examples/mnist-tensorflow/MirroredStrategy/post.sh
+```
diff --git a/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/Dockerfile b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/Dockerfile
new file mode 100644
index 0000000..e0f3fd5
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/Dockerfile
@@ -0,0 +1,25 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+FROM python:3.7
+MAINTAINER Apache Software Foundation <de...@submarine.apache.org>
+
+ADD ./tmp/submarine-sdk /opt/
+RUN pip install /opt/pysubmarine
+RUN pip install tensorflow==2.3.0
+RUN pip install tensorboard
+RUN pip install tensorflow_datasets==2.1.0
+
+ADD ./mnist_keras_distributed.py /opt/
\ No newline at end of file
diff --git a/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/build.sh b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/build.sh
new file mode 100755
index 0000000..a0fb21d
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/build.sh
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+set -euxo pipefail
+
+SUBMARINE_VERSION=0.6.0-SNAPSHOT
+SUBMARINE_IMAGE_NAME="multi-worker-mirrored:${SUBMARINE_VERSION}"
+
+if [ -L ${BASH_SOURCE-$0} ]; then
+  PWD=$(dirname $(readlink "${BASH_SOURCE-$0}"))
+else
+  PWD=$(dirname ${BASH_SOURCE-$0})
+fi
+export CURRENT_PATH=$(cd "${PWD}">/dev/null; pwd)
+export SUBMARINE_HOME=${CURRENT_PATH}/../../../..
+
+if [ -d "${CURRENT_PATH}/tmp" ] # if old tmp folder is still there, delete it.
+then
+  rm -rf "${CURRENT_PATH}/tmp"
+fi
+
+mkdir -p "${CURRENT_PATH}/tmp"
+cp -r "${SUBMARINE_HOME}/submarine-sdk" "${CURRENT_PATH}/tmp"
+
+# build image
+cd ${CURRENT_PATH}
+echo "Start building the ${SUBMARINE_IMAGE_NAME} docker image ..."
+docker build -t ${SUBMARINE_IMAGE_NAME} .
+
+# clean temp file
+rm -rf "${CURRENT_PATH}/tmp"
diff --git a/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/mnist_keras_distributed.py b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/mnist_keras_distributed.py
new file mode 100644
index 0000000..4b3c032
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/mnist_keras_distributed.py
@@ -0,0 +1,95 @@
+"""
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+"""
+from submarine import ModelsClient
+import json
+import os
+import sys
+import tensorflow as tf
+import numpy as np
+import tensorflow_datasets as tfds
+
+BUFFER_SIZE = 10000
+BATCH_SIZE = 32
+
+strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
+
+def make_datasets_unbatched():
+  # Scale MNIST pixel values from [0, 255] to [0., 1.].
+  def scale(image, label):
+    image = tf.cast(image, tf.float32)
+    image /= 255
+    return image, label
+
+  datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
+
+  return datasets['train'].map(scale, num_parallel_calls=tf.data.experimental.AUTOTUNE).cache().shuffle(BUFFER_SIZE)
+
+def build_and_compile_cnn_model():
+  model = tf.keras.Sequential([
+      tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
+      tf.keras.layers.MaxPooling2D(),
+      tf.keras.layers.Flatten(),
+      tf.keras.layers.Dense(64, activation='relu'),
+      tf.keras.layers.Dense(10)
+  ])
+  model.compile(
+      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+      optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
+      metrics=['accuracy'])
+  return model
+  
+tf_config = json.loads(os.environ['TF_CONFIG'])
+NUM_WORKERS = len(tf_config['cluster']['worker'])
+
+# The batch size scales with the number of workers because
+# `tf.data.Dataset.batch` expects the global batch size: 64 per worker,
+# so 4 workers (as in post.sh) give a global batch size of 256.
+GLOBAL_BATCH_SIZE = 64 * NUM_WORKERS
+
+#Creation of dataset needs to be after MultiWorkerMirroredStrategy object
+#is instantiated.
+train_datasets = make_datasets_unbatched().batch(GLOBAL_BATCH_SIZE)
+
+# Shard by data rather than by file, so each worker reads a distinct slice even when the dataset has fewer files than workers.
+options = tf.data.Options()
+options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA  # AutoShardPolicy.OFF can work too.
+train_datasets_no_auto_shard = train_datasets.with_options(options)
+
+with strategy.scope():
+  #Model building/compiling need to be within `strategy.scope()`.
+  multi_worker_model = build_and_compile_cnn_model()
+
+#Keras' `model.fit()` trains the model with specified number of epochs and
+#number of steps per epoch. Note that the numbers here are for demonstration
+#purposes only and may not sufficiently produce a model with good quality.
+
+# Note: pass train_datasets_no_auto_shard to fit(), not train_datasets.
+
+if __name__ == "__main__":
+  modelClient = ModelsClient()
+  with modelClient.start() as run:
+    EPOCHS = 5
+    hist = multi_worker_model.fit(x=train_datasets_no_auto_shard, epochs=EPOCHS, steps_per_epoch=5)
+    for i in range(EPOCHS):
+      modelClient.log_metric("loss", hist.history['loss'][i])
+      modelClient.log_metric("accuracy", hist.history['accuracy'][i])
+
+
+"""Reference
+https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
+https://reurl.cc/no9Zk8
+"""
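
The worker count and global batch size that the script derives from `TF_CONFIG` can be reproduced with the standard library alone. This is a minimal sketch: the sample `TF_CONFIG` value and its host names are made up for illustration (a real one is injected into each worker's environment by the platform), and `global_batch_size` is a hypothetical helper name.

```python
import json

# Hypothetical TF_CONFIG for worker 0 of a 4-worker job; host names are made up.
SAMPLE_TF_CONFIG = json.dumps({
    "cluster": {
        "worker": ["worker-%d.example:2222" % i for i in range(4)],
    },
    "task": {"type": "worker", "index": 0},
})


def global_batch_size(tf_config_json, per_worker_batch=64):
    """Return (number of workers, global batch size) for a TF_CONFIG string."""
    tf_config = json.loads(tf_config_json)
    num_workers = len(tf_config["cluster"]["worker"])
    return num_workers, per_worker_batch * num_workers


if __name__ == "__main__":
    print(global_batch_size(SAMPLE_TF_CONFIG))
```

With the 4 worker replicas requested in the accompanying post.sh, a per-worker batch of 64 corresponds to a global batch of 256.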
diff --git a/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/post.sh b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/post.sh
new file mode 100755
index 0000000..901b4a3
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/post.sh
@@ -0,0 +1,38 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+curl -X POST -H "Content-Type: application/json" -d '
+{
+  "meta": {
+    "name": "multi-worker-mirrored-example",
+    "namespace": "default",
+    "framework": "TensorFlow",
+    "cmd": "python /opt/mnist_keras_distributed.py",
+    "envVars": {
+      "ENV_1": "ENV1"
+    }
+  },
+  "environment": {
+    "image": "multi-worker-mirrored:0.6.0-SNAPSHOT"
+  },
+  "spec": {
+    "Worker": {
+      "replicas": 4,
+      "resources": "cpu=1,memory=512M"
+    }
+  }
+}
+' http://127.0.0.1:32080/api/v1/experiment
\ No newline at end of file
diff --git a/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/readme.md b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/readme.md
new file mode 100644
index 0000000..171e07e
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/readme.md
@@ -0,0 +1,25 @@
+# TF MultiWorkerMirroredStrategy Example
+
+## Usage
+
+This is a simple MNIST example of how to train a distributed TensorFlow model using MultiWorkerMirroredStrategy and track metrics with submarine-sdk.
+
+## How to execute
+
+0. Set up (only needed once per terminal session)
+
+```bash
+eval $(minikube -p minikube docker-env)
+```
+
+1. Build the docker image
+
+```bash
+./dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/build.sh
+```
+
+2. Submit a POST request
+
+```bash
+./dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/post.sh
+```
diff --git a/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/Dockerfile b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/Dockerfile
new file mode 100644
index 0000000..b7fb6df
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/Dockerfile
@@ -0,0 +1,25 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+FROM python:3.7
+MAINTAINER Apache Software Foundation <de...@submarine.apache.org>
+
+ADD ./tmp/submarine-sdk /opt/
+RUN pip install /opt/pysubmarine
+RUN pip install tf-nightly
+RUN pip install tensorboard
+RUN pip install tensorflow_datasets==2.1.0
+
+ADD ./mnist_keras_distributed.py /opt/
\ No newline at end of file
diff --git a/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/build.sh b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/build.sh
new file mode 100755
index 0000000..c4652ec
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/build.sh
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+set -euxo pipefail
+
+SUBMARINE_VERSION=0.6.0-SNAPSHOT
+SUBMARINE_IMAGE_NAME="parameter-server:${SUBMARINE_VERSION}"
+
+if [ -L ${BASH_SOURCE-$0} ]; then
+  PWD=$(dirname $(readlink "${BASH_SOURCE-$0}"))
+else
+  PWD=$(dirname ${BASH_SOURCE-$0})
+fi
+export CURRENT_PATH=$(cd "${PWD}">/dev/null; pwd)
+export SUBMARINE_HOME=${CURRENT_PATH}/../../../..
+
+if [ -d "${CURRENT_PATH}/tmp" ] # if old tmp folder is still there, delete it.
+then
+  rm -rf "${CURRENT_PATH}/tmp"
+fi
+
+mkdir -p "${CURRENT_PATH}/tmp"
+cp -r "${SUBMARINE_HOME}/submarine-sdk" "${CURRENT_PATH}/tmp"
+
+# build image
+cd ${CURRENT_PATH}
+echo "Start building the ${SUBMARINE_IMAGE_NAME} docker image ..."
+docker build -t ${SUBMARINE_IMAGE_NAME} .
+
+# clean temp file
+rm -rf "${CURRENT_PATH}/tmp"
diff --git a/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/mnist_keras_distributed.py b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/mnist_keras_distributed.py
new file mode 100644
index 0000000..3a9cbd8
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/mnist_keras_distributed.py
@@ -0,0 +1,88 @@
+"""
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements.  See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership.  The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License.  You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied.  See the License for the
+ specific language governing permissions and limitations
+ under the License.
+"""
+import os
+from submarine import ModelsClient
+import tensorflow as tf
+import json
+from tensorflow.keras.layers.experimental import preprocessing
+import tensorflow_datasets as tfds
+import tensorboard
+
+print(tf.__version__)
+
+TF_CONFIG = os.environ.get('TF_CONFIG', '')
+NUM_PS = len(json.loads(TF_CONFIG)['cluster']['ps'])
+cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
+
+variable_partitioner = (
+    tf.distribute.experimental.partitioners.MinSizePartitioner(
+        min_shard_bytes=(256 << 10),
+        max_shards=NUM_PS))
+
+strategy = tf.distribute.experimental.ParameterServerStrategy(
+    cluster_resolver,
+    variable_partitioner=variable_partitioner)
+
+def dataset_fn(input_context):
+  global_batch_size = 64
+  batch_size = input_context.get_per_replica_batch_size(global_batch_size)
+
+  x = tf.random.uniform((10, 10))
+  y = tf.random.uniform((10,))
+
+  dataset = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(10).repeat()
+  dataset = dataset.shard(
+      input_context.num_input_pipelines,
+      input_context.input_pipeline_id)
+  dataset = dataset.batch(batch_size)
+  dataset = dataset.prefetch(2)
+
+  return dataset
+
+dc = tf.keras.utils.experimental.DatasetCreator(dataset_fn)
+
+with strategy.scope():
+  model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
+
+model.compile(tf.keras.optimizers.SGD(), loss='mse', steps_per_execution=10)
+
+working_dir = '/tmp/my_working_dir'
+log_dir = os.path.join(working_dir, 'log')
+ckpt_filepath = os.path.join(working_dir, 'ckpt')
+backup_dir = os.path.join(working_dir, 'backup')
+
+callbacks = [
+    tf.keras.callbacks.TensorBoard(log_dir=log_dir),
+    tf.keras.callbacks.ModelCheckpoint(filepath=ckpt_filepath),
+    tf.keras.callbacks.experimental.BackupAndRestore(backup_dir=backup_dir),
+]
+
+# Training is started in the __main__ block below so the run is tracked by ModelsClient.
+if __name__ == "__main__":
+  modelClient = ModelsClient()
+  with modelClient.start() as run:
+    EPOCHS = 5
+    hist = model.fit(dc, epochs=EPOCHS, steps_per_epoch=20, callbacks=callbacks)
+    for i in range(EPOCHS):
+      # compile() sets no extra metrics, so only the training loss is logged.
+      modelClient.log_metric("loss", hist.history['loss'][i])
+    # Checkpoints are written to ckpt_filepath by the ModelCheckpoint callback.
+
+"""
+Reference:
+https://www.tensorflow.org/tutorials/distribute/parameter_server_training
+"""
diff --git a/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/post.sh b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/post.sh
new file mode 100755
index 0000000..d6709f8
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/post.sh
@@ -0,0 +1,42 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+curl -X POST -H "Content-Type: application/json" -d '
+{
+  "meta": {
+    "name": "parameter-server-example",
+    "namespace": "default",
+    "framework": "TensorFlow",
+    "cmd": "python /opt/mnist_keras_distributed.py",
+    "envVars": {
+      "ENV_1": "ENV1"
+    }
+  },
+  "environment": {
+    "image": "parameter-server:0.6.0-SNAPSHOT"
+  },
+  "spec": {
+    "Ps": {
+      "replicas": 1,
+      "resources": "cpu=1,memory=128M"
+    },
+    "Worker": {
+      "replicas": 3,
+      "resources": "cpu=1,memory=128M"
+    }
+  }
+}
+' http://127.0.0.1:32080/api/v1/experiment
\ No newline at end of file
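
The request that post.sh sends can also be assembled from Python. The following is a minimal sketch, not part of this commit: the helper name `build_experiment_request` and its parameters are hypothetical, and the endpoint and field values are copied from post.sh above. It only builds the request object; nothing is sent.

```python
import json
import urllib.request

# Assumption: the same local endpoint that post.sh targets.
EXPERIMENT_URL = "http://127.0.0.1:32080/api/v1/experiment"


def build_experiment_request(name, image, cmd, ps_replicas=1, worker_replicas=3,
                             resources="cpu=1,memory=128M", url=EXPERIMENT_URL):
    """Build (but do not send) a POST request mirroring the JSON body in post.sh."""
    spec = {
        "meta": {
            "name": name,
            "namespace": "default",
            "framework": "TensorFlow",
            "cmd": cmd,
            "envVars": {"ENV_1": "ENV1"},
        },
        "environment": {"image": image},
        "spec": {
            "Ps": {"replicas": ps_replicas, "resources": resources},
            "Worker": {"replicas": worker_replicas, "resources": resources},
        },
    }
    return urllib.request.Request(
        url,
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit it, which requires a running Submarine server at that address.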
diff --git a/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/readme.md b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/readme.md
new file mode 100644
index 0000000..e2ebccc
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/readme.md
@@ -0,0 +1,25 @@
+# TF ParameterServerStrategy Example (Beta)
+
+## Usage
+
+This is a simple mnist example of how to train a distributed TensorFlow model using ParameterServerStrategy and track metrics and parameters with submarine-sdk.
+
+## How to execute
+
+0. Set up (only needed once per terminal session)
+
+```bash
+eval $(minikube -p minikube docker-env)
+```
+
+1. Build the docker image
+
+```bash
+./dev-support/examples/mnist-tensorflow/ParameterServerStrategy/build.sh
+```
+
+2. Submit a POST request
+
+```bash
+./dev-support/examples/mnist-tensorflow/ParameterServerStrategy/post.sh
+```
diff --git a/dev-support/examples/nn-pytorch/model.py b/dev-support/examples/nn-pytorch/model.py
index 732d85a..6080c4a 100644
--- a/dev-support/examples/nn-pytorch/model.py
+++ b/dev-support/examples/nn-pytorch/model.py
@@ -32,4 +32,4 @@ class LinearNNModel(torch.nn.Module):
 if __name__ == "__main__":
     client = ModelsClient()
     net = LinearNNModel()
-    client.log_model("simple-nn-model", net)
+    client.save_model(model_type = "pytorch", model = net, artifact_path="pytorch-nn-model", registered_model_name="simple-nn-model")
\ No newline at end of file
diff --git a/dev-support/examples/nn-pytorch/readme.md b/dev-support/examples/nn-pytorch/readme.md
index f6e7172..ea999fc 100644
--- a/dev-support/examples/nn-pytorch/readme.md
+++ b/dev-support/examples/nn-pytorch/readme.md
@@ -1,6 +1,7 @@
 # Save_model Example
 
 ## Usage
+
 This is an easy example of saving a pytorch linear model to model registry.
 
 ## How to execute
@@ -21,6 +22,7 @@ This is an easy example of saving a pytorch linear model to model registry.
 
 1. Make sure the model is saved in the model registry (viewed on MLflow UI)
 2. Call serve API to create serve resource
+
 - Request
   ```
   curl -X POST -H "Content-Type: application/json" -d '
@@ -46,6 +48,7 @@ This is an easy example of saving a pytorch linear model to model registry.
   ```
 
 3. Send data to inference
+
 - Request
   ```
   curl -d '{"data":[[-1, -1]]}' -H 'Content-Type: application/json; format=pandas-split' -X POST http://127.0.0.1:32080/serve/simple-nn-model-1/invocations
@@ -54,7 +57,9 @@ This is an easy example of saving a pytorch linear model to model registry.
   ```
   [{"0": -0.5663654804229736}]
   ```
+
 4. Call serve API to delete serve resource
+
 - Request
   ```
   curl -X DELETE http://0.0.0.0:32080/api/v1/experiment/serve?modelName=simple-nn-model&modelVersion=1&namespace=default
@@ -62,4 +67,4 @@ This is an easy example of saving a pytorch linear model to model registry.
 - Response
   ```
   {"status":"OK","code":200,"success":true,"message":null,"result":{"url":"/serve/simple-nn-model-1"},"attributes":{}}
-  ```
\ No newline at end of file
+  ```
diff --git a/submarine-sdk/pysubmarine/submarine/models/client.py b/submarine-sdk/pysubmarine/submarine/models/client.py
index 084778f..d258d48 100644
--- a/submarine-sdk/pysubmarine/submarine/models/client.py
+++ b/submarine-sdk/pysubmarine/submarine/models/client.py
@@ -15,6 +15,7 @@
  under the License.
 """
 import os
+import time
 
 import mlflow
 from mlflow.exceptions import MlflowException
@@ -109,7 +110,14 @@ class ModelsClient():
         try:
             experiment = mlflow.get_experiment_by_name(experiment_name)
             if experiment is None:  # if not found
-                raise MlflowException("No valid experiment has been found")
+                run_name = get_worker_index()
+                if run_name == "worker-0":
+                    raise MlflowException("No valid experiment has been found")
+                else:
+                    while experiment is None:
+                        time.sleep(1)
+                        experiment = mlflow.get_experiment_by_name(
+                            experiment_name)
             return experiment.experiment_id  # if found
         except MlflowException:
             experiment = mlflow.create_experiment(name=experiment_name)
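
In the hunk above, workers other than worker-0 poll `mlflow.get_experiment_by_name` once per second with no upper bound, so they wait forever if worker-0 never creates the experiment. The following is a minimal sketch of the same polling pattern with a deadline. It is not part of this commit: the helper `wait_for` and its parameters are hypothetical, and the lookup is passed in as a callable so the sketch does not depend on MLflow.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for(lookup: Callable[[], Optional[T]],
             timeout_s: float = 60.0,
             interval_s: float = 1.0) -> T:
    """Call `lookup` until it returns a non-None value or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = lookup()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {timeout_s} seconds")
        # Sleep between attempts, as the loop in the hunk above does.
        time.sleep(interval_s)
```

In `ModelsClient`, the lookup would be something like `lambda: mlflow.get_experiment_by_name(experiment_name)`. The timeout value is a design choice that depends on how long worker-0 may take to start.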
