You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@submarine.apache.org by pi...@apache.org on 2021/08/27 05:11:49 UTC
[submarine] branch master updated: SUBMARINE-901. Tensorflow
distributed training example with model management API
This is an automated email from the ASF dual-hosted git repository.
pingsutw pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/submarine.git
The following commit(s) were added to refs/heads/master by this push:
new acdd01f SUBMARINE-901. Tensorflow distributed training example with model management API
acdd01f is described below
commit acdd01f4a037a438fa3bbb0aded9094f5d863505
Author: featherchen <ga...@gmail.com>
AuthorDate: Thu Aug 26 23:18:50 2021 +0800
SUBMARINE-901. Tensorflow distributed training example with model management API
### What is this PR for?
<!-- A few sentences describing the overall goals of the pull request's commits.
First time? Check out the contributing guide - https://submarine.apache.org/contribution/contributions.html
-->
Add tensorflow distributed example with submarine sdk
### What type of PR is it?
Improvement
### Todos
* [ ] - ParameterSever Strategy
### What is the Jira issue?
<!-- * Open an issue on Jira https://issues.apache.org/jira/browse/SUBMARINE/
* Put link here, and add [SUBMARINE-*Jira number*] in PR title, eg. `SUBMARINE-23. PR title`
-->
https://issues.apache.org/jira/browse/SUBMARINE-901?filter=-1
### How should this be tested?
<!--
* First time? Setup Travis CI as described on https://submarine.apache.org/contribution/contributions.html#continuous-integration
* Strongly recommended: add automated unit tests for any new or changed behavior
* Outline any manual steps to test the PR here.
-->
### Screenshots (if appropriate)
Mirrored
![Screenshot from 2021-08-24 01-24-13](https://user-images.githubusercontent.com/57944334/130491783-6e71a093-b843-4ab7-b95b-f4b5410937cf.png)
![Screenshot from 2021-08-24 01-23-26](https://user-images.githubusercontent.com/57944334/130491797-f95c1c28-9298-4aa4-8a6f-e130b0c36f9c.png)
MultiWorkerMirrored
![Screenshot from 2021-08-24 01-27-54](https://user-images.githubusercontent.com/57944334/130491857-edb2b78c-ea21-4837-9789-8c924a3ab5c2.png)
![Screenshot from 2021-08-24 01-27-35](https://user-images.githubusercontent.com/57944334/130491869-1876d312-26c4-44c5-90c1-8445d8932317.png)
### Questions:
* Do the license files need updating? No
* Are there breaking changes for older versions? No
* Does this need new documentation? No
Author: featherchen <ga...@gmail.com>
Signed-off-by: Kevin <pi...@apache.org>
Closes #700 from featherchen/SUBMARINE-901 and squashes the following commits:
73aeb38e [featherchen] SUBMARINE-901. delete redundant comments
6cbece13 [featherchen] SUBMARINE-901. fix coding style
419002a0 [featherchen] SUBMARINE-901. Add MultiWorker and ps example
2e3969d4 [featherchen] SUBMARINE-901. add multiworkerMirrored
8ce4f730 [featherchen] SUBMARINE-901. README
2f04d12a [featherchen] SUBMARINE-901. tf example
---
.../mnist-tensorflow/MirroredStrategy/Dockerfile | 25 +++++
.../mnist-tensorflow/MirroredStrategy/build.sh | 44 ++++++++
.../MirroredStrategy/mnist_keras_distributed.py | 111 +++++++++++++++++++++
.../mnist-tensorflow/MirroredStrategy/post.sh | 38 +++++++
.../mnist-tensorflow/MirroredStrategy/readme.md | 25 +++++
.../MultiWorkerMirroredStrategy/Dockerfile | 25 +++++
.../MultiWorkerMirroredStrategy/build.sh | 44 ++++++++
.../mnist_keras_distributed.py | 95 ++++++++++++++++++
.../MultiWorkerMirroredStrategy/post.sh | 38 +++++++
.../MultiWorkerMirroredStrategy/readme.md | 25 +++++
.../ParameterServerStrategy/Dockerfile | 25 +++++
.../ParameterServerStrategy/build.sh | 44 ++++++++
.../mnist_keras_distributed.py | 88 ++++++++++++++++
.../ParameterServerStrategy/post.sh | 42 ++++++++
.../ParameterServerStrategy/readme.md | 25 +++++
dev-support/examples/nn-pytorch/model.py | 2 +-
dev-support/examples/nn-pytorch/readme.md | 7 +-
.../pysubmarine/submarine/models/client.py | 10 +-
18 files changed, 710 insertions(+), 3 deletions(-)
diff --git a/dev-support/examples/mnist-tensorflow/MirroredStrategy/Dockerfile b/dev-support/examples/mnist-tensorflow/MirroredStrategy/Dockerfile
new file mode 100644
index 0000000..e0f3fd5
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MirroredStrategy/Dockerfile
@@ -0,0 +1,25 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+FROM python:3.7
+MAINTAINER Apache Software Foundation <de...@submarine.apache.org>
+
+ADD ./tmp/submarine-sdk /opt/
+RUN pip install /opt/pysubmarine
+RUN pip install tensorflow==2.3.0
+RUN pip install tensorboard
+RUN pip install tensorflow_datasets==2.1.0
+
+ADD ./mnist_keras_distributed.py /opt/
\ No newline at end of file
diff --git a/dev-support/examples/mnist-tensorflow/MirroredStrategy/build.sh b/dev-support/examples/mnist-tensorflow/MirroredStrategy/build.sh
new file mode 100755
index 0000000..ab72b5b
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MirroredStrategy/build.sh
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+set -euxo pipefail
+
+SUBMARINE_VERSION=0.6.0-SNAPSHOT
+SUBMARINE_IMAGE_NAME="mirrored:${SUBMARINE_VERSION}"
+
+if [ -L ${BASH_SOURCE-$0} ]; then
+ PWD=$(dirname $(readlink "${BASH_SOURCE-$0}"))
+else
+ PWD=$(dirname ${BASH_SOURCE-$0})
+fi
+export CURRENT_PATH=$(cd "${PWD}">/dev/null; pwd)
+export SUBMARINE_HOME=${CURRENT_PATH}/../../../..
+
+if [ -d "${CURRENT_PATH}/tmp" ] # if old tmp folder is still there, delete it.
+then
+ rm -rf "${CURRENT_PATH}/tmp"
+fi
+
+mkdir -p "${CURRENT_PATH}/tmp"
+cp -r "${SUBMARINE_HOME}/submarine-sdk" "${CURRENT_PATH}/tmp"
+
+# build image
+cd ${CURRENT_PATH}
+echo "Start building the ${SUBMARINE_IMAGE_NAME} docker image ..."
+docker build -t ${SUBMARINE_IMAGE_NAME} .
+
+# clean temp file
+rm -rf "${CURRENT_PATH}/tmp"
diff --git a/dev-support/examples/mnist-tensorflow/MirroredStrategy/mnist_keras_distributed.py b/dev-support/examples/mnist-tensorflow/MirroredStrategy/mnist_keras_distributed.py
new file mode 100644
index 0000000..7f4ae6a
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MirroredStrategy/mnist_keras_distributed.py
@@ -0,0 +1,111 @@
+"""
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+"""
+
+from submarine import ModelsClient
+import tensorflow_datasets as tfds
+import tensorflow as tf
+import os
+import tensorboard
+
+datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
+mnist_train, mnist_test = datasets['train'], datasets['test']
+
+strategy = tf.distribute.MirroredStrategy()
+
+print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
+
+# You can also do info.splits.total_num_examples to get the total
+# number of examples in the dataset.
+
+num_train_examples = info.splits['train'].num_examples
+num_test_examples = info.splits['test'].num_examples
+
+BUFFER_SIZE = 10000
+
+BATCH_SIZE_PER_REPLICA = 64
+BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
+
+def scale(image, label):
+ image = tf.cast(image, tf.float32)
+ image /= 255
+
+ return image, label
+
+train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
+eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)
+
+with strategy.scope():
+ model = tf.keras.Sequential([
+ tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
+ tf.keras.layers.MaxPooling2D(),
+ tf.keras.layers.Flatten(),
+ tf.keras.layers.Dense(64, activation='relu'),
+ tf.keras.layers.Dense(10)
+ ])
+
+ model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+ optimizer=tf.keras.optimizers.Adam(),
+ metrics=['accuracy'])
+
+# Define the checkpoint directory to store the checkpoints.
+checkpoint_dir = './training_checkpoints'
+# Define the name of the checkpoint files.
+checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
+
+# Define a function for decaying the learning rate.
+# You can define any decay function you need.
+def decay(epoch):
+ if epoch < 3:
+ return 1e-3
+ elif epoch >= 3 and epoch < 7:
+ return 1e-4
+ else:
+ return 1e-5
+
+# Define a callback for printing the learning rate at the end of each epoch.
+class PrintLR(tf.keras.callbacks.Callback):
+ def on_epoch_end(self, epoch, logs=None):
+ print('\nLearning rate for epoch {} is {}'.format(epoch + 1,
+ model.optimizer.lr.numpy()))
+ modelClient.log_metric("lr", model.optimizer.lr.numpy())
+
+# Put all the callbacks together.
+callbacks = [
+ tf.keras.callbacks.TensorBoard(log_dir='./logs'),
+ tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
+ save_weights_only=True),
+ tf.keras.callbacks.LearningRateScheduler(decay),
+ PrintLR()
+]
+
+if __name__ == "__main__":
+ modelClient = ModelsClient()
+ with modelClient.start() as run:
+ EPOCHS = 5
+ hist = model.fit(train_dataset, epochs=EPOCHS, callbacks=callbacks)
+ for i in range(EPOCHS):
+ modelClient.log_metric("val_loss", hist.history['loss'][i])
+ modelClient.log_metric("Val_accuracy", hist.history['accuracy'][i])
+ model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
+ eval_loss, eval_acc = model.evaluate(eval_dataset)
+ print('Eval loss: {}, Eval accuracy: {}'.format(eval_loss, eval_acc))
+ modelClient.log_param("loss", eval_loss)
+ modelClient.log_param("acc", eval_acc)
+
+"""Reference:
+https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy
+"""
\ No newline at end of file
diff --git a/dev-support/examples/mnist-tensorflow/MirroredStrategy/post.sh b/dev-support/examples/mnist-tensorflow/MirroredStrategy/post.sh
new file mode 100755
index 0000000..5926843
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MirroredStrategy/post.sh
@@ -0,0 +1,38 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+curl -X POST -H "Content-Type: application/json" -d '
+{
+ "meta": {
+ "name": "mirrored-example",
+ "namespace": "default",
+ "framework": "TensorFlow",
+ "cmd": "python /opt/mnist_keras_distributed.py",
+ "envVars": {
+ "ENV_1": "ENV1"
+ }
+ },
+ "environment": {
+ "image": "mirrored:0.6.0-SNAPSHOT"
+ },
+ "spec": {
+ "Worker": {
+ "replicas": 1,
+ "resources": "cpu=1,memory=512M"
+ }
+ }
+}
+' http://127.0.0.1:32080/api/v1/experiment
\ No newline at end of file
diff --git a/dev-support/examples/mnist-tensorflow/MirroredStrategy/readme.md b/dev-support/examples/mnist-tensorflow/MirroredStrategy/readme.md
new file mode 100644
index 0000000..96544a0
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MirroredStrategy/readme.md
@@ -0,0 +1,25 @@
+# TF MirroredStrategy Example
+
+## Usage
+
+This is an easy mnist example of how to train a distributed tensorflow model using MirroredStrategy and track the metric and paramater in submarine-sdk.
+
+## How to execute
+
+0. Set up (for a single terminal, only need to do this one time)
+
+```bash
+eval $(minikube -p minikube docker-env)
+```
+
+1. Build the docker image
+
+```bash
+./dev-support/examples/mnist-tensorflow/MirroredStrategy/build.sh
+```
+
+2. Submit a post request
+
+```bash
+./dev-support/examples/mnist-tensorflow/MirroredStrategy/post.sh
+```
diff --git a/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/Dockerfile b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/Dockerfile
new file mode 100644
index 0000000..e0f3fd5
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/Dockerfile
@@ -0,0 +1,25 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+FROM python:3.7
+MAINTAINER Apache Software Foundation <de...@submarine.apache.org>
+
+ADD ./tmp/submarine-sdk /opt/
+RUN pip install /opt/pysubmarine
+RUN pip install tensorflow==2.3.0
+RUN pip install tensorboard
+RUN pip install tensorflow_datasets==2.1.0
+
+ADD ./mnist_keras_distributed.py /opt/
\ No newline at end of file
diff --git a/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/build.sh b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/build.sh
new file mode 100755
index 0000000..a0fb21d
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/build.sh
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+set -euxo pipefail
+
+SUBMARINE_VERSION=0.6.0-SNAPSHOT
+SUBMARINE_IMAGE_NAME="multi-worker-mirrored:${SUBMARINE_VERSION}"
+
+if [ -L ${BASH_SOURCE-$0} ]; then
+ PWD=$(dirname $(readlink "${BASH_SOURCE-$0}"))
+else
+ PWD=$(dirname ${BASH_SOURCE-$0})
+fi
+export CURRENT_PATH=$(cd "${PWD}">/dev/null; pwd)
+export SUBMARINE_HOME=${CURRENT_PATH}/../../../..
+
+if [ -d "${CURRENT_PATH}/tmp" ] # if old tmp folder is still there, delete it.
+then
+ rm -rf "${CURRENT_PATH}/tmp"
+fi
+
+mkdir -p "${CURRENT_PATH}/tmp"
+cp -r "${SUBMARINE_HOME}/submarine-sdk" "${CURRENT_PATH}/tmp"
+
+# build image
+cd ${CURRENT_PATH}
+echo "Start building the ${SUBMARINE_IMAGE_NAME} docker image ..."
+docker build -t ${SUBMARINE_IMAGE_NAME} .
+
+# clean temp file
+rm -rf "${CURRENT_PATH}/tmp"
diff --git a/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/mnist_keras_distributed.py b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/mnist_keras_distributed.py
new file mode 100644
index 0000000..4b3c032
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/mnist_keras_distributed.py
@@ -0,0 +1,95 @@
+"""
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+"""
+from submarine import ModelsClient
+import json
+import os
+import sys
+import tensorflow as tf
+import numpy as np
+import tensorflow_datasets as tfds
+
+BUFFER_SIZE = 10000
+BATCH_SIZE = 32
+
+strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
+
+def make_datasets_unbatched():
+ #Scaling MNIST data from (0, 255] to (0., 1.]
+ def scale(image, label):
+ image = tf.cast(image, tf.float32)
+ image /= 255
+ return image, label
+
+ datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
+
+ return datasets['train'].map(scale, num_parallel_calls=tf.data.experimental.AUTOTUNE).cache().shuffle(BUFFER_SIZE)
+
+def build_and_compile_cnn_model():
+ model = tf.keras.Sequential([
+ tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
+ tf.keras.layers.MaxPooling2D(),
+ tf.keras.layers.Flatten(),
+ tf.keras.layers.Dense(64, activation='relu'),
+ tf.keras.layers.Dense(10)
+ ])
+ model.compile(
+ loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
+ optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
+ metrics=['accuracy'])
+ return model
+
+tf_config = json.loads(os.environ['TF_CONFIG'])
+NUM_WORKERS = len(tf_config['cluster']['worker'])
+
+#Here the batch size scales up by number of workers since
+#`tf.data.Dataset.batch` expects the global batch size. Previously we used 64,
+#and now this becomes 128.
+GLOBAL_BATCH_SIZE = 64 * NUM_WORKERS
+
+#Creation of dataset needs to be after MultiWorkerMirroredStrategy object
+#is instantiated.
+train_datasets = make_datasets_unbatched().batch(GLOBAL_BATCH_SIZE)
+
+#next three line is the key point to fix this problem
+options = tf.data.Options()
+options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA # AutoShardPolicy.OFF can work too.
+train_datasets_no_auto_shard = train_datasets.with_options(options)
+
+with strategy.scope():
+ #Model building/compiling need to be within `strategy.scope()`.
+ multi_worker_model = build_and_compile_cnn_model()
+
+#Keras' `model.fit()` trains the model with specified number of epochs and
+#number of steps per epoch. Note that the numbers here are for demonstration
+#purposes only and may not sufficiently produce a model with good quality.
+
+#attention: x=train_datasets_no_auto_shard , not x = train_datasets
+
+if __name__ == "__main__":
+ modelClient = ModelsClient()
+ with modelClient.start() as run:
+ EPOCHS = 5
+ hist = multi_worker_model.fit(x=train_datasets_no_auto_shard, epochs=EPOCHS, steps_per_epoch=5)
+ for i in range(EPOCHS):
+ modelClient.log_metric("val_loss", hist.history['loss'][i])
+ modelClient.log_metric("Val_accuracy", hist.history['accuracy'][i])
+
+
+"""Reference
+https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras
+https://reurl.cc/no9Zk8
+"""
diff --git a/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/post.sh b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/post.sh
new file mode 100755
index 0000000..901b4a3
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/post.sh
@@ -0,0 +1,38 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+curl -X POST -H "Content-Type: application/json" -d '
+{
+ "meta": {
+ "name": "multi-worker-mirrored-example",
+ "namespace": "default",
+ "framework": "TensorFlow",
+ "cmd": "python /opt/mnist_keras_distributed.py",
+ "envVars": {
+ "ENV_1": "ENV1"
+ }
+ },
+ "environment": {
+ "image": "multi-worker-mirrored:0.6.0-SNAPSHOT"
+ },
+ "spec": {
+ "Worker": {
+ "replicas": 4,
+ "resources": "cpu=1,memory=512M"
+ }
+ }
+}
+' http://127.0.0.1:32080/api/v1/experiment
\ No newline at end of file
diff --git a/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/readme.md b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/readme.md
new file mode 100644
index 0000000..171e07e
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/readme.md
@@ -0,0 +1,25 @@
+# TF MultiWorkerMirroredStrategy Example
+
+## Usage
+
+This is an easy mnist example of how to train a distributed tensorflow model using MultiWorkerMirroredStrategy and track the metric in submarine-sdk.
+
+## How to execute
+
+0. Set up (for a single terminal, only need to do this one time)
+
+```bash
+eval $(minikube -p minikube docker-env)
+```
+
+1. Build the docker image
+
+```bash
+./dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/build.sh
+```
+
+2. Submit a post request
+
+```bash
+./dev-support/examples/mnist-tensorflow/MultiWorkerMirroredStrategy/post.sh
+```
diff --git a/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/Dockerfile b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/Dockerfile
new file mode 100644
index 0000000..b7fb6df
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/Dockerfile
@@ -0,0 +1,25 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+FROM python:3.7
+MAINTAINER Apache Software Foundation <de...@submarine.apache.org>
+
+ADD ./tmp/submarine-sdk /opt/
+RUN pip install /opt/pysubmarine
+RUN pip install tf-nightly
+RUN pip install tensorboard
+RUN pip install tensorflow_datasets==2.1.0
+
+ADD ./mnist_keras_distributed.py /opt/
\ No newline at end of file
diff --git a/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/build.sh b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/build.sh
new file mode 100755
index 0000000..c4652ec
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/build.sh
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+set -euxo pipefail
+
+SUBMARINE_VERSION=0.6.0-SNAPSHOT
+SUBMARINE_IMAGE_NAME="parameter-server:${SUBMARINE_VERSION}"
+
+if [ -L ${BASH_SOURCE-$0} ]; then
+ PWD=$(dirname $(readlink "${BASH_SOURCE-$0}"))
+else
+ PWD=$(dirname ${BASH_SOURCE-$0})
+fi
+export CURRENT_PATH=$(cd "${PWD}">/dev/null; pwd)
+export SUBMARINE_HOME=${CURRENT_PATH}/../../../..
+
+if [ -d "${CURRENT_PATH}/tmp" ] # if old tmp folder is still there, delete it.
+then
+ rm -rf "${CURRENT_PATH}/tmp"
+fi
+
+mkdir -p "${CURRENT_PATH}/tmp"
+cp -r "${SUBMARINE_HOME}/submarine-sdk" "${CURRENT_PATH}/tmp"
+
+# build image
+cd ${CURRENT_PATH}
+echo "Start building the ${SUBMARINE_IMAGE_NAME} docker image ..."
+docker build -t ${SUBMARINE_IMAGE_NAME} .
+
+# clean temp file
+rm -rf "${CURRENT_PATH}/tmp"
diff --git a/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/mnist_keras_distributed.py b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/mnist_keras_distributed.py
new file mode 100644
index 0000000..3a9cbd8
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/mnist_keras_distributed.py
@@ -0,0 +1,88 @@
+"""
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+ http://www.apache.org/licenses/LICENSE-2.0
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+"""
+import os
+import random
+import tensorflow as tf
+import json
+from tensorflow.keras.layers.experimental import preprocessing
+import tensorflow_datasets as tfds
+import tensorboard
+
+print(tf.__version__)
+
+TF_CONFIG = os.environ.get('TF_CONFIG', '')
+NUM_PS = len(json.loads(TF_CONFIG)['cluster']['ps'])
+cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
+
+variable_partitioner = (
+ tf.distribute.experimental.partitioners.MinSizePartitioner(
+ min_shard_bytes=(256 << 10),
+ max_shards=NUM_PS))
+
+strategy = tf.distribute.experimental.ParameterServerStrategy(
+ cluster_resolver,
+ variable_partitioner=variable_partitioner)
+
+def dataset_fn(input_context):
+ global_batch_size = 64
+ batch_size = input_context.get_per_replica_batch_size(global_batch_size)
+
+ x = tf.random.uniform((10, 10))
+ y = tf.random.uniform((10,))
+
+ dataset = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(10).repeat()
+ dataset = dataset.shard(
+ input_context.num_input_pipelines,
+ input_context.input_pipeline_id)
+ dataset = dataset.batch(batch_size)
+ dataset = dataset.prefetch(2)
+
+ return dataset
+
+dc = tf.keras.utils.experimental.DatasetCreator(dataset_fn)
+
+with strategy.scope():
+ model = tf.keras.models.Sequential([tf.keras.layers.Dense(10)])
+
+model.compile(tf.keras.optimizers.SGD(), loss='mse', steps_per_execution=10)
+
+working_dir = '/tmp/my_working_dir'
+log_dir = os.path.join(working_dir, 'log')
+ckpt_filepath = os.path.join(working_dir, 'ckpt')
+backup_dir = os.path.join(working_dir, 'backup')
+
+callbacks = [
+ tf.keras.callbacks.TensorBoard(log_dir=log_dir),
+ tf.keras.callbacks.ModelCheckpoint(filepath=ckpt_filepath),
+ tf.keras.callbacks.experimental.BackupAndRestore(backup_dir=backup_dir),
+]
+
+model.fit(dc, epochs=5, steps_per_epoch=20, callbacks=callbacks)
+if __name__ == "__main__":
+ modelClient = ModelsClient()
+ with modelClient.start() as run:
+ EPOCHS = 5
+ hist = model.fit(dc, epochs=EPOCHS, steps_per_epoch=20, callbacks=callbacks)
+ for i in range(EPOCHS):
+ modelClient.log_metric("val_loss", hist.history['loss'][i])
+ modelClient.log_metric("Val_accuracy", hist.history['accuracy'][i])
+ model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
+
+"""
+Reference:
+https://www.tensorflow.org/tutorials/distribute/parameter_server_training
+"""
diff --git a/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/post.sh b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/post.sh
new file mode 100755
index 0000000..d6709f8
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/post.sh
@@ -0,0 +1,42 @@
+#!/usr/bin/env bash
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+curl -X POST -H "Content-Type: application/json" -d '
+{
+ "meta": {
+ "name": "parameter-server-example",
+ "namespace": "default",
+ "framework": "TensorFlow",
+ "cmd": "python /opt/mnist_keras_distributed.py",
+ "envVars": {
+ "ENV_1": "ENV1"
+ }
+ },
+ "environment": {
+ "image": "parameter-server:0.6.0-SNAPSHOT"
+ },
+ "spec": {
+ "Ps": {
+ "replicas": 1,
+ "resources": "cpu=1,memory=128M"
+ },
+ "Worker": {
+ "replicas": 3,
+ "resources": "cpu=1,memory=128M"
+ }
+ }
+}
+' http://127.0.0.1:32080/api/v1/experiment
\ No newline at end of file
diff --git a/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/readme.md b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/readme.md
new file mode 100644
index 0000000..e2ebccc
--- /dev/null
+++ b/dev-support/examples/mnist-tensorflow/ParameterServerStrategy/readme.md
@@ -0,0 +1,25 @@
+# TF ParameterServerStrategy Example (Beta)
+
+## Usage
+
+This is an easy mnist example of how to train a distributed tensorflow model using ParameterServerStrategy and track the metric and paramater in submarine-sdk.
+
+## How to execute
+
+0. Set up (for a single terminal, only need to do this one time)
+
+```bash
+eval $(minikube -p minikube docker-env)
+```
+
+1. Build the docker image
+
+```bash
+./dev-support/examples/mnist-tensorflow/ParameterServerStrategy/build.sh
+```
+
+2. Submit a post request
+
+```bash
+./dev-support/examples/mnist-tensorflow/ParameterServerStrategy/post.sh
+```
diff --git a/dev-support/examples/nn-pytorch/model.py b/dev-support/examples/nn-pytorch/model.py
index 732d85a..6080c4a 100644
--- a/dev-support/examples/nn-pytorch/model.py
+++ b/dev-support/examples/nn-pytorch/model.py
@@ -32,4 +32,4 @@ class LinearNNModel(torch.nn.Module):
if __name__ == "__main__":
client = ModelsClient()
net = LinearNNModel()
- client.log_model("simple-nn-model", net)
+ client.save_model(model_type = "pytorch", model = net, artifact_path="pytorch-nn-model", registered_model_name="simple-nn-model")
\ No newline at end of file
diff --git a/dev-support/examples/nn-pytorch/readme.md b/dev-support/examples/nn-pytorch/readme.md
index f6e7172..ea999fc 100644
--- a/dev-support/examples/nn-pytorch/readme.md
+++ b/dev-support/examples/nn-pytorch/readme.md
@@ -1,6 +1,7 @@
# Save_model Example
## Usage
+
This is an easy example of saving a pytorch linear model to model registry.
## How to execute
@@ -21,6 +22,7 @@ This is an easy example of saving a pytorch linear model to model registry.
1. Make sure the model is saved in the model registry (viewed on MLflow UI)
2. Call serve API to create serve resource
+
- Request
```
curl -X POST -H "Content-Type: application/json" -d '
@@ -46,6 +48,7 @@ This is an easy example of saving a pytorch linear model to model registry.
```
3. Send data to inference
+
- Request
```
curl -d '{"data":[[-1, -1]]}' -H 'Content-Type: application/json; format=pandas-split' -X POST http://127.0.0.1:32080/serve/simple-nn-model-1/invocations
@@ -54,7 +57,9 @@ This is an easy example of saving a pytorch linear model to model registry.
```
[{"0": -0.5663654804229736}]
```
+
4. Call serve API to delete serve resource
+
- Request
```
curl -X DELETE http://0.0.0.0:32080/api/v1/experiment/serve?modelName=simple-nn-model&modelVersion=1&namespace=default
@@ -62,4 +67,4 @@ This is an easy example of saving a pytorch linear model to model registry.
- Response
```
{"status":"OK","code":200,"success":true,"message":null,"result":{"url":"/serve/simple-nn-model-1"},"attributes":{}}
- ```
\ No newline at end of file
+ ```
diff --git a/submarine-sdk/pysubmarine/submarine/models/client.py b/submarine-sdk/pysubmarine/submarine/models/client.py
index 084778f..d258d48 100644
--- a/submarine-sdk/pysubmarine/submarine/models/client.py
+++ b/submarine-sdk/pysubmarine/submarine/models/client.py
@@ -15,6 +15,7 @@
under the License.
"""
import os
+import time
import mlflow
from mlflow.exceptions import MlflowException
@@ -109,7 +110,14 @@ class ModelsClient():
try:
experiment = mlflow.get_experiment_by_name(experiment_name)
if experiment is None: # if not found
- raise MlflowException("No valid experiment has been found")
+ run_name = get_worker_index()
+ if run_name == "worker-0":
+ raise MlflowException("No valid experiment has been found")
+ else:
+ while experiment is None:
+ time.sleep(1)
+ experiment = mlflow.get_experiment_by_name(
+ experiment_name)
return experiment.experiment_id # if found
except MlflowException:
experiment = mlflow.create_experiment(name=experiment_name)
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@submarine.apache.org
For additional commands, e-mail: dev-help@submarine.apache.org