Posted to dev@submarine.apache.org by pi...@apache.org on 2021/08/05 05:37:55 UTC
[submarine] branch master updated: SUBMARINE-928. [Quickstart] Rewrite quickstart guide
This is an automated email from the ASF dual-hosted git repository.
pingsutw pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/submarine.git
The following commit(s) were added to refs/heads/master by this push:
new a94494b SUBMARINE-928. [Quickstart] Rewrite quickstart guide
a94494b is described below
commit a94494bf4ba89d05b3e8680b3139736204781c35
Author: ByronHsu <by...@gmail.com>
AuthorDate: Thu Jul 29 20:54:32 2021 +0800
SUBMARINE-928. [Quickstart] Rewrite quickstart guide
### What is this PR for?
Write an example that walks users through the end-to-end usage of Submarine.
### What type of PR is it?
[Documentation]
### Todos
* [ ] - Task
### What is the Jira issue?
https://issues.apache.org/jira/browse/SUBMARINE-928
### How should this be tested?
### Screenshots (if appropriate)
### Questions:
* Do the license files need updating? No
* Are there breaking changes for older versions? No
* Does this need new documentation? No
Author: ByronHsu <by...@gmail.com>
Signed-off-by: Kevin <pi...@apache.org>
Closes #664 from ByronHsu/quickstart and squashes the following commits:
673b2ab0 [ByronHsu] fix conflict
dcad3bbc [ByronHsu] push quickstart to dockerhub
d8be6fcd [ByronHsu] add workbench connection and mlflow demo
b87b04b2 [ByronHsu] add example
bcb01d79 [ByronHsu] version 1
---
.github/workflows/deploy_docker_images.yml | 5 +
.../examples/quickstart/{post.sh => Dockerfile} | 31 +---
.../examples/quickstart/{post.sh => build.sh} | 49 ++---
dev-support/examples/quickstart/post.sh | 1 -
dev-support/examples/quickstart/train.py | 86 +++++++++
website/docs/assets/quickstart-mlflow-2.png | Bin 0 -> 267330 bytes
website/docs/assets/quickstart-mlflow.png | Bin 0 -> 309585 bytes
website/docs/assets/quickstart-submit-1.png | Bin 0 -> 245302 bytes
website/docs/assets/quickstart-submit-2.png | Bin 0 -> 244702 bytes
website/docs/assets/quickstart-submit-3.png | Bin 0 -> 251717 bytes
website/docs/assets/quickstart-submit-4.png | Bin 0 -> 332445 bytes
website/docs/assets/quickstart-worbench.png | Bin 0 -> 86036 bytes
website/docs/gettingStarted/notebook.md | 2 +-
website/docs/gettingStarted/quickstart.md | 203 +++++++++++++++++++++
website/docusaurus.config.js | 2 +-
website/sidebars.js | 6 +-
16 files changed, 334 insertions(+), 51 deletions(-)
diff --git a/.github/workflows/deploy_docker_images.yml b/.github/workflows/deploy_docker_images.yml
index 3afd32e..93b55d6 100644
--- a/.github/workflows/deploy_docker_images.yml
+++ b/.github/workflows/deploy_docker_images.yml
@@ -79,3 +79,8 @@ jobs:
         run: ./dev-support/docker-images/serve/build.sh
       - name: Push submarine-serve docker image
         run: docker push apache/submarine:serve-$SUBMARINE_VERSION
+
+      - name: Build submarine quickstart
+        run: ./dev-support/examples/quickstart/build.sh
+      - name: Push submarine quickstart docker image
+        run: docker push apache/submarine:quickstart-$SUBMARINE_VERSION
diff --git a/dev-support/examples/quickstart/post.sh b/dev-support/examples/quickstart/Dockerfile
similarity index 62%
copy from dev-support/examples/quickstart/post.sh
copy to dev-support/examples/quickstart/Dockerfile
index 39336bc..ee6d66d 100644
--- a/dev-support/examples/quickstart/post.sh
+++ b/dev-support/examples/quickstart/Dockerfile
@@ -1,4 +1,3 @@
-#!/usr/bin/env bash
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
@@ -14,26 +13,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.
+FROM continuumio/anaconda3
+MAINTAINER Apache Software Foundation <de...@submarine.apache.org>
-curl -X POST -H "Content-Type: application/json" -d '
-{
- "meta": {
- "name": "quickstart",
- "namespace": "default",
- "framework": "TensorFlow",
- "cmd": "python /opt/train.py",
- "envVars": {
- "ENV_1": "ENV1"
- }
- },
- "environment": {
- "image": "quickstart:0.6.0-SNAPSHOT"
- },
- "spec": {
- "Worker": {
- "replicas": 3,
- "resources": "cpu=1,memory=1024M"
- }
- }
-}
-' http://127.0.0.1:32080/api/v1/experiment
\ No newline at end of file
+ADD ./tmp/submarine-sdk /opt/
+# install submarine-sdk locally
+RUN pip install /opt/pysubmarine/.[tf-latest]
+RUN pip install tensorflow_datasets
+
+ADD ./train.py /opt/
\ No newline at end of file
diff --git a/dev-support/examples/quickstart/post.sh b/dev-support/examples/quickstart/build.sh
old mode 100644
new mode 100755
similarity index 51%
copy from dev-support/examples/quickstart/post.sh
copy to dev-support/examples/quickstart/build.sh
index 39336bc..6865c39
--- a/dev-support/examples/quickstart/post.sh
+++ b/dev-support/examples/quickstart/build.sh
@@ -14,26 +14,31 @@
# See the License for the specific language governing permissions and
# limitations under the License.
+set -euxo pipefail
-curl -X POST -H "Content-Type: application/json" -d '
-{
- "meta": {
- "name": "quickstart",
- "namespace": "default",
- "framework": "TensorFlow",
- "cmd": "python /opt/train.py",
- "envVars": {
- "ENV_1": "ENV1"
- }
- },
- "environment": {
- "image": "quickstart:0.6.0-SNAPSHOT"
- },
- "spec": {
- "Worker": {
- "replicas": 3,
- "resources": "cpu=1,memory=1024M"
- }
- }
-}
-' http://127.0.0.1:32080/api/v1/experiment
\ No newline at end of file
+SUBMARINE_VERSION=0.6.0-SNAPSHOT
+SUBMARINE_IMAGE_NAME="apache/submarine:quickstart-${SUBMARINE_VERSION}"
+
+if [ -L ${BASH_SOURCE-$0} ]; then
+  PWD=$(dirname $(readlink "${BASH_SOURCE-$0}"))
+else
+  PWD=$(dirname ${BASH_SOURCE-$0})
+fi
+export CURRENT_PATH=$(cd "${PWD}">/dev/null; pwd)
+export SUBMARINE_HOME=${CURRENT_PATH}/../../..
+
+if [ -d "${CURRENT_PATH}/tmp" ] # if old tmp folder is still there, delete it.
+then
+  rm -rf "${CURRENT_PATH}/tmp"
+fi
+
+mkdir -p "${CURRENT_PATH}/tmp"
+cp -r "${SUBMARINE_HOME}/submarine-sdk" "${CURRENT_PATH}/tmp"
+
+# build image
+cd ${CURRENT_PATH}
+echo "Start building the ${SUBMARINE_IMAGE_NAME} docker image ..."
+docker build -t ${SUBMARINE_IMAGE_NAME} .
+
+# clean temp file
+rm -rf "${CURRENT_PATH}/tmp"
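Reviewer note on `build.sh`: the script stages `submarine-sdk` into a `tmp` folder inside the build context, builds the image, then removes the folder. Because it runs under `set -e`, the final `rm -rf` is skipped when `docker build` fails (the next run removes the stale folder). A minimal Python sketch of the same staging pattern with guaranteed cleanup follows; `staged_copy` and its parameters are hypothetical names used only for illustration, not part of the commit.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def staged_copy(source_dir: Path, build_context: Path, staging_name: str = "tmp"):
    """Copy source_dir into build_context/staging_name and remove it on exit.

    Hypothetical helper: it mirrors the tmp-folder handling in build.sh.
    """
    staging_dir = build_context / staging_name
    # Like build.sh, drop any stale staging folder left by an earlier run.
    shutil.rmtree(staging_dir, ignore_errors=True)
    staging_dir.mkdir(parents=True)
    try:
        shutil.copytree(source_dir, staging_dir / source_dir.name)
        yield staging_dir
    finally:
        # Runs even if the body raises, so a failed build leaves nothing behind.
        shutil.rmtree(staging_dir, ignore_errors=True)
```

Inside the `with staged_copy(...)` body one would run the build, for example `subprocess.run(["docker", "build", "-t", image_name, "."], cwd=build_context, check=True)`.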
diff --git a/dev-support/examples/quickstart/post.sh b/dev-support/examples/quickstart/post.sh
old mode 100644
new mode 100755
index 39336bc..8c23c52
--- a/dev-support/examples/quickstart/post.sh
+++ b/dev-support/examples/quickstart/post.sh
@@ -14,7 +14,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-
curl -X POST -H "Content-Type: application/json" -d '
{
"meta": {
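Reviewer note on `post.sh`: the script posts an experiment spec to the REST endpoint with `curl`; the full payload is visible in the removed lines of the copied files earlier in this diff. The sketch below builds the same request with the Python standard library. The field values are copied from that payload; the function names `build_experiment_spec` and `build_request` are hypothetical, and nothing here is confirmed beyond what the curl command itself shows.

```python
import json
import urllib.request


def build_experiment_spec(name="quickstart", image="quickstart:0.6.0-SNAPSHOT", workers=3):
    """Return the experiment spec that post.sh sends (values copied from the diff)."""
    return {
        "meta": {
            "name": name,
            "namespace": "default",
            "framework": "TensorFlow",
            "cmd": "python /opt/train.py",
            "envVars": {"ENV_1": "ENV1"},
        },
        "environment": {"image": image},
        "spec": {
            "Worker": {"replicas": workers, "resources": "cpu=1,memory=1024M"},
        },
    }


def build_request(spec, url="http://127.0.0.1:32080/api/v1/experiment"):
    """Wrap the spec in a JSON POST request; the URL is the one used by post.sh."""
    body = json.dumps(spec).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it would be `urllib.request.urlopen(build_request(build_experiment_spec()))`, which only succeeds while the port-forward to the Submarine server is active.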
diff --git a/dev-support/examples/quickstart/train.py b/dev-support/examples/quickstart/train.py
new file mode 100644
index 0000000..e33de68
--- /dev/null
+++ b/dev-support/examples/quickstart/train.py
@@ -0,0 +1,86 @@
+# Copyright 2020 The Kubeflow Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+"""
+An example of multi-worker training with Keras model using Strategy API.
+https://github.com/kubeflow/tf-operator/blob/master/examples/v1/distribution_strategy/keras-API/multi_worker_strategy-with-keras.py
+"""
+import tensorflow_datasets as tfds
+import tensorflow as tf
+from tensorflow.keras import layers, models
+from submarine import ModelsClient
+
+def make_datasets_unbatched():
+    BUFFER_SIZE = 10000
+
+    # Scaling MNIST data from (0, 255] to (0., 1.]
+    def scale(image, label):
+        image = tf.cast(image, tf.float32)
+        image /= 255
+        return image, label
+
+    datasets, _ = tfds.load(name='mnist', with_info=True, as_supervised=True)
+
+    return datasets['train'].map(scale).cache().shuffle(BUFFER_SIZE)
+
+
+def build_and_compile_cnn_model():
+    model = models.Sequential()
+    model.add(
+        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
+    model.add(layers.MaxPooling2D((2, 2)))
+    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
+    model.add(layers.MaxPooling2D((2, 2)))
+    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
+    model.add(layers.Flatten())
+    model.add(layers.Dense(64, activation='relu'))
+    model.add(layers.Dense(10, activation='softmax'))
+
+    model.summary()
+
+    model.compile(optimizer='adam',
+                  loss='sparse_categorical_crossentropy',
+                  metrics=['accuracy'])
+
+    return model
+
+def main():
+    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
+        communication=tf.distribute.experimental.CollectiveCommunication.AUTO)
+
+    BATCH_SIZE_PER_REPLICA = 4
+    BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
+
+    with strategy.scope():
+        ds_train = make_datasets_unbatched().batch(BATCH_SIZE).repeat()
+        options = tf.data.Options()
+        options.experimental_distribute.auto_shard_policy = \
+            tf.data.experimental.AutoShardPolicy.DATA
+        ds_train = ds_train.with_options(options)
+        # Model building/compiling need to be within `strategy.scope()`.
+        multi_worker_model = build_and_compile_cnn_model()
+
+    class MyCallback(tf.keras.callbacks.Callback):
+        def on_epoch_end(self, epoch, logs=None):
+            # monitor the loss and accuracy
+            print(logs)
+            modelClient.log_metrics({"loss": logs["loss"], "accuracy": logs["accuracy"]}, epoch)
+
+    with modelClient.start() as run:
+        multi_worker_model.fit(ds_train, epochs=10, steps_per_epoch=70, callbacks=[MyCallback()])
+
+
+if __name__ == '__main__':
+    modelClient = ModelsClient()
+    main()
\ No newline at end of file
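Reviewer note on `train.py`: the global batch size is `BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync`, and `fit` runs 70 steps per epoch for 10 epochs. The sketch below works through that arithmetic. It assumes one replica per worker (for example CPU-only workers), so 3 workers give 3 replicas; `training_plan` is a hypothetical helper, not part of the commit.

```python
def training_plan(num_replicas, batch_per_replica=4, steps_per_epoch=70, epochs=10):
    """Work out how many examples the train.py settings consume.

    Defaults are the constants from train.py; num_replicas stands in for
    strategy.num_replicas_in_sync (assumed equal to the number of workers).
    """
    global_batch = batch_per_replica * num_replicas
    examples_per_epoch = global_batch * steps_per_epoch
    return {
        "global_batch": global_batch,
        "examples_per_epoch": examples_per_epoch,
        "total_examples": examples_per_epoch * epochs,
    }


plan = training_plan(num_replicas=3)
print(plan)  # {'global_batch': 12, 'examples_per_epoch': 840, 'total_examples': 8400}
```

Under that assumption the run sees 8,400 examples in total, well below the 60,000 images in the MNIST training split, which keeps the quickstart short.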
diff --git a/website/docs/assets/quickstart-mlflow-2.png b/website/docs/assets/quickstart-mlflow-2.png
new file mode 100644
index 0000000..6430164
Binary files /dev/null and b/website/docs/assets/quickstart-mlflow-2.png differ
diff --git a/website/docs/assets/quickstart-mlflow.png b/website/docs/assets/quickstart-mlflow.png
new file mode 100644
index 0000000..7600663
Binary files /dev/null and b/website/docs/assets/quickstart-mlflow.png differ
diff --git a/website/docs/assets/quickstart-submit-1.png b/website/docs/assets/quickstart-submit-1.png
new file mode 100644
index 0000000..a5d095f
Binary files /dev/null and b/website/docs/assets/quickstart-submit-1.png differ
diff --git a/website/docs/assets/quickstart-submit-2.png b/website/docs/assets/quickstart-submit-2.png
new file mode 100644
index 0000000..cc368d6
Binary files /dev/null and b/website/docs/assets/quickstart-submit-2.png differ
diff --git a/website/docs/assets/quickstart-submit-3.png b/website/docs/assets/quickstart-submit-3.png
new file mode 100644
index 0000000..0ca1daa
Binary files /dev/null and b/website/docs/assets/quickstart-submit-3.png differ
diff --git a/website/docs/assets/quickstart-submit-4.png b/website/docs/assets/quickstart-submit-4.png
new file mode 100644
index 0000000..ad7c60e
Binary files /dev/null and b/website/docs/assets/quickstart-submit-4.png differ
diff --git a/website/docs/assets/quickstart-worbench.png b/website/docs/assets/quickstart-worbench.png
new file mode 100644
index 0000000..a9ca304
Binary files /dev/null and b/website/docs/assets/quickstart-worbench.png differ
diff --git a/website/docs/gettingStarted/notebook.md b/website/docs/gettingStarted/notebook.md
index 532f5bc..1b58b59 100644
--- a/website/docs/gettingStarted/notebook.md
+++ b/website/docs/gettingStarted/notebook.md
@@ -1,5 +1,5 @@
---
-title: Notebook Tutorial
+title: Jupyter Notebook
---
<!--
diff --git a/website/docs/gettingStarted/quickstart.md b/website/docs/gettingStarted/quickstart.md
new file mode 100644
index 0000000..0de4a93
--- /dev/null
+++ b/website/docs/gettingStarted/quickstart.md
@@ -0,0 +1,203 @@
+---
+title: Quickstart
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+This document gives a quick overview of the basic usage of the Submarine platform. You can complete each step of the ML model lifecycle on the platform without wrestling with environment setup problems.
+
+## Installation
+
+### Prepare a Kubernetes cluster
+
+1. Prerequisites
+
+- Check the [dependency page](https://github.com/apache/submarine/blob/master/website/docs/devDocs/Dependencies.md) for compatible versions
+- [kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/)
+- [helm](https://helm.sh/docs/intro/install/) (Helm v3 is the minimum requirement.)
+- [minikube](https://minikube.sigs.k8s.io/docs/start/)
+
+2. Start minikube cluster
+```
+$ minikube start --vm-driver=docker --cpus 8 --memory 4096 --kubernetes-version v1.15.11
+```
+
+### Launch submarine in the cluster
+
+1. Clone the project
+```
+$ git clone https://github.com/apache/submarine.git
+```
+
+2. Install the resources with the Helm chart
+```
+$ cd submarine
+$ helm install submarine ./helm-charts/submarine
+```
+### Ensure submarine is ready
+
+1. Use kubectl to query the status of pods
+```
+$ kubectl get pods
+```
+
+2. Make sure each pod is `Running`
+```
+NAME                                              READY   STATUS    RESTARTS   AGE
+notebook-controller-deployment-5d4f5f874c-vwds8   1/1     Running   0          3h33m
+pytorch-operator-844c866d54-q5ztd                 1/1     Running   0          3h33m
+submarine-database-674987ff7d-r8zqs               1/1     Running   0          3h33m
+submarine-minio-5fdd957785-xd987                  1/1     Running   0          3h33m
+submarine-mlflow-76bbf5c7b-g2ntd                  1/1     Running   0          3h33m
+submarine-server-66f7b8658b-sfmv8                 1/1     Running   0          3h33m
+submarine-tensorboard-6c44944dfb-tvbr9            1/1     Running   0          3h33m
+submarine-traefik-7cbcfd4bd9-4bczn                1/1     Running   0          3h33m
+tf-job-operator-6bb69fd44-mc8ww                   1/1     Running   0          3h33m
+```
+
+### Connect to workbench
+
+1. Port-forwarding
+
+```
+# using port-forwarding
+$ kubectl port-forward --address 0.0.0.0 service/submarine-traefik 32080:80
+```
+
+2. Open `http://127.0.0.1:32080` in a browser
+
+![](../assets/quickstart-worbench.png)
+
+## Example: Submit a distributed MNIST example
+
+The code for this example is [here](https://github.com/apache/submarine/tree/master/dev-support/examples/quickstart). `train.py` is the training script, and `build.sh` is the script that builds the Docker image.
+
+### 1. Write a Python script for distributed training
+
+Take a simple MNIST TensorFlow script as an example. We use `MultiWorkerMirroredStrategy` as the distribution strategy.
+
+```python
+"""
+./dev-support/examples/quickstart/train.py
+Reference: https://github.com/kubeflow/tf-operator/blob/master/examples/v1/distribution_strategy/keras-API/multi_worker_strategy-with-keras.py
+"""
+
+import tensorflow_datasets as tfds
+import tensorflow as tf
+from tensorflow.keras import layers, models
+from submarine import ModelsClient
+
+def make_datasets_unbatched():
+    BUFFER_SIZE = 10000
+
+    # Scaling MNIST data from (0, 255] to (0., 1.]
+    def scale(image, label):
+        image = tf.cast(image, tf.float32)
+        image /= 255
+        return image, label
+
+    datasets, _ = tfds.load(name='mnist', with_info=True, as_supervised=True)
+
+    return datasets['train'].map(scale).cache().shuffle(BUFFER_SIZE)
+
+
+def build_and_compile_cnn_model():
+    model = models.Sequential()
+    model.add(
+        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
+    model.add(layers.MaxPooling2D((2, 2)))
+    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
+    model.add(layers.MaxPooling2D((2, 2)))
+    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
+    model.add(layers.Flatten())
+    model.add(layers.Dense(64, activation='relu'))
+    model.add(layers.Dense(10, activation='softmax'))
+
+    model.summary()
+
+    model.compile(optimizer='adam',
+                  loss='sparse_categorical_crossentropy',
+                  metrics=['accuracy'])
+
+    return model
+
+def main():
+    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
+        communication=tf.distribute.experimental.CollectiveCommunication.AUTO)
+
+    BATCH_SIZE_PER_REPLICA = 4
+    BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
+
+    with strategy.scope():
+        ds_train = make_datasets_unbatched().batch(BATCH_SIZE).repeat()
+        options = tf.data.Options()
+        options.experimental_distribute.auto_shard_policy = \
+            tf.data.experimental.AutoShardPolicy.DATA
+        ds_train = ds_train.with_options(options)
+        # Model building/compiling need to be within `strategy.scope()`.
+        multi_worker_model = build_and_compile_cnn_model()
+
+    class MyCallback(tf.keras.callbacks.Callback):
+        def on_epoch_end(self, epoch, logs=None):
+            # monitor the loss and accuracy
+            print(logs)
+            modelClient.log_metrics({"loss": logs["loss"], "accuracy": logs["accuracy"]}, epoch)
+
+    with modelClient.start() as run:
+        multi_worker_model.fit(ds_train, epochs=10, steps_per_epoch=70, callbacks=[MyCallback()])
+
+
+if __name__ == '__main__':
+    modelClient = ModelsClient()
+    main()
+```
+
+### 2. Prepare an environment compatible with the training
+Build a Docker image that contains the dependencies the training script needs.
+
+```bash
+$ ./dev-support/examples/quickstart/build.sh
+```
+
+### 3. Submit the experiment
+
+1. Open the Submarine workbench and click `+ New Experiment`
+2. Fill in the form accordingly. Here we set 3 workers.
+
+    1. Step 1
+    ![](../assets/quickstart-submit-1.png)
+    2. Step 2
+    ![](../assets/quickstart-submit-2.png)
+    3. Step 3
+    ![](../assets/quickstart-submit-3.png)
+    4. The experiment is successfully submitted
+    ![](../assets/quickstart-submit-4.png)
+
+### 4. Monitor the process (modelClient)
+
+1. In our code, we use a `ModelsClient` instance (`modelClient`) from `submarine-sdk` to record the metrics. To see the result, click `MLflow UI` in the workbench.
+2. To compare the metrics of each worker, select all workers and then click `compare`.
+
+ ![](../assets/quickstart-mlflow.png)
+
+ ![](../assets/quickstart-mlflow-2.png)
+
+
+### 5. Serve the model (In development)
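Reviewer note on the quickstart's "Ensure submarine is ready" step: the guide asks the reader to confirm by eye that every pod is `Running`. A minimal sketch of the same check in Python is given here, assuming the default `kubectl get pods` table layout shown in the guide (whitespace-separated `NAME`, `READY`, `STATUS` columns); `pods_not_ready` is a hypothetical helper, not part of Submarine.

```python
def pods_not_ready(kubectl_output):
    """Return names of pods that are not Running or not fully ready.

    Expects the plain-text table printed by `kubectl get pods`.
    """
    rows = [line.split() for line in kubectl_output.strip().splitlines() if line.strip()]
    if not rows:
        return []
    header, pods = rows[0], rows[1:]
    name_col = header.index("NAME")
    ready_col = header.index("READY")
    status_col = header.index("STATUS")

    not_ready = []
    for fields in pods:
        # READY is printed as "<ready containers>/<total containers>".
        ready, total = fields[ready_col].split("/")
        if fields[status_col] != "Running" or ready != total:
            not_ready.append(fields[name_col])
    return not_ready
```

The input can come from `subprocess.run(["kubectl", "get", "pods"], capture_output=True, text=True, check=True).stdout`; an empty result means every listed pod is ready.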
diff --git a/website/docusaurus.config.js b/website/docusaurus.config.js
index bf482b3..23b1839 100644
--- a/website/docusaurus.config.js
+++ b/website/docusaurus.config.js
@@ -37,7 +37,7 @@ module.exports = {
items: [
{
type: 'doc',
- docId: 'gettingStarted/localDeployment',
+ docId: 'gettingStarted/quickstart',
label: 'Docs',
position: 'left',
},
diff --git a/website/sidebars.js b/website/sidebars.js
index 80fd24d..b10ec3a 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -22,10 +22,10 @@ module.exports = {
{
"Introduction": [],
"Getting Started": [
- "gettingStarted/localDeployment",
- "gettingStarted/kind",
+ "gettingStarted/quickstart",
+ // "gettingStarted/localDeployment",
"gettingStarted/notebook",
- "gettingStarted/python-sdk",
+ // "gettingStarted/python-sdk",
],
"User Docs": [
{