Posted to reviews@yunikorn.apache.org by ww...@apache.org on 2021/12/04 08:21:31 UTC

[incubator-yunikorn-site] branch master updated: [YUNIKORN-953] Document running kubeflow/training-operator with yunikorn (#96)

This is an automated email from the ASF dual-hosted git repository.

wwei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-yunikorn-site.git


The following commit(s) were added to refs/heads/master by this push:
     new 1f1be06  [YUNIKORN-953] Document running kubeflow/training-operator with yunikorn (#96)
1f1be06 is described below

commit 1f1be0658f8bcd8f092a0ef682cac14fd687c395
Author: Wen-Chih (Ryan) Lo <lo...@gmail.com>
AuthorDate: Sat Dec 4 16:21:27 2021 +0800

    [YUNIKORN-953] Document running kubeflow/training-operator with yunikorn (#96)
---
 docs/assets/tf-job-on-ui.png                | Bin 0 -> 327800 bytes
 docs/user_guide/workloads/run_tensorflow.md |  73 ++++++++++++++++++++++++----
 2 files changed, 63 insertions(+), 10 deletions(-)

diff --git a/docs/assets/tf-job-on-ui.png b/docs/assets/tf-job-on-ui.png
new file mode 100644
index 0000000..06acabe
Binary files /dev/null and b/docs/assets/tf-job-on-ui.png differ
diff --git a/docs/user_guide/workloads/run_tensorflow.md b/docs/user_guide/workloads/run_tensorflow.md
index 393e330..5448990 100644
--- a/docs/user_guide/workloads/run_tensorflow.md
+++ b/docs/user_guide/workloads/run_tensorflow.md
@@ -1,6 +1,7 @@
 ---
 id: run_tf
-title: Run Tensorflow Jobs
+title: Run TensorFlow Jobs
+description: How to run TensorFlow jobs with YuniKorn
 keywords:
  - tensorflow
 ---
@@ -24,17 +25,69 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-Here is an example for Tensorflow job. You must install tf-operator first. 
-You can install tf-operator by applying all yaml from two website down below:
-* CRD: https://github.com/kubeflow/manifests/tree/master/tf-training/tf-job-crds/base
-* Deployment: https://github.com/kubeflow/manifests/tree/master/tf-training/tf-job-operator/base
-Also you can install kubeflow which can auto install tf-operator for you, URL: https://www.kubeflow.org/docs/started/getting-started/
+This guide gives an overview of how to set up [training-operator](https://github.com/kubeflow/training-operator)
+and how to run a TensorFlow job with the YuniKorn scheduler. The training-operator is a unified training operator maintained by
+Kubeflow. It supports not only TensorFlow but also PyTorch, XGBoost, etc.
 
-A simple Tensorflow job example:
+## Install training-operator
+The following command installs training-operator in the kubeflow namespace by default. If you have problems with the installation,
+please refer to [this doc](https://github.com/kubeflow/training-operator#installation) for details.
+```
+kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.3.0"
+```
+
+## Prepare the Docker image
+Before you start running a TensorFlow job on Kubernetes, you'll need to build the Docker image.
+1. Download the files from [deployments/examples/tfjob](https://github.com/apache/incubator-yunikorn-k8shim/tree/master/deployments/examples/tfjob)
+2. Build the Docker image with the following command:
+
+```
+docker build -f Dockerfile -t kubeflow/tf-dist-mnist-test:1.0 .
+```
 
-You need to [build the image](https://github.com/kubeflow/tf-operator/tree/master/examples/v1/dist-mnist) which used in example yaml.
+## Run a TensorFlow job
+Here is a TFJob YAML for the MNIST [example](https://github.com/apache/incubator-yunikorn-k8shim/blob/master/deployments/examples/tfjob/tf-job-mnist.yaml).
+
+```yaml
+apiVersion: kubeflow.org/v1
+kind: TFJob
+metadata:
+  name: dist-mnist-for-e2e-test
+  namespace: kubeflow
+spec:
+  tfReplicaSpecs:
+    PS:
+      replicas: 2
+      restartPolicy: Never
+      template:
+        metadata:
+          labels:
+            applicationId: "tf_job_20200521_001"
+            queue: root.sandbox
+        spec:
+          schedulerName: yunikorn
+          containers:
+            - name: tensorflow
+              image: kubeflow/tf-dist-mnist-test:1.0
+    Worker:
+      replicas: 4
+      restartPolicy: Never
+      template:
+        metadata:
+          labels:
+            applicationId: "tf_job_20200521_001"
+            queue: root.sandbox
+        spec:
+          schedulerName: yunikorn
+          containers:
+            - name: tensorflow
+              image: kubeflow/tf-dist-mnist-test:1.0
+```
+Create the TFJob:
 ```
-kubectl create -f examples/tfjob/tf-job-mnist.yaml
+kubectl create -f deployments/examples/tfjob/tf-job-mnist.yaml
 ```
+You can view the job info in the YuniKorn UI. If you do not know how to access the YuniKorn UI,
+please read the document [here](../../get_started/get_started.md#access-the-web-ui).
 
-The file for this example can be found in the [README Tensorflow job](https://github.com/apache/incubator-yunikorn-k8shim/tree/master/deployments/examples#Tensorflow-job) section.
+![tf-job-on-ui](../../assets/tf-job-on-ui.png)
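
Reviewer note (not part of the commit): in the TFJob manifest added by this diff, both the PS and Worker templates set `schedulerName: yunikorn` and carry the same `applicationId` and `queue` labels. The sketch below is one possible way to check that consistency before creating the job. It assumes the manifest has already been loaded into a Python dict; `check_tfjob` and `replica_template` are hypothetical helper names, and the rule that all replica types should share one `applicationId` and `queue` is inferred from the example manifest rather than stated in the commit.

```python
# Sketch only: check_tfjob and replica_template are hypothetical helpers,
# not part of YuniKorn or the Kubeflow training-operator.

def replica_template(app_id, queue, scheduler="yunikorn"):
    """Build the parts of a replica spec that matter for scheduling.

    Containers are omitted because the check below does not inspect them.
    """
    return {
        "template": {
            "metadata": {"labels": {"applicationId": app_id, "queue": queue}},
            "spec": {"schedulerName": scheduler},
        }
    }


def check_tfjob(job):
    """Return a list of problems found in the TFJob's replica templates."""
    problems = []
    app_ids = set()
    queues = set()
    for role, spec in job["spec"]["tfReplicaSpecs"].items():
        template = spec["template"]
        labels = template.get("metadata", {}).get("labels", {})
        scheduler = template.get("spec", {}).get("schedulerName")
        if scheduler != "yunikorn":
            problems.append(f"{role}: schedulerName is {scheduler!r}, expected 'yunikorn'")
        if "applicationId" in labels:
            app_ids.add(labels["applicationId"])
        else:
            problems.append(f"{role}: missing applicationId label")
        if "queue" in labels:
            queues.add(labels["queue"])
    # Assumption drawn from the example manifest: one job, one ID, one queue.
    if len(app_ids) > 1:
        problems.append(f"replica types use different applicationId values: {sorted(app_ids)}")
    if len(queues) > 1:
        problems.append(f"replica types use different queues: {sorted(queues)}")
    return problems


# Values copied from the manifest in the diff above.
job = {
    "spec": {
        "tfReplicaSpecs": {
            "PS": replica_template("tf_job_20200521_001", "root.sandbox"),
            "Worker": replica_template("tf_job_20200521_001", "root.sandbox"),
        }
    }
}

for problem in check_tfjob(job):
    print(problem)
```

An empty result means every replica type requests the YuniKorn scheduler and agrees on the labels. Loading the real YAML file would need a YAML parser, which is left out here to keep the sketch dependency-free.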