Posted to reviews@yunikorn.apache.org by wi...@apache.org on 2022/10/17 17:16:56 UTC
[yunikorn-site] branch master updated: [YUNIKORN-1339] Adding time slicing GPU to Tensorflow example (#195)
This is an automated email from the ASF dual-hosted git repository.
wilfreds pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-site.git
The following commit(s) were added to refs/heads/master by this push:
new 6ef97888c [YUNIKORN-1339] Adding time slicing GPU to Tensorflow example (#195)
6ef97888c is described below
commit 6ef97888c4ffd832b0535edef1d8682c2ad6d3e4
Author: xuanzongwu <t3...@gmail.com>
AuthorDate: Mon Oct 17 10:16:30 2022 -0700
[YUNIKORN-1339] Adding time slicing GPU to Tensorflow example (#195)
Closes: #195
Signed-off-by: Wilfred Spiegelenburg <wi...@apache.org>
---
docs/assets/tf-job-gpu-on-logs.png | Bin 0 -> 221691 bytes
docs/assets/tf-job-gpu-on-ui.png | Bin 0 -> 101211 bytes
docs/user_guide/workloads/run_tensorflow.md | 139 ++++++++++++++++++++++++++++
3 files changed, 139 insertions(+)
diff --git a/docs/assets/tf-job-gpu-on-logs.png b/docs/assets/tf-job-gpu-on-logs.png
new file mode 100644
index 000000000..db2a6b693
Binary files /dev/null and b/docs/assets/tf-job-gpu-on-logs.png differ
diff --git a/docs/assets/tf-job-gpu-on-ui.png b/docs/assets/tf-job-gpu-on-ui.png
new file mode 100644
index 000000000..b599dca7b
Binary files /dev/null and b/docs/assets/tf-job-gpu-on-ui.png differ
diff --git a/docs/user_guide/workloads/run_tensorflow.md b/docs/user_guide/workloads/run_tensorflow.md
index 367ac6e6c..152068bd1 100644
--- a/docs/user_guide/workloads/run_tensorflow.md
+++ b/docs/user_guide/workloads/run_tensorflow.md
@@ -91,3 +91,142 @@ You can view the job info from YuniKorn UI. If you do not know how to access the
please read the document [here](../../get_started/get_started.md#access-the-web-ui).
![tf-job-on-ui](../../assets/tf-job-on-ui.png)
+
+## Using Time-Slicing GPU
+
+### Prerequisite
+To use Time-Slicing GPU, your cluster must be configured to use GPUs with time-slicing enabled.
+- Nodes must have GPUs attached.
+- Kubernetes version 1.24.
+- GPU drivers must be installed on the cluster.
+- Use the [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html) to automatically set up and manage the NVIDIA software components on the worker nodes.
+- Configure [Time-Slicing GPUs in Kubernetes](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/gpu-sharing.html).
+
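+The time-slicing configuration referenced above is applied through a ConfigMap that the GPU Operator passes to the NVIDIA device plugin. A minimal sketch following the NVIDIA documentation (the ConfigMap name, config key and replica count below are illustrative, not fixed values):
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: time-slicing-config   # illustrative name
+  namespace: gpu-operator
+data:
+  any: |-
+    version: v1
+    sharing:
+      timeSlicing:
+        resources:
+        - name: nvidia.com/gpu
+          replicas: 8         # each physical GPU is advertised as 8 schedulable GPUs
+```
+
+```shell script
+kubectl patch clusterpolicy/cluster-policy -n gpu-operator \
+  --type merge \
+  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
+```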
+Once the GPU Operator is installed and the time-slicing configuration is applied, check the status of the pods to ensure that all containers are running and validation is complete:
+```shell script
+kubectl get pod -n gpu-operator
+```
+```shell script
+NAME READY STATUS RESTARTS AGE
+gpu-feature-discovery-fd5x4 2/2 Running 0 5d2h
+gpu-operator-569d9c8cb-kbn7s 1/1 Running 14 (39h ago) 5d2h
+gpu-operator-node-feature-discovery-master-84c7c7c6cf-f4sxz 1/1 Running 0 5d2h
+gpu-operator-node-feature-discovery-worker-p5plv 1/1 Running 8 (39h ago) 5d2h
+nvidia-container-toolkit-daemonset-zq766 1/1 Running 0 5d2h
+nvidia-cuda-validator-5tldf 0/1 Completed 0 5d2h
+nvidia-dcgm-exporter-95vm8 1/1 Running 0 5d2h
+nvidia-device-plugin-daemonset-7nzvf 2/2 Running 0 5d2h
+nvidia-device-plugin-validator-gj7nn 0/1 Completed 0 5d2h
+nvidia-operator-validator-nz84d 1/1 Running 0 5d2h
+```
+Verify that the time-slicing configuration is applied successfully:
+
+```shell script
+kubectl describe node
+```
+
+```shell script
+Capacity:
+ nvidia.com/gpu: 16
+...
+Allocatable:
+ nvidia.com/gpu: 16
+...
+```
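+`kubectl describe node` is verbose; to show only the GPU resource lines, a simple filter works (nothing here is YuniKorn-specific):
+
+```shell script
+kubectl describe node | grep nvidia.com/gpu
+```
+
+With time-slicing enabled, the advertised count is the number of physical GPUs multiplied by the configured replicas, so a node can report more `nvidia.com/gpu` than it physically has.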
+### Testing TensorFlow job with GPUs
+This section covers a workload test scenario to validate a TFJob with time-slicing GPUs.
+
+1. Create a workload test file `tf-gpu.yaml` as follows:
+ ```shell script
+ vim tf-gpu.yaml
+ ```
+ ```yaml
+ apiVersion: "kubeflow.org/v1"
+ kind: "TFJob"
+ metadata:
+ name: "tf-smoke-gpu"
+ namespace: kubeflow
+ spec:
+ tfReplicaSpecs:
+ PS:
+ replicas: 1
+ template:
+ metadata:
+ creationTimestamp:
+ labels:
+ applicationId: "tf_job_20200521_001"
+ spec:
+ schedulerName: yunikorn
+ containers:
+ - args:
+ - python
+ - tf_cnn_benchmarks.py
+ - --batch_size=32
+ - --model=resnet50
+ - --variable_update=parameter_server
+ - --flush_stdout=true
+ - --num_gpus=1
+ - --local_parameter_device=cpu
+ - --device=cpu
+ - --data_format=NHWC
+ image: docker.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
+ name: tensorflow
+ ports:
+ - containerPort: 2222
+ name: tfjob-port
+ workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
+ restartPolicy: OnFailure
+ Worker:
+ replicas: 1
+ template:
+ metadata:
+ creationTimestamp: null
+ labels:
+ applicationId: "tf_job_20200521_001"
+ spec:
+ schedulerName: yunikorn
+ containers:
+ - args:
+ - python
+ - tf_cnn_benchmarks.py
+ - --batch_size=32
+ - --model=resnet50
+ - --variable_update=parameter_server
+ - --flush_stdout=true
+ - --num_gpus=1
+ - --local_parameter_device=cpu
+ - --device=gpu
+ - --data_format=NHWC
+ image: docker.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
+ name: tensorflow
+ ports:
+ - containerPort: 2222
+ name: tfjob-port
+ resources:
+ limits:
+ nvidia.com/gpu: 2
+ workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
+ restartPolicy: OnFailure
+ ```
+2. Create the TFJob:
+ ```shell script
+ kubectl apply -f tf-gpu.yaml
+ ```
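+   Before moving on, it can help to confirm that the job and its pods were created (the names follow the TFJob definition above):
+   ```shell script
+   kubectl get tfjob tf-smoke-gpu -n kubeflow
+   kubectl get pods -n kubeflow | grep tf-smoke-gpu
+   ```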
+3. Verify that the TFJob is running on YuniKorn:
+ ![tf-job-gpu-on-ui](../../assets/tf-job-gpu-on-ui.png)
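+   If the web UI is not reachable, the scheduler's REST API can also list applications. The service name, namespace and endpoint below assume a default Helm install, and the exact endpoint varies by YuniKorn version; adjust to your deployment:
+   ```shell script
+   kubectl port-forward svc/yunikorn-service 9080:9080 -n yunikorn
+   curl http://localhost:9080/ws/v1/apps
+   ```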
+ Check the log of the pod:
+ ```shell script
+   kubectl logs po/tf-smoke-gpu-worker-0 -n kubeflow
+ ```
+ ```
+ .......
+ ..Found device 0 with properties:
+ ..name: NVIDIA GeForce RTX 3080 major: 8 minor: 6 memoryClockRate(GHz): 1.71
+
+ .......
+ ..Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:01:00.0, compute capability: 8.6)
+ .......
+ ```
+ ![tf-job-gpu-on-logs](../../assets/tf-job-gpu-on-logs.png)
\ No newline at end of file