Posted to reviews@yunikorn.apache.org by wi...@apache.org on 2022/10/17 17:16:56 UTC
[yunikorn-site] branch master updated: [YUNIKORN-1339] Adding time slicing GPU to Tensorflow example (#195)
This is an automated email from the ASF dual-hosted git repository.
wilfreds pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-site.git
The following commit(s) were added to refs/heads/master by this push:
new 6ef97888c [YUNIKORN-1339] Adding time slicing GPU to Tensorflow example (#195)
6ef97888c is described below
commit 6ef97888c4ffd832b0535edef1d8682c2ad6d3e4
Author: xuanzongwu <t3...@gmail.com>
AuthorDate: Mon Oct 17 10:16:30 2022 -0700
[YUNIKORN-1339] Adding time slicing GPU to Tensorflow example (#195)
Closes: #195
Signed-off-by: Wilfred Spiegelenburg <wi...@apache.org>
---
docs/assets/tf-job-gpu-on-logs.png | Bin 0 -> 221691 bytes
docs/assets/tf-job-gpu-on-ui.png | Bin 0 -> 101211 bytes
docs/user_guide/workloads/run_tensorflow.md | 139 ++++++++++++++++++++++++++++
3 files changed, 139 insertions(+)
diff --git a/docs/assets/tf-job-gpu-on-logs.png b/docs/assets/tf-job-gpu-on-logs.png
new file mode 100644
index 000000000..db2a6b693
Binary files /dev/null and b/docs/assets/tf-job-gpu-on-logs.png differ
diff --git a/docs/assets/tf-job-gpu-on-ui.png b/docs/assets/tf-job-gpu-on-ui.png
new file mode 100644
index 000000000..b599dca7b
Binary files /dev/null and b/docs/assets/tf-job-gpu-on-ui.png differ
diff --git a/docs/user_guide/workloads/run_tensorflow.md b/docs/user_guide/workloads/run_tensorflow.md
index 367ac6e6c..152068bd1 100644
--- a/docs/user_guide/workloads/run_tensorflow.md
+++ b/docs/user_guide/workloads/run_tensorflow.md
@@ -91,3 +91,142 @@ You can view the job info from YuniKorn UI. If you do not know how to access the
please read the document [here](../../get_started/get_started.md#access-the-web-ui).
![tf-job-on-ui](../../assets/tf-job-on-ui.png)
+
+## Using Time-Slicing GPU
+
+### Prerequisite
+To use Time-Slicing GPU, your cluster must be configured to use GPUs with time-slicing enabled.
+- Nodes must have GPUs attached.
+- Kubernetes version 1.24.
+- GPU drivers must be installed on the cluster.
+- Use the [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html) to automatically set up and manage the NVIDIA software components on the worker nodes.
+- Configure [Time-Slicing GPUs in Kubernetes](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/gpu-sharing.html).
+
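+The time-slicing configuration referenced above is applied through a ConfigMap that the GPU Operator passes to the NVIDIA device plugin. A minimal sketch following the NVIDIA documentation (the ConfigMap name, config key and replica count below are illustrative, not fixed values):
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: time-slicing-config   # illustrative name
+  namespace: gpu-operator
+data:
+  any: |-
+    version: v1
+    sharing:
+      timeSlicing:
+        resources:
+        - name: nvidia.com/gpu
+          replicas: 8         # each physical GPU is advertised as 8 schedulable GPUs
+```
+
+```shell script
+kubectl patch clusterpolicy/cluster-policy -n gpu-operator \
+  --type merge \
+  -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'
+```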
+Once the GPU Operator is installed and the time-slicing configuration is applied, check the status of the pods to ensure that all containers are running and validation is complete:
+```shell script
+kubectl get pod -n gpu-operator
+```
+```shell script
+NAME READY STATUS RESTARTS AGE
+gpu-feature-discovery-fd5x4 2/2 Running 0 5d2h
+gpu-operator-569d9c8cb-kbn7s 1/1 Running 14 (39h ago) 5d2h
+gpu-operator-node-feature-discovery-master-84c7c7c6cf-f4sxz 1/1 Running 0 5d2h
+gpu-operator-node-feature-discovery-worker-p5plv 1/1 Running 8 (39h ago) 5d2h
+nvidia-container-toolkit-daemonset-zq766 1/1 Running 0 5d2h
+nvidia-cuda-validator-5tldf 0/1 Completed 0 5d2h
+nvidia-dcgm-exporter-95vm8 1/1 Running 0 5d2h
+nvidia-device-plugin-daemonset-7nzvf 2/2 Running 0 5d2h
+nvidia-device-plugin-validator-gj7nn 0/1 Completed 0 5d2h
+nvidia-operator-validator-nz84d 1/1 Running 0 5d2h
+```
+Verify that the time-slicing configuration is applied successfully:
+
+```shell script
+kubectl describe node
+```
+
+```shell script
+Capacity:
+ nvidia.com/gpu: 16
+...
+Allocatable:
+ nvidia.com/gpu: 16
+...
+```
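+`kubectl describe node` is verbose; to show only the GPU resource lines, a simple filter works (nothing here is YuniKorn-specific):
+
+```shell script
+kubectl describe node | grep nvidia.com/gpu
+```
+
+With time-slicing enabled, the advertised count is the number of physical GPUs multiplied by the configured replicas, so a node can report more `nvidia.com/gpu` than it physically has.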
+### Testing TensorFlow job with GPUs
+This section covers a workload test scenario to validate a TFJob with time-slicing GPUs.
+
+1. Create a workload test file `tf-gpu.yaml` as follows:
+ ```shell script
+ vim tf-gpu.yaml
+ ```
+ ```yaml
+ apiVersion: "kubeflow.org/v1"
+ kind: "TFJob"
+ metadata:
+ name: "tf-smoke-gpu"
+ namespace: kubeflow
+ spec:
+ tfReplicaSpecs:
+ PS:
+ replicas: 1
+ template:
+ metadata:
+ creationTimestamp:
+ labels:
+ applicationId: "tf_job_20200521_001"
+ spec:
+ schedulerName: yunikorn
+ containers:
+ - args:
+ - python
+ - tf_cnn_benchmarks.py
+ - --batch_size=32
+ - --model=resnet50
+ - --variable_update=parameter_server
+ - --flush_stdout=true
+ - --num_gpus=1
+ - --local_parameter_device=cpu
+ - --device=cpu
+ - --data_format=NHWC
+ image: docker.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
+ name: tensorflow
+ ports:
+ - containerPort: 2222
+ name: tfjob-port
+ workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
+ restartPolicy: OnFailure
+ Worker:
+ replicas: 1
+ template:
+ metadata:
+ creationTimestamp: null
+ labels:
+ applicationId: "tf_job_20200521_001"
+ spec:
+ schedulerName: yunikorn
+ containers:
+ - args:
+ - python
+ - tf_cnn_benchmarks.py
+ - --batch_size=32
+ - --model=resnet50
+ - --variable_update=parameter_server
+ - --flush_stdout=true
+ - --num_gpus=1
+ - --local_parameter_device=cpu
+ - --device=gpu
+ - --data_format=NHWC
+ image: docker.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
+ name: tensorflow
+ ports:
+ - containerPort: 2222
+ name: tfjob-port
+ resources:
+ limits:
+ nvidia.com/gpu: 2
+ workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
+ restartPolicy: OnFailure
+ ```
+2. Create the TFJob:
+ ```shell script
+ kubectl apply -f tf-gpu.yaml
+ ```
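+   Before moving on, it can help to confirm that the job and its pods were created (the names follow the TFJob definition above):
+   ```shell script
+   kubectl get tfjob tf-smoke-gpu -n kubeflow
+   kubectl get pods -n kubeflow | grep tf-smoke-gpu
+   ```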
+3. Verify that the TFJob is running on YuniKorn:
+ ![tf-job-gpu-on-ui](../../assets/tf-job-gpu-on-ui.png)
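+   If the web UI is not reachable, the scheduler's REST API can also list applications. The service name, namespace and endpoint below assume a default Helm install, and the exact endpoint varies by YuniKorn version; adjust to your deployment:
+   ```shell script
+   kubectl port-forward svc/yunikorn-service 9080:9080 -n yunikorn
+   curl http://localhost:9080/ws/v1/apps
+   ```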
+ Check the log of the pod:
+ ```shell script
+   kubectl logs po/tf-smoke-gpu-worker-0 -n kubeflow
+ ```
+ ```
+ .......
+ ..Found device 0 with properties:
+ ..name: NVIDIA GeForce RTX 3080 major: 8 minor: 6 memoryClockRate(GHz): 1.71
+
+ .......
+ ..Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:01:00.0, compute capability: 8.6)
+ .......
+ ```
+ ![tf-job-gpu-on-logs](../../assets/tf-job-gpu-on-logs.png)
\ No newline at end of file