Posted to reviews@yunikorn.apache.org by ww...@apache.org on 2022/11/26 07:56:01 UTC
[yunikorn-site] branch master updated: [YUNIKORN-1422] Adding Chinese translations of Run TensorFlow Jobs (#215)
This is an automated email from the ASF dual-hosted git repository.
wwei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-site.git
The following commit(s) were added to refs/heads/master by this push:
new bd3e0870e [YUNIKORN-1422] Adding Chinese translations of Run TensorFlow Jobs (#215)
bd3e0870e is described below
commit bd3e0870e2db480eb4322df8aa20d4c50c0cf078
Author: wusamzong <48...@users.noreply.github.com>
AuthorDate: Sat Nov 26 15:55:56 2022 +0800
[YUNIKORN-1422] Adding Chinese translations of Run TensorFlow Jobs (#215)
Co-authored-by: wusamzong <t3...@gamil.com>
---
.../current/user_guide/workloads/run_tensorflow.md | 136 +++++++++++++++++++++
1 file changed, 136 insertions(+)
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/workloads/run_tensorflow.md b/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/workloads/run_tensorflow.md
index c5d708c5e..a6007028b 100644
--- a/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/workloads/run_tensorflow.md
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/workloads/run_tensorflow.md
@@ -91,3 +91,139 @@ kubectl create -f deployments/examples/tfjob/tf-job-mnist.yaml
Please read this [document](../../get_started/get_started.md#访问-web-ui).
![tf-job-on-ui](../../assets/tf-job-on-ui.png)
+
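+## Using GPU Time-slicing
+### Prerequisites
+To use time-slicing GPUs, first configure the cluster so that both GPUs and time-slicing GPUs are available.
+- Nodes must have GPUs attached
+- Kubernetes version 1.24
+- GPU drivers must be installed on the cluster
+- Use the [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html) to automatically deploy and manage the NVIDIA software components on the nodes
+- Configure [Time-Slicing GPUs in Kubernetes](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/gpu-sharing.html)
+
+
+After installing the GPU Operator and configuring time-slicing GPUs, check the status of the pods to make sure all containers are running or completed: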
+```shell script
+kubectl get pod -n gpu-operator
+```
+```shell script
+NAME                                                          READY   STATUS      RESTARTS       AGE
+gpu-feature-discovery-fd5x4                                   2/2     Running     0              5d2h
+gpu-operator-569d9c8cb-kbn7s                                  1/1     Running     14 (39h ago)   5d2h
+gpu-operator-node-feature-discovery-master-84c7c7c6cf-f4sxz   1/1     Running     0              5d2h
+gpu-operator-node-feature-discovery-worker-p5plv              1/1     Running     8 (39h ago)    5d2h
+nvidia-container-toolkit-daemonset-zq766                      1/1     Running     0              5d2h
+nvidia-cuda-validator-5tldf                                   0/1     Completed   0              5d2h
+nvidia-dcgm-exporter-95vm8                                    1/1     Running     0              5d2h
+nvidia-device-plugin-daemonset-7nzvf                          2/2     Running     0              5d2h
+nvidia-device-plugin-validator-gj7nn                          0/1     Completed   0              5d2h
+nvidia-operator-validator-nz84d                               1/1     Running     0              5d2h
+```
+Verify that the time-slicing configuration was applied successfully:
+```shell script
+kubectl describe node
+```
+
+```shell script
+Capacity:
+ nvidia.com/gpu: 16
+...
+Allocatable:
+ nvidia.com/gpu: 16
+...
+```
+### Testing a TensorFlow job with GPUs
+This section tests and verifies that a TFJob runs with time-slicing GPU support.
+
+1. Create a workload test file named tf-gpu.yaml:
+ ```shell script
+ vim tf-gpu.yaml
+ ```
+ ```yaml
+ apiVersion: "kubeflow.org/v1"
+ kind: "TFJob"
+ metadata:
+   name: "tf-smoke-gpu"
+   namespace: kubeflow
+ spec:
+   tfReplicaSpecs:
+     PS:
+       replicas: 1
+       template:
+         metadata:
+           creationTimestamp:
+           labels:
+             applicationId: "tf_job_20200521_001"
+         spec:
+           schedulerName: yunikorn
+           containers:
+             - args:
+                 - python
+                 - tf_cnn_benchmarks.py
+                 - --batch_size=32
+                 - --model=resnet50
+                 - --variable_update=parameter_server
+                 - --flush_stdout=true
+                 - --num_gpus=1
+                 - --local_parameter_device=cpu
+                 - --device=cpu
+                 - --data_format=NHWC
+               image: docker.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
+               name: tensorflow
+               ports:
+                 - containerPort: 2222
+                   name: tfjob-port
+               workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
+           restartPolicy: OnFailure
+     Worker:
+       replicas: 1
+       template:
+         metadata:
+           creationTimestamp: null
+           labels:
+             applicationId: "tf_job_20200521_001"
+         spec:
+           schedulerName: yunikorn
+           containers:
+             - args:
+                 - python
+                 - tf_cnn_benchmarks.py
+                 - --batch_size=32
+                 - --model=resnet50
+                 - --variable_update=parameter_server
+                 - --flush_stdout=true
+                 - --num_gpus=1
+                 - --local_parameter_device=cpu
+                 - --device=gpu
+                 - --data_format=NHWC
+               image: docker.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
+               name: tensorflow
+               ports:
+                 - containerPort: 2222
+                   name: tfjob-port
+               resources:
+                 limits:
+                   nvidia.com/gpu: 2
+               workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
+           restartPolicy: OnFailure
+ ```
+2. Create the TFJob
+ ```shell script
+ kubectl apply -f tf-gpu.yaml
+ ```
+3. Verify in YuniKorn that the TFJob is running
+ ![tf-job-gpu-on-ui](../../assets/tf-job-gpu-on-ui.png)
+ Check the pod logs:
+ ```shell script
+ kubectl logs po/tf-smoke-gpu-worker-0 -n kubeflow
+ ```
+ ```
+ .......
+ ..Found device 0 with properties:
+ ..name: NVIDIA GeForce RTX 3080 major: 8 minor: 6 memoryClockRate(GHz): 1.71
+
+ .......
+ ..Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA GeForce RTX 3080, pci bus id: 0000:01:00.0, compute capability: 8.6)
+ .......
+ ```
+ ![tf-job-gpu-on-logs](../../assets/tf-job-gpu-on-logs.png)
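
The patch above verifies time-slicing by reading `nvidia.com/gpu` in the `kubectl describe node` output by eye. As a minimal sketch, that check could be scripted roughly as follows. The helper name `allocatable_gpus` is hypothetical and not part of the commit; it assumes the `Allocatable:` section layout shown in the patch, where the section header is unindented and its entries are indented.

```shell
#!/bin/sh
# Hypothetical helper (not part of the commit): read `kubectl describe node`
# output on stdin and print the sum of the nvidia.com/gpu values listed under
# every "Allocatable:" section (one section per node).
allocatable_gpus() {
  awk '
    /^Allocatable:/ { in_alloc = 1; next }   # entering an Allocatable section
    /^[^ \t]/       { in_alloc = 0 }         # any other unindented line ends it
    in_alloc && $1 == "nvidia.com/gpu:" { total += $2 }
    END             { print total + 0 }      # prints 0 when no GPU is advertised
  '
}

# Usage against a live cluster (requires kubectl access):
#   kubectl describe node | allocatable_gpus
```

On the single node shown in the patch this would be expected to print 16. A result of 0 would suggest that the device plugin is not advertising GPUs, and a result equal to the physical GPU count would suggest the time-slicing configuration has not taken effect.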