Posted to reviews@yunikorn.apache.org by ww...@apache.org on 2022/01/28 17:34:27 UTC

[incubator-yunikorn-site] branch master updated: [FIX] [YUNIKORN-1059] Fix get started doc yarn build error (#117)

This is an automated email from the ASF dual-hosted git repository.

wwei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-yunikorn-site.git


The following commit(s) were added to refs/heads/master by this push:
     new 529de2d  [FIX] [YUNIKORN-1059] Fix get started doc yarn build error (#117)
529de2d is described below

commit 529de2d9a6c56d0782dd435089ed2f9a76e128fc
Author: Thinking Chen <cd...@hotmail.com>
AuthorDate: Sat Jan 29 01:34:07 2022 +0800

    [FIX] [YUNIKORN-1059] Fix get started doc yarn build error (#117)
    
    Added missing files after revalidation of build
---
 .../current/developer_guide                        |   1 -
 .../current/get_started/get_started.md             |   4 +-
 .../current/performance/metrics.md                 | 109 +++++
 .../current/performance/performance_tutorial.md    | 452 +++++++++++++++++++++
 .../current/user_guide/gang_scheduling.md          | 285 +++++++++++++
 .../current/user_guide/trouble_shooting.md         | 192 +++++++++
 .../current/user_guide/workloads/run_spark.md      | 149 +++++++
 .../current/user_guide/workloads/run_tensorflow.md |  93 +++++
 .../version-0.12.1/developer_guide                 |   1 -
 .../version-0.12.1/get_started/get_started.md      |   4 +-
 .../version-0.12.1/performance/metrics.md          | 109 +++++
 .../performance/performance_tutorial.md            | 452 +++++++++++++++++++++
 .../version-0.12.1/user_guide/gang_scheduling.md   | 285 +++++++++++++
 .../version-0.12.1/user_guide/trouble_shooting.md  | 192 +++++++++
 .../user_guide/workloads/run_spark.md              | 149 +++++++
 .../user_guide/workloads/run_tensorflow.md         |  93 +++++
 16 files changed, 2564 insertions(+), 6 deletions(-)

diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/current/developer_guide b/i18n/zh-cn/docusaurus-plugin-content-docs/current/developer_guide
deleted file mode 120000
index c4ae1e2..0000000
--- a/i18n/zh-cn/docusaurus-plugin-content-docs/current/developer_guide
+++ /dev/null
@@ -1 +0,0 @@
-../../../../docs/developer_guide
\ No newline at end of file
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/current/get_started/get_started.md b/i18n/zh-cn/docusaurus-plugin-content-docs/current/get_started/get_started.md
index 6601bac..035d25d 100644
--- a/i18n/zh-cn/docusaurus-plugin-content-docs/current/get_started/get_started.md
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/current/get_started/get_started.md
@@ -25,7 +25,7 @@ under the License.
 
 在阅读本指南之前,我们假设您有一个Kubernetes集群或本地 Kubernetes 开发环境,例如 MiniKube。
 还假定 `kubectl` 在您的环境路径内,并且配置正确。
-遵循此 [指南](../developer_guide/env_setup.md) 来讲述如何使用 docker-desktop 设置本地Kubernetes集群。
+遵循此 [指南](developer_guide/env_setup.md) 来讲述如何使用 docker-desktop 设置本地Kubernetes集群。
 
 ## 安装
 
@@ -43,7 +43,7 @@ helm install yunikorn yunikorn/yunikorn --namespace yunikorn
 `admission-controller` 一旦安装,它将把所有集群流量路由到YuniKorn。
 这意味着资源调度会委托给YuniKorn。在Helm安装过程中,可以通过将 `embedAdmissionController` 标志设置为false来禁用它。
 
-如果你不想使用 Helm Chart,您可以找到我们的细节教程 [点击这里](../developer_guide/deployment.md) 。
+如果你不想使用 Helm Chart,您可以找到我们的细节教程 [点击这里](developer_guide/deployment.md) 。
 
 ## 卸载
 
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/metrics.md b/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/metrics.md
new file mode 100644
index 0000000..9dedbec
--- /dev/null
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/metrics.md
@@ -0,0 +1,109 @@
+---
+id: metrics
+title: Scheduler Metrics
+keywords:
+ - metrics
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+YuniKorn leverages [Prometheus](https://prometheus.io/) to record metrics. The metrics system keeps track of the
+scheduler's critical execution paths to reveal potential performance bottlenecks. Currently, there are three categories
+of metrics:
+
+- scheduler: generic metrics of the scheduler, such as allocation latency, number of applications, etc.
+- queue: each queue has its own metrics sub-system, tracking queue status.
+- event: records various event changes in YuniKorn.
+
+All metrics are declared in the `yunikorn` namespace.
+
+### Scheduler Metrics
+
+| Metrics Name          | Metrics Type  | Description  | 
+| --------------------- | ------------  | ------------ |
+| containerAllocation   | Counter       | Total number of attempts to allocate containers. State of the attempt includes `allocated`, `rejected`, `error`, `released`. Increase only.  |
+| applicationSubmission | Counter       | Total number of application submissions. State of the attempt includes `accepted` and `rejected`. Increase only. |
+| applicationStatus     | Gauge         | Total number of applications by status. State of the application includes `running` and `completed`.  | 
+| totalNodeActive       | Gauge         | Total number of active nodes.                          |
+| totalNodeFailed       | Gauge         | Total number of failed nodes.                          |
+| nodeResourceUsage     | Gauge         | Total resource usage of node, by resource name.        |
+| schedulingLatency     | Histogram     | Latency of the main scheduling routine, in seconds.    |
+| nodeSortingLatency    | Histogram     | Latency of all nodes sorting, in seconds.              |
+| appSortingLatency     | Histogram     | Latency of all applications sorting, in seconds.       |
+| queueSortingLatency   | Histogram     | Latency of all queues sorting, in seconds.             |
+| tryNodeLatency        | Histogram     | Latency of node condition checks for container allocations, such as placement constraints, in seconds. |
+
+### Queue Metrics
+
+| Metrics Name              | Metrics Type  | Description |
+| ------------------------- | ------------- | ----------- |
+| appMetrics                | Counter       | Application metrics, recording the total number of applications. State of the application includes `accepted`, `rejected` and `completed`.     |
+| usedResourceMetrics       | Gauge         | Queue used resource.     |
+| pendingResourceMetrics    | Gauge         | Queue pending resource.  |
+| availableResourceMetrics  | Gauge         | Queue available resource.    |
+
+### Event Metrics
+
+| Metrics Name             | Metrics Type  | Description |
+| ------------------------ | ------------  | ----------- |
+| totalEventsCreated       | Gauge         | Total events created.          |
+| totalEventsChanneled     | Gauge         | Total events channeled.        |
+| totalEventsNotChanneled  | Gauge         | Total events not channeled.    |
+| totalEventsProcessed     | Gauge         | Total events processed.        |
+| totalEventsStored        | Gauge         | Total events stored.           |
+| totalEventsNotStored     | Gauge         | Total events not stored.       |
+| totalEventsCollected     | Gauge         | Total events collected.        |
+
+## Access Metrics
+
+YuniKorn metrics are collected through the Prometheus client library and exposed via the scheduler's RESTful service.
+Once the scheduler is started, they can be accessed via the endpoint http://localhost:9080/ws/v1/metrics.
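+
+For a quick check, the endpoint can be queried directly (a minimal sketch; it assumes the scheduler REST service is reachable on `localhost:9080`, for example via `kubectl port-forward`, and that the Helm chart created a service named `yunikorn-service`):
+
+```shell script
+# forward the scheduler REST port, then fetch the metrics in Prometheus text format
+kubectl port-forward svc/yunikorn-service 9080:9080 -n yunikorn &
+curl http://localhost:9080/ws/v1/metrics
+```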
+
+## Aggregate Metrics to Prometheus
+
+It's simple to set up a Prometheus server to scrape YuniKorn metrics periodically. Follow these steps:
+
+- Setup Prometheus (read more from [Prometheus docs](https://prometheus.io/docs/prometheus/latest/installation/))
+
+- Configure the Prometheus scrape config; a sample configuration:
+
+```yaml
+global:
+  scrape_interval:     3s
+  evaluation_interval: 15s
+
+scrape_configs:
+  - job_name: 'yunikorn'
+    scrape_interval: 1s
+    metrics_path: '/ws/v1/metrics'
+    static_configs:
+    - targets: ['docker.for.mac.host.internal:9080']
+```
+
+- Start Prometheus:
+
+```shell script
+docker pull prom/prometheus:latest
+docker run -p 9090:9090 -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
+```
+
+Use `docker.for.mac.host.internal` instead of `localhost` if you are running Prometheus in a local docker container
+on macOS. Once started, open the Prometheus web UI at http://localhost:9090/graph. You'll see all available metrics from the
+YuniKorn scheduler.
+
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/performance_tutorial.md b/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/performance_tutorial.md
new file mode 100644
index 0000000..2d34025
--- /dev/null
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/performance_tutorial.md
@@ -0,0 +1,452 @@
+---
+id: performance_tutorial
+title: Benchmarking Tutorial
+keywords:
+ - performance
+ - tutorial
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Overview
+
+The YuniKorn community continues to optimize the performance of the scheduler, ensuring that YuniKorn satisfies the performance requirements of large-scale batch workloads. Thus, the community has built some useful tools for performance benchmarking that can be reused across releases. This document introduces all these tools and steps to run them.
+
+## Hardware
+
+Be aware that performance results are highly variable depending on the underlying hardware. All results published in this doc can only be used as references. We encourage everyone to run similar tests in their own environments in order to get results based on their own hardware. This doc is for demonstration purposes only.
+
+The list of servers used in this test is below (huge thanks to [National Taichung University of Education](http://www.ntcu.edu.tw/newweb/index.htm) and [Kuan-Chou Lai](http://www.ntcu.edu.tw/kclai/) for providing these servers to run the tests):
+
+| Machine Type          | CPU | Memory | Download/Upload (Mbps) |
+| --------------------- | --- | ------ | --------------------- |
+| HP                    | 16  | 36G    | 525.74/509.86         |
+| HP                    | 16  | 30G    | 564.84/461.82         |
+| HP                    | 16  | 30G    | 431.06/511.69         |
+| HP                    | 24  | 32G    | 577.31/576.21         |
+| IBM blade H22         | 16  | 38G    | 432.11/4.15           |
+| IBM blade H22         | 16  | 36G    | 714.84/4.14           |
+| IBM blade H22         | 16  | 42G    | 458.38/4.13           |
+| IBM blade H22         | 16  | 42G    | 445.42/4.13           |
+| IBM blade H22         | 16  | 32G    | 400.59/4.13           |
+| IBM blade H22         | 16  | 12G    | 499.87/4.13           |
+| IBM blade H23         | 8   | 32G    | 468.51/4.14           |
+| WS660T                | 8   | 16G    | 87.73/86.30           |
+| ASUSPRO D640MB_M640SA | 4   | 8G     | 92.43/93.77           |
+| PRO E500 G6_WS720T    | 16  | 8G     | 90/87.18              |
+| WS E500 G6_WS720T     | 8   | 40G    | 92.61/89.78           |
+| E500 G5               | 8   | 8G     | 91.34/85.84           |
+| WS E500 G5_WS690T     | 12  | 16G    | 92.2/93.76            |
+| WS E500 G5_WS690T     | 8   | 32G    | 91/89.41              |
+| WS E900 G4_SW980T     | 80  | 512G   | 89.24/87.97           |
+
+The following steps are needed on each server, otherwise the large-scale testing may fail due to the limited number of users/processes/open files.
+
+### 1. Set /etc/sysctl.conf
+```
+kernel.pid_max=400000
+fs.inotify.max_user_instances=50000
+fs.inotify.max_user_watches=52094
+```
+### 2. Set /etc/security/limits.conf
+
+```
+* soft nproc 4000000
+* hard nproc 4000000
+root soft nproc 4000000
+root hard nproc 4000000
+* soft nofile 50000
+* hard nofile 50000
+root soft nofile 50000
+root hard nofile 50000
+```
+---
+
+## Deploy workflow
+
+Before going into the details, here are the general steps used in our tests:
+
+- [Step 1](#Kubernetes): Properly configure Kubernetes API server and controller manager, then add worker nodes.
+- [Step 2](#Setup-Kubemark): Deploy hollow pods, which simulate worker nodes and are called hollow nodes. After all hollow nodes are in ready status, we need to cordon all native nodes (the nodes physically present in the cluster, not the simulated ones) to avoid allocating test workload pods to native nodes; see the cordon sketch after this list.
+- [Step 3](#Deploy-YuniKorn): Deploy YuniKorn using the Helm chart on the master node, scale the Deployment down to 0 replicas, and [modify the port](#Setup-Prometheus) in `prometheus.yml` to match the port of the service.
+- [Step 4](#Run-tests): Deploy 50k Nginx pods for testing; the API server will create them. But since the YuniKorn scheduler Deployment has been scaled down to 0 replicas, all Nginx pods will be stuck in the pending state.
+- [Step 5](../user_guide/trouble_shooting.md#restart-the-scheduler): Scale the YuniKorn Deployment back up to 1 replica, and cordon the master node to avoid YuniKorn allocating Nginx pods there. In this step, YuniKorn will start collecting the metrics.
+- [Step 6](#Collect-and-Observe-YuniKorn-metrics): Observe the metrics exposed in Prometheus UI.
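+
+As referenced in step 2, cordoning is a single `kubectl` call per node (a sketch; `<native-node-name>` is a placeholder for each physical node):
+
+```
+# mark the native node unschedulable so test pods only land on hollow nodes
+kubectl cordon <native-node-name>
+```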
+---
+
+## Setup Kubemark
+
+[Kubemark](https://github.com/kubernetes/kubernetes/tree/master/test/kubemark) is a performance testing tool which allows users to run experiments on simulated clusters. The primary use case is scalability testing. The basic idea is to run tens or hundreds of fake kubelet nodes on one physical node in order to simulate large-scale clusters. In our tests, we leverage Kubemark to simulate up to a 4K-node cluster on less than 20 physical nodes.
+
+### 1. Build image
+
+##### Clone the kubernetes repo and build the kubemark binary
+
+```
+git clone https://github.com/kubernetes/kubernetes.git
+```
+```
+cd kubernetes
+```
+```
+KUBE_BUILD_PLATFORMS=linux/amd64 make kubemark GOFLAGS=-v GOGCFLAGS="-N -l"
+```
+
+##### Copy the kubemark binary to the image folder and build the kubemark docker image
+
+```
+cp _output/bin/kubemark cluster/images/kubemark
+```
+```
+IMAGE_TAG=v1.XX.X make build
+```
+After this step, you will have a kubemark image that can simulate cluster nodes. You can upload it to Docker Hub or just deploy it locally.
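+
+For example, pushing the image to a registry could look like this (a sketch; the image name and repository below are placeholders, adjust them to whatever `make build` produced and to your own registry):
+
+```
+docker tag kubemark:v1.XX.X <your-registry>/kubemark:v1.XX.X
+docker push <your-registry>/kubemark:v1.XX.X
+```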
+
+### 2. Install Kubemark
+
+##### Create kubemark namespace
+
+```
+kubectl create ns kubemark
+```
+
+##### Create configmap
+
+```
+kubectl create configmap node-configmap -n kubemark --from-literal=content.type="test-cluster"
+```
+
+##### Create secret
+
+```
+kubectl create secret generic kubeconfig --type=Opaque --namespace=kubemark --from-file=kubelet.kubeconfig={kubeconfig_file_path} --from-file=kubeproxy.kubeconfig={kubeconfig_file_path}
+```
+### 3. Label node
+
+We need to label all native nodes, otherwise the scheduler might allocate hollow pods to other simulated hollow nodes. We can leverage a node selector in the yaml to allocate hollow pods to native nodes.
+
+```
+kubectl label node {node name} tag=tagName
+```
+
+### 4. Deploy Kubemark
+
+The hollow-node.yaml is shown below; there are some parameters we can configure.
+
+```
+apiVersion: v1
+kind: ReplicationController
+metadata:
+  name: hollow-node
+  namespace: kubemark
+spec:
+  replicas: 2000  # the number of nodes you want to simulate
+  selector:
+      name: hollow-node
+  template:
+    metadata:
+      labels:
+        name: hollow-node
+    spec:
+      nodeSelector:  # leverage the label to allocate hollow pods to native nodes
+        tag: tagName  
+      initContainers:
+      - name: init-inotify-limit
+        image: docker.io/busybox:latest
+        imagePullPolicy: IfNotPresent
+        command: ['sysctl', '-w', 'fs.inotify.max_user_instances=200'] # set the same as max_user_instances on the actual node
+        securityContext:
+          privileged: true
+      volumes:
+      - name: kubeconfig-volume
+        secret:
+          secretName: kubeconfig
+      - name: logs-volume
+        hostPath:
+          path: /var/log
+      containers:
+      - name: hollow-kubelet
+        image: 0yukali0/kubemark:1.20.10 # the kubemark image you built
+        imagePullPolicy: IfNotPresent
+        ports:
+        - containerPort: 4194
+        - containerPort: 10250
+        - containerPort: 10255
+        env:
+        - name: NODE_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.name
+        command:
+        - /kubemark
+        args:
+        - --morph=kubelet
+        - --name=$(NODE_NAME)
+        - --kubeconfig=/kubeconfig/kubelet.kubeconfig
+        - --alsologtostderr
+        - --v=2
+        volumeMounts:
+        - name: kubeconfig-volume
+          mountPath: /kubeconfig
+          readOnly: true
+        - name: logs-volume
+          mountPath: /var/log
+        resources:
+          requests:    # the resources of the hollow pod, can be modified
+            cpu: 20m
+            memory: 50M
+        securityContext:
+          privileged: true
+      - name: hollow-proxy
+        image: 0yukali0/kubemark:1.20.10 # the kubemark image you built
+        imagePullPolicy: IfNotPresent
+        env:
+        - name: NODE_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.name
+        command:
+        - /kubemark
+        args:
+        - --morph=proxy
+        - --name=$(NODE_NAME)
+        - --use-real-proxier=false
+        - --kubeconfig=/kubeconfig/kubeproxy.kubeconfig
+        - --alsologtostderr
+        - --v=2
+        volumeMounts:
+        - name: kubeconfig-volume
+          mountPath: /kubeconfig
+          readOnly: true
+        - name: logs-volume
+          mountPath: /var/log
+        resources:  # the resources of the hollow pod, can be modified
+          requests:
+            cpu: 20m
+            memory: 50M
+      tolerations:
+      - effect: NoExecute
+        key: node.kubernetes.io/unreachable
+        operator: Exists
+      - effect: NoExecute
+        key: node.kubernetes.io/not-ready
+        operator: Exists
+```
+
+Once done editing, apply it to the cluster:
+
+```
+kubectl apply -f hollow-node.yaml
+```
+
+---
+
+## Deploy YuniKorn
+
+#### Install YuniKorn with helm
+
+We can install YuniKorn with Helm; please refer to this [doc](https://yunikorn.apache.org/docs/#install).
+We need to tune some parameters based on the default configuration. We recommend cloning the [release repo](https://github.com/apache/incubator-yunikorn-release) and modifying the parameters in `value.yaml`.
+
+```
+git clone https://github.com/apache/incubator-yunikorn-release.git
+cd helm-charts/yunikorn
+```
+
+#### Configuration
+
+The modifications in the `value.yaml` are:
+
+- increased memory/cpu resources for the scheduler pod
+- disabled the admission controller
+- set the app sorting policy to FAIR
+
+Please see the changes below:
+
+```
+resources:
+  requests:
+    cpu: 14
+    memory: 16Gi
+  limits:
+    cpu: 14
+    memory: 16Gi
+```
+```
+embedAdmissionController: false
+```
+```
+configuration: |
+  partitions:
+    -
+      name: default
+      queues:
+        - name: root
+          submitacl: '*'
+          queues:
+            -
+              name: sandbox
+              properties:
+                application.sort.policy: fair
+```
+
+#### Install YuniKorn with local release repo
+
+```
+helm install yunikorn . --namespace yunikorn
+```
+
+---
+
+## Setup Prometheus
+
+YuniKorn exposes its scheduling metrics via Prometheus. Thus, we need to set up a Prometheus server to collect these metrics.
+
+### 1. Download Prometheus release
+
+```
+wget https://github.com/prometheus/prometheus/releases/download/v2.30.3/prometheus-2.30.3.linux-amd64.tar.gz
+```
+```
+tar xvfz prometheus-*.tar.gz
+cd prometheus-*
+```
+
+### 2. Configure prometheus.yml
+
+```
+global:
+  scrape_interval:     3s
+  evaluation_interval: 15s
+
+scrape_configs:
+  - job_name: 'yunikorn'
+    scrape_interval: 1s
+    metrics_path: '/ws/v1/metrics'
+    static_configs:
+    - targets: ['docker.for.mac.host.internal:9080'] 
+    # 9080 is the scheduler's internal port; port-forward it or replace 9080 with the service's port
+```
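+
+If Prometheus runs outside the cluster, one option is to port-forward the scheduler service so that a `localhost:9080` target works (a sketch; it assumes the service created by the Helm chart is named `yunikorn-service`):
+
+```
+kubectl port-forward svc/yunikorn-service 9080:9080 -n yunikorn
+```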
+
+### 3. Launch Prometheus
+```
+./prometheus --config.file=prometheus.yml
+```
+
+---
+## Run tests
+
+Once the environment is set up, you are ready to run workloads and collect results. The YuniKorn community has some useful tools to run workloads and collect metrics; more details will be published here.
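+
+As a simple illustration of the Nginx workload from step 4 of the deploy workflow (a sketch, not one of the community tools; the replica count, labels and resource requests are placeholders):
+
+```
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: nginx-benchmark
+spec:
+  replicas: 50000            # the number of test pods to create
+  selector:
+    matchLabels:
+      app: nginx-benchmark
+  template:
+    metadata:
+      labels:
+        app: nginx-benchmark
+        applicationId: nginx-benchmark-0001   # YuniKorn groups the pods into one app by this label
+    spec:
+      schedulerName: yunikorn                 # let YuniKorn schedule the test pods
+      containers:
+        - name: nginx
+          image: nginx:latest
+          resources:
+            requests:
+              cpu: 5m
+              memory: 10M
+```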
+
+---
+
+## Collect and Observe YuniKorn metrics
+
+After Prometheus is launched, YuniKorn metrics can be easily collected. Here are the [docs](metrics.md) for YuniKorn metrics. YuniKorn tracks some key scheduling metrics which measure the latency of critical scheduling paths. These metrics include:
+
+ - **scheduling_latency_seconds:** Latency of the main scheduling routine, in seconds.
+ - **app_sorting_latency_seconds**: Latency of all applications sorting, in seconds.
+ - **node_sorting_latency_seconds**: Latency of all nodes sorting, in seconds.
+ - **queue_sorting_latency_seconds**: Latency of all queues sorting, in seconds.
+ - **container_allocation_attempt_total**: Total number of attempts to allocate containers. State of the attempt includes `allocated`, `rejected`, `error`, `released`. Increase only.
+
+You can easily select metrics and generate a graph in the Prometheus UI, such as:
+
+![Prometheus Metrics List](./../assets/prometheus.png)
+
+
+---
+
+## Performance Tuning
+
+### Kubernetes
+
+The default K8s setup limits concurrent requests, which limits the overall throughput of the cluster. In this section, we introduce a few parameters that need to be tuned up in order to increase the overall throughput of the cluster.
+
+#### kubeadm
+
+Set the pod network CIDR:
+
+```
+kubeadm init --pod-network-cidr=10.244.0.0/8
+```
+
+#### CNI
+
+Modify CNI mask and resources.
+
+```
+  net-conf.json: |
+    {
+      "Network": "10.244.0.0/8",
+      "Backend": {
+        "Type": "vxlan"
+      }
+    }
+```
+```
+  resources:
+    requests:
+      cpu: "100m"
+      memory: "200Mi"
+    limits:
+      cpu: "100m"
+      memory: "200Mi"
+```
+
+
+#### Api-Server
+
+In the Kubernetes API server, we need to modify two parameters: `max-mutating-requests-inflight` and `max-requests-inflight`. These two parameters represent the API request bandwidth. Because we will generate a large number of pod requests, we need to increase these two parameters. Modify `/etc/kubernetes/manifests/kube-apiserver.yaml`:
+
+```
+--max-mutating-requests-inflight=3000
+--max-requests-inflight=3000
+```
+
+#### Controller-Manager
+
+In the Kubernetes controller manager, we need to increase the value of three parameters: `node-cidr-mask-size`, `kube-api-burst` and `kube-api-qps`. `kube-api-burst` and `kube-api-qps` control the server-side request bandwidth. `node-cidr-mask-size` represents the node CIDR; it needs to be increased as well in order to scale up to thousands of nodes. 
+
+
+Modify `/etc/kubernetes/manifests/kube-controller-manager.yaml`:
+
+```
+--node-cidr-mask-size=21  # log2(max number of pods in the cluster)
+--kube-api-burst=3000
+--kube-api-qps=3000
+```
+
+#### kubelet
+
+On a single worker node, we can run 110 pods by default. To get higher node resource utilization, we need to add some parameters to the kubelet launch command and restart it.
+
+Modify the start arguments in `/etc/systemd/system/kubelet.service.d/10-kubeadm.conf`, append `--max-pods=300` to the start arguments, then reload and restart kubelet:
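+
+For example, the modified drop-in could look like this (a sketch; the exact `ExecStart` contents depend on your kubeadm version):
+
+```
+ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS --max-pods=300
+```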
+
+```
+systemctl daemon-reload
+systemctl restart kubelet
+```
+
+---
+
+## Summary
+
+With Kubemark and Prometheus, we can easily run benchmark tests, collect YuniKorn metrics and analyze the performance. This helps us identify performance bottlenecks in the scheduler and eliminate them. The YuniKorn community will continue to improve these tools in the future, and continue to deliver more performance improvements.
\ No newline at end of file
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/gang_scheduling.md b/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/gang_scheduling.md
new file mode 100644
index 0000000..47b5722
--- /dev/null
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/gang_scheduling.md
@@ -0,0 +1,285 @@
+---
+id: gang_scheduling
+title: Gang Scheduling
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## What is Gang Scheduling
+
+When Gang Scheduling is enabled, YuniKorn schedules the app only when
+the app’s minimal resource request can be satisfied. Otherwise, apps
+wait in the queue. Apps are queued in hierarchical queues;
+with gang scheduling enabled, each resource queue is assigned the
+maximum number of applications that can run concurrently with their minimal resources guaranteed.
+
+![Gang Scheduling](./../assets/gang_scheduling_iintro.png)
+
+## Enable Gang Scheduling
+
+There is no cluster-wide configuration needed to enable Gang Scheduling.
+The scheduler actively monitors the metadata of each app; if the app has included
+a valid taskGroups definition, it is considered to request gang scheduling.
+
+:::info Task Group
+A task group is a “gang” of tasks in an app; these tasks have the same resource profile
+and the same placement constraints. They are considered homogeneous requests that can be
+treated as the same kind in the scheduler.
+:::
+
+### Prerequisite
+
+For queues which run gang-scheduling-enabled applications, the queue sorting policy needs to be set to either
+`FIFO` or `StateAware`. To configure the queue sorting policy, please refer to the doc: [app sorting policies](user_guide/sorting_policies.md#Application_sorting).
+
+:::info Why FIFO based sorting policy?
+When Gang Scheduling is enabled, the scheduler proactively reserves resources
+for each application. If the queue sorting policy is not FIFO based (StateAware is a FIFO-based sorting policy),
+the scheduler might reserve partial resources for each app, causing resource segmentation issues.
+:::
+
+### App Configuration
+
+On Kubernetes, YuniKorn discovers apps by loading metadata from individual pods; the first pod of the app
+is required to be enclosed with a full copy of the app metadata. If the app doesn’t have any notion of a first or second pod,
+then all pods are required to carry the same taskGroups info. Gang scheduling requires a taskGroups definition,
+which can be specified via pod annotations. The required fields are:
+
+| Annotation                                     | Value |
+|----------------------------------------------- |---------------------	|
+| yunikorn.apache.org/task-group-name 	         | Task group name, it must be unique within the application |
+| yunikorn.apache.org/task-groups                | A list of task groups, each item containing all the info defined for a certain task group |
+| yunikorn.apache.org/schedulingPolicyParameters | Optional. Arbitrary key-value pairs to define scheduling policy parameters. Please read the [schedulingPolicyParameters section](#scheduling-policy-parameters) |
+
+#### How many task groups are needed?
+
+This depends on how many different types of pods this app requests from K8s. A task group is a “gang” of tasks in an app;
+these tasks have the same resource profile and the same placement constraints. They are considered homogeneous
+requests that can be treated as the same kind in the scheduler. Using Spark as an example, each job needs 2 task groups,
+one for the driver pod and the other one for the executor pods.
+
+#### How to define task groups?
+
+The task group definition is a copy of the app’s real pod definition; values for fields like resources, node-selector
+and tolerations should be the same as the real pods. This is to ensure the scheduler can reserve resources with the
+exact correct pod specification.
+
+#### Scheduling Policy Parameters
+
+Scheduling-policy-related configurable parameters. Apply the parameters in the following format in the pod's annotations:
+
+```yaml
+annotations:
+   yunikorn.apache.org/schedulingPolicyParameters: "PARAM1=VALUE1 PARAM2=VALUE2 ..."
+```
+
+Currently, the following parameters are supported:
+
+`placeholderTimeoutInSeconds`
+
+Default value: *15 minutes*.
+This parameter defines the reservation timeout, i.e. how long the scheduler should wait before giving up allocating all the placeholders.
+The timeout timer starts to tick when the scheduler *allocates the first placeholder pod*. This ensures that if the scheduler
+could not schedule all the placeholder pods, it will eventually give up after a certain amount of time, so that the resources can be
+freed up and used by other apps. If none of the placeholders can be allocated, this timeout won't kick in. To avoid the placeholder
+pods getting stuck forever, please refer to [troubleshooting](trouble_shooting.md#gang-scheduling) for solutions.
+
+`gangSchedulingStyle`
+
+Valid values: *Soft*, *Hard*
+
+Default value: *Soft*.
+This parameter defines the fallback mechanism if the app encounters gang issues due to placeholder pod allocation.
+See more details in the [Gang Scheduling styles](#gang-scheduling-styles) section.
+
+More scheduling parameters will be added in order to provide more flexibility when scheduling apps.
+
+#### Example
+
+The following example is a yaml file for a job. This job launches 2 pods and each pod sleeps 30 seconds.
+The notable change in the pod spec is *spec.template.metadata.annotations*, where we defined `yunikorn.apache.org/task-group-name`
+and `yunikorn.apache.org/task-groups`.
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: gang-scheduling-job-example
+spec:
+  completions: 2
+  parallelism: 2
+  template:
+    metadata:
+      labels:
+        app: sleep
+        applicationId: "gang-scheduling-job-example"
+        queue: root.sandbox
+      annotations:
+        yunikorn.apache.org/task-group-name: task-group-example
+        yunikorn.apache.org/task-groups: |-
+          [{
+              "name": "task-group-example",
+              "minMember": 2,
+              "minResource": {
+                "cpu": "100m",
+                "memory": "50M"
+              },
+              "nodeSelector": {},
+              "tolerations": []
+          }]
+    spec:
+      schedulerName: yunikorn
+      restartPolicy: Never
+      containers:
+        - name: sleep30
+          image: "alpine:latest"
+          command: ["sleep", "30"]
+          resources:
+            requests:
+              cpu: "100m"
+              memory: "50M"
+```
+
+When this job is submitted to Kubernetes, 2 pods will be created using the same template, and they all belong to one taskGroup:
+*“task-group-example”*. YuniKorn will create 2 placeholder pods, each using the resources specified in the taskGroup definition.
+When both placeholders are allocated, the scheduler will bind the 2 real sleep pods using the spots reserved by the placeholders.
+
+You can add more than one taskGroup if necessary; each taskGroup is identified by its name.
+It is required to map each real pod to a pre-defined taskGroup by setting the taskGroup name. Note,
+the task group name is only required to be unique within an application.
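+
+For example, a real pod that belongs to another task group only needs to reference that group by name via the annotation from the table above (a minimal sketch; the group name is a placeholder):
+
+```yaml
+annotations:
+  yunikorn.apache.org/task-group-name: task-group-example-2
+```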
+
+### Enable Gang scheduling for Spark jobs
+
+Each Spark job runs 2 types of pods, driver and executor. Hence, we need to define 2 task groups for each job.
+The annotations for the driver pod look like:
+
+```yaml
+Annotations:
+  yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=30"
+  yunikorn.apache.org/task-group-name: "spark-driver"
+  yunikorn.apache.org/task-groups: |-
+    [{
+        "name": "spark-driver",
+        "minMember": 1,
+        "minResource": {
+          "cpu": "1",
+          "memory": "2Gi"
+        },
+        "nodeSelector": { ... },
+        "tolerations": [ ... ]
+     },
+     {
+        "name": "spark-executor",
+        "minMember": 10,
+        "minResource": {
+          "cpu": "1",
+          "memory": "2Gi"
+        }
+     }]
+```
+
+:::note
+The Spark driver and executor pods have memory overhead, which needs to be considered in the taskGroup resources. 
+:::
+
+For all the executor pods,
+
+```yaml
+Annotations:
+  # the task group name should match one of the names
+  # defined in the task-groups annotation
+  yunikorn.apache.org/task-group-name: "spark-executor"
+```
+
+Once the job is submitted to the scheduler, the job won’t be scheduled immediately.
+Instead, the scheduler will ensure it gets its minimal resources before actually starting the driver/executors. 
+
+## Gang scheduling Styles
+
+There are 2 gang scheduling styles supported, Soft and Hard. The style can be configured per app to define how the app behaves in case gang scheduling fails.
+
+- `Hard style`: when this style is used, we keep the original behavior: if the application cannot be scheduled according to gang scheduling rules and it times out, it will be marked as failed, without retrying to schedule it.
+- `Soft style`: when the app cannot be gang scheduled, it falls back to normal scheduling, and the non-gang scheduling strategy is used to achieve best-effort scheduling. When this happens, the app transitions to the Resuming state and all the remaining placeholder pods are cleaned up.
+
+**Default style used**: `Soft`
+
+**Enable a specific style**: the style can be changed by setting the `gangSchedulingStyle` parameter to Soft or Hard in the application definition.
+
+#### Example
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: gang-app-timeout
+spec:
+  completions: 4
+  parallelism: 4
+  template:
+    metadata:
+      labels:
+        app: sleep
+        applicationId: gang-app-timeout
+        queue: fifo
+      annotations:
+        yunikorn.apache.org/task-group-name: sched-style
+        yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard"
+        yunikorn.apache.org/task-groups: |-
+          [{
+              "name": "sched-style",
+              "minMember": 4,
+              "minResource": {
+                "cpu": "1",
+                "memory": "1000M"
+              },
+              "nodeSelector": {},
+              "tolerations": []
+          }]
+    spec:
+      schedulerName: yunikorn
+      restartPolicy: Never
+      containers:
+        - name: sleep30
+          image: "alpine:latest"
+          imagePullPolicy: "IfNotPresent"
+          command: ["sleep", "30"]
+          resources:
+            requests:
+              cpu: "1"
+              memory: "1000M"
+
+```
+
+## Verify Configuration
+
+To verify that the configuration has been done completely and correctly, check the following things:
+1. When an app is submitted, verify that the expected number of placeholders is created by the scheduler (see the sketch after this list).
+If you define 2 task groups, 1 with minMember 1 and the other with minMember 5, that means we expect 6 placeholders
+to be created once the job is submitted.
+2. Verify the placeholder spec is correct. Each placeholder needs to have the same info as the real pod in the same taskGroup.
+Check fields including: namespace, pod resources, node-selector, and tolerations.
+3. Verify the placeholders can be allocated on the correct type of nodes, and verify the real pods are started by replacing the placeholder pods.
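+
+A quick way to do checks 1 and 2 (a sketch; `<namespace>` is the namespace the app was submitted to, and `<placeholder-pod>` is one of the placeholder pod names shown by the first command):
+
+```shell script
+# list the app's pods; the placeholder pods should show up alongside the real pods
+kubectl get pods -n <namespace>
+# inspect one placeholder and compare its spec with the real pod's spec
+kubectl describe pod <placeholder-pod> -n <namespace>
+```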
+
+## Troubleshooting
+
+Please see the gang scheduling section of the troubleshooting doc [here](trouble_shooting.md#gang-scheduling).
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/trouble_shooting.md b/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/trouble_shooting.md
new file mode 100644
index 0000000..deada94
--- /dev/null
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/trouble_shooting.md
@@ -0,0 +1,192 @@
+---
+id: trouble_shooting
+title: Trouble Shooting
+---
+
+<!--
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ -->
+ 
+## Scheduler logs
+
+### Retrieve scheduler logs
+
+Currently, the scheduler writes its logs to stdout/stderr, and the docker container handles the redirection of these logs to a
+local location on the underlying node; you can read more in the docker documentation [here](https://docs.docker.com/config/containers/logging/configure/).
+These logs can be retrieved by [kubectl logs](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#logs). For example:
+
+```shell script
+# get the scheduler pod
+kubectl get pod -l component=yunikorn-scheduler -n yunikorn
+NAME                                  READY   STATUS    RESTARTS   AGE
+yunikorn-scheduler-766d7d6cdd-44b82   2/2     Running   0          33h
+
+# retrieve logs
+kubectl logs yunikorn-scheduler-766d7d6cdd-44b82 yunikorn-scheduler-k8s -n yunikorn
+```
+
+In most cases, this command cannot get all logs because the scheduler rolls its logs very quickly. To retrieve older logs,
+you will need to set up [cluster-level logging](https://kubernetes.io/docs/concepts/cluster-administration/logging/#cluster-level-logging-architectures).
+The recommended setup is to leverage [fluentd](https://www.fluentd.org/) to collect and persist logs on external storage, e.g. S3. 
+
+### Set Logging Level
+
+:::note
+Changing the logging level requires a restart of the scheduler pod.
+:::
+
+Stop the scheduler:
+
+```shell script
+kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=0
+```
+Edit the deployment config in vim:
+
+```shell script
+kubectl edit deployment yunikorn-scheduler -n yunikorn
+```
+
+Add `LOG_LEVEL` to the `env` field of the container template. For example, setting `LOG_LEVEL` to `0` sets the logging
+level to `INFO`.
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+ ...
+spec:
+  template: 
+   ...
+    spec:
+      containers:
+      - env:
+        - name: LOG_LEVEL
+          value: '0'
+```
+
+Start the scheduler:
+
+```shell script
+kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=1
+```
+
+Available logging levels:
+
+| Value | Logging Level |
+|:-----:|:-------------:|
+|  -1   |     DEBUG     |
+|   0   |     INFO      |
+|   1   |     WARN      |
+|   2   |     ERROR     |
+|   3   |    DPanic     |
+|   4   |     Panic     |
+|   5   |     Fatal     |
+
+## Pods are stuck at Pending state
+
+If some pods are stuck in the Pending state, that means the scheduler could not find a node to allocate them to. There are
+several possible causes:
+
+### 1. None of the nodes satisfy the pod placement requirements
+
+A pod can be configured with placement constraints, such as a [node-selector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector),
+[affinity/anti-affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity),
+or it may lack a certain toleration for node [taints](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/), etc.
+To debug such issues, you can describe the pod:
+
+```shell script
+kubectl describe pod <pod-name> -n <namespace>
+```
+
+The pod events will contain the predicate failures, which explain why the nodes are not qualified for the allocation.
+
+### 2. The queue is running out of capacity
+
+If the queue is running out of capacity, pods will be pending, waiting for queue resources to become available. To check if a queue still
+has enough capacity for the pending pods, there are several approaches:
+
+1) Check the queue usage from the YuniKorn UI
+
+If you do not know how to access the UI, you can refer to the document [here](../get_started/get_started.md#访问-web-ui). Go
+to the `Queues` page and navigate to the queue where this job was submitted. You will be able to see the available capacity
+left in the queue.
+
+2) Check the pod events
+
+Run `kubectl describe pod` to get the pod events. If you see an event like
+`Application <appID> does not fit into <queuePath> queue`, that means the pod could not be allocated because the queue
+is running out of capacity.
+
+The pod will be allocated once some other pods in this queue are completed or removed. If the pod remains pending even though
+the queue has capacity, that may be because it is waiting for the cluster to scale up.
+
+## Restart the scheduler
+
+YuniKorn can recover its state upon a restart. The YuniKorn scheduler pod is deployed as a deployment, so restarting the scheduler
+can be done by scaling the replicas down and up:
+
+```shell script
+kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=0
+kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=1
+```
+
+## Gang Scheduling
+
+### 1. No placeholders created, app's pods are pending
+
+*Reason*: This is usually because the app was rejected by the scheduler, therefore none of its pods are scheduled.
+The common reasons for the rejection are: 1) The taskGroups definition is invalid. The scheduler does a
+sanity check upon app submission to ensure all the taskGroups are defined correctly; if this info is malformed,
+the scheduler rejects the app. 2) The total min resources defined in the taskGroups are bigger than the queue's max
+capacity, so the scheduler rejects the app because it won't fit into the queue's capacity. Check the pod events for relevant messages,
+and you will also be able to find more detailed error messages in the scheduler's log.
+
+*Solution*: Correct the taskGroups definition and retry submitting the app. 
+
+### 2. Not all placeholders can be allocated
+
+*Reason*: The placeholders also consume resources; if not all of them can be allocated, that usually means either the queue
+or the cluster does not have sufficient resources for them. In this case, the placeholders will be cleaned up after a certain
+amount of time, defined by the `placeholderTimeoutInSeconds` scheduling policy parameter.
+
+*Solution*: Note, once the placeholder timeout is reached, currently the app will transition to the failed state and cannot be scheduled
+anymore. You can increase the placeholder timeout value if you are willing to wait for a longer time. In the future, a fallback policy
+might be added to provide some retry other than failing the app.
+
+### 3. Not all placeholders are swapped
+
+*Reason*: This usually means the app has fewer actual pods than the minMember defined in the taskGroups.
+
+*Solution*: Check the `minMember` in the taskGroup field and ensure it is correctly set. The `minMember` can be less than
+the actual number of pods; setting it to be bigger than the actual number of pods is invalid.
+
+### 4. Placeholders are not cleaned up when the app terminates
+
+*Reason*: All the placeholders are set with an [ownerReference](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents)
+pointing to the first real pod of the app, or to the controller reference. If the placeholders could not be cleaned up, that means
+garbage collection is not working properly. 
+
+*Solution*: Check the placeholder `ownerReference` and the garbage collector in Kubernetes.
+
+
+## Still got questions?
+
+No problem! The Apache YuniKorn community will be happy to help. You can reach out to the community with the following options:
+
+1. Post your questions to dev@yunikorn.apache.org
+2. Join the [YuniKorn slack channel](https://join.slack.com/t/yunikornworkspace/shared_invite/enQtNzAzMjY0OTI4MjYzLTBmMDdkYTAwNDMwNTE3NWVjZWE1OTczMWE4NDI2Yzg3MmEyZjUyYTZlMDE5M2U4ZjZhNmYyNGFmYjY4ZGYyMGE) and post your questions to the `#yunikorn-user` channel.
+3. Join the [community sync up meetings](http://yunikorn.apache.org/community/getInvolved#community-meetings) and directly talk to the community members. 
\ No newline at end of file
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/workloads/run_spark.md b/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/workloads/run_spark.md
new file mode 100644
index 0000000..9f748eb
--- /dev/null
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/workloads/run_spark.md
@@ -0,0 +1,149 @@
+---
+id: run_spark
+title: Run Spark Jobs
+description: How to run Spark jobs with YuniKorn
+keywords:
+ - spark
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+:::note
+This document assumes you have YuniKorn and its admission-controller both installed. Please refer to
+[get started](../../get_started/get_started.md) to see how that is done.
+:::
+
+## Prepare the docker image for Spark
+
+To run Spark on Kubernetes, you'll need the Spark docker images. You can 1) use the docker images provided by the YuniKorn
+team, or 2) build one from scratch. If you want to build your own Spark docker image, you can:
+* Download a Spark version that has Kubernetes support from https://github.com/apache/spark
+* Build Spark with Kubernetes support:
+```shell script
+mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.4 -Phive -Pkubernetes -Phive-thriftserver -DskipTests package
+```
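+
+Once the build finishes, the Spark distribution also ships a helper script for building and pushing the docker image (a sketch; the repository name and tag below are placeholders):
+```shell script
+./bin/docker-image-tool.sh -r <your-repo> -t v2.4.4 build
+./bin/docker-image-tool.sh -r <your-repo> -t v2.4.4 push
+```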
+
+## Create a namespace for Spark jobs
+
+Create a namespace:
+
+```shell script
+cat <<EOF | kubectl apply -f -
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: spark-test
+EOF
+```
+
+Create service account and cluster role bindings under `spark-test` namespace:
+
+```shell script
+cat <<EOF | kubectl apply -n spark-test -f -
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: spark
+  namespace: spark-test
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: spark-cluster-role
+  namespace: spark-test
+rules:
+- apiGroups: [""]
+  resources: ["pods"]
+  verbs: ["get", "watch", "list", "create", "delete"]
+- apiGroups: [""]
+  resources: ["configmaps"]
+  verbs: ["get", "create", "delete"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: spark-cluster-role-binding
+  namespace: spark-test
+subjects:
+- kind: ServiceAccount
+  name: spark
+  namespace: spark-test
+roleRef:
+  kind: ClusterRole
+  name: spark-cluster-role
+  apiGroup: rbac.authorization.k8s.io
+EOF
+```
+
+:::note
+Do NOT use `ClusterRole` and `ClusterRoleBinding` to run Spark jobs in production, please configure a more fine-grained
+security context for running Spark jobs. See more about how to configure proper RBAC rules [here](https://kubernetes.io/docs/reference/access-authn-authz/rbac/).
+:::
+
+## Submit a Spark job
+
+If you are running this from a local machine, you will need to start the proxy in order to talk to the API server.
+```shell script
+kubectl proxy
+```
+
+Run a simple SparkPi job (this assumes that the Spark binaries are installed in the `/usr/local` directory).
+```shell script
+export SPARK_HOME=/usr/local/spark-2.4.4-bin-hadoop2.7/
+${SPARK_HOME}/bin/spark-submit --master k8s://http://localhost:8001 --deploy-mode cluster --name spark-pi \
+   --class org.apache.spark.examples.SparkPi \
+   --conf spark.executor.instances=1 \
+   --conf spark.kubernetes.namespace=spark-test \
+   --conf spark.kubernetes.executor.request.cores=1 \
+   --conf spark.kubernetes.container.image=apache/yunikorn:spark-2.4.4 \
+   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-test:spark \
+   local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar
+```
+
+You'll see the Spark driver and executors being created on Kubernetes:
+
+![spark-pods](./../../assets/spark-pods.png)
+
+You can also view the job info from the YuniKorn UI. If you do not know how to access the YuniKorn UI, please read the document
+[here](../../get_started/get_started.md#访问-web-ui).
+
+![spark-jobs-on-ui](./../../assets/spark-jobs-on-ui.png)
+
+## What happens behind the scenes?
+
+When the Spark job is submitted to the cluster, the job is submitted to the `spark-test` namespace. The Spark driver pod will
+first be created under this namespace. Since this cluster has the YuniKorn admission-controller enabled, when the driver pod
+gets created, the admission-controller mutates the pod's spec and injects `schedulerName=yunikorn`; by doing this, the
+default K8s scheduler will skip this pod and it will be scheduled by YuniKorn instead. See how this is done by [configuring
+another scheduler in Kubernetes](https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/).
+
+The default configuration has the placement rule enabled, which automatically maps the `spark-test` namespace to a YuniKorn
+queue `root.spark-test`. All Spark jobs submitted to this namespace will be automatically submitted to that queue first.
+To see more about how placement rules work, please see the doc [placement-rules](user_guide/placement_rules.md). So far,
+the namespace defines the security context of the pods, and the queue determines how the job and pods will be scheduled,
+taking into account job ordering, queue resource fairness, etc. Note, this is the simplest setup, which doesn't enforce
+queue capacities. The queue is considered to have unlimited capacity.
+
+YuniKorn reuses the Spark application ID set in the label `spark-app-selector`; this job is submitted
+to YuniKorn and treated as one job. The job is scheduled and runs as long as there are sufficient resources in the cluster.
+YuniKorn allocates the driver pod to a node, binds the pod and starts all the containers. Once the driver pod is started,
+it requests a number of executor pods to run its tasks. Those pods will be created in the same namespace and
+scheduled by YuniKorn as well.
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/workloads/run_tensorflow.md b/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/workloads/run_tensorflow.md
new file mode 100644
index 0000000..3330aa4
--- /dev/null
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/workloads/run_tensorflow.md
@@ -0,0 +1,93 @@
+---
+id: run_tf
+title: Run TensorFlow Jobs
+description: How to run TensorFlow jobs with YuniKorn
+keywords:
+ - tensorflow
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+This guide gives an overview of how to set up the [training-operator](https://github.com/kubeflow/training-operator)
+and how to run a TensorFlow job with the YuniKorn scheduler. The training-operator is a unified training operator maintained by
+Kubeflow. It supports not only TensorFlow but also PyTorch, XGBoost, etc.
+
+## Install training-operator
+You can use the following command to install the training operator, in the kubeflow namespace by default. If you have problems with the installation,
+please refer to [this doc](https://github.com/kubeflow/training-operator#installation) for details.
+```
+kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.3.0"
+```
+
+## Prepare the docker image
+Before you start running a TensorFlow job on Kubernetes, you'll need to build the docker image.
+1. Download the files from [deployment/examples/tfjob](https://github.com/apache/incubator-yunikorn-k8shim/tree/master/deployments/examples/tfjob)
+2. Build the docker image with the following command:
+
+```
+docker build -f Dockerfile -t kubeflow/tf-dist-mnist-test:1.0 .
+```
+
+## Run a TensorFlow job
+Here is a TFJob yaml for MNIST [example](https://github.com/apache/incubator-yunikorn-k8shim/blob/master/deployments/examples/tfjob/tf-job-mnist.yaml).
+
+```yaml
+apiVersion: kubeflow.org/v1
+kind: TFJob
+metadata:
+  name: dist-mnist-for-e2e-test
+  namespace: kubeflow
+spec:
+  tfReplicaSpecs:
+    PS:
+      replicas: 2
+      restartPolicy: Never
+      template:
+        metadata:
+          labels:
+            applicationId: "tf_job_20200521_001"
+            queue: root.sandbox
+        spec:
+          schedulerName: yunikorn
+          containers:
+            - name: tensorflow
+              image: kubeflow/tf-dist-mnist-test:1.0
+    Worker:
+      replicas: 4
+      restartPolicy: Never
+      template:
+        metadata:
+          labels:
+            applicationId: "tf_job_20200521_001"
+            queue: root.sandbox
+        spec:
+          schedulerName: yunikorn
+          containers:
+            - name: tensorflow
+              image: kubeflow/tf-dist-mnist-test:1.0
+```
+Create the TFJob:
+```
+kubectl create -f deployments/examples/tfjob/tf-job-mnist.yaml
+```
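+You can check the status of the TFJob with (a sketch; the job name matches the metadata above):
+```
+kubectl get tfjob dist-mnist-for-e2e-test -n kubeflow
+```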
+You can view the job info from the YuniKorn UI. If you do not know how to access the YuniKorn UI,
+please read the document [here](../../get_started/get_started.md#访问-web-ui).
+
+![tf-job-on-ui](../../assets/tf-job-on-ui.png)
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/developer_guide b/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/developer_guide
deleted file mode 120000
index 6f5db06..0000000
--- a/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/developer_guide
+++ /dev/null
@@ -1 +0,0 @@
-../../../../versioned_docs/version-0.12.1/developer_guide
\ No newline at end of file
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/get_started/get_started.md b/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/get_started/get_started.md
index 6601bac..035d25d 100644
--- a/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/get_started/get_started.md
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/get_started/get_started.md
@@ -25,7 +25,7 @@ under the License.
 
 在阅读本指南之前,我们假设您有一个Kubernetes集群或本地 Kubernetes 开发环境,例如 MiniKube。
 还假定 `kubectl` 在您的环境路径内,并且配置正确。
-遵循此 [指南](../developer_guide/env_setup.md) 来讲述如何使用 docker-desktop 设置本地Kubernetes集群。
+遵循此 [指南](developer_guide/env_setup.md) 来讲述如何使用 docker-desktop 设置本地Kubernetes集群。
 
 ## 安装
 
@@ -43,7 +43,7 @@ helm install yunikorn yunikorn/yunikorn --namespace yunikorn
 `admission-controller` 一旦安装,它将把所有集群流量路由到YuniKorn。
 这意味着资源调度会委托给YuniKorn。在Helm安装过程中,可以通过将 `embedAdmissionController` 标志设置为false来禁用它。
 
-如果你不想使用 Helm Chart,您可以找到我们的细节教程 [点击这里](../developer_guide/deployment.md) 。
+如果你不想使用 Helm Chart,您可以找到我们的细节教程 [点击这里](developer_guide/deployment.md) 。
 
 ## 卸载
 
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/performance/metrics.md b/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/performance/metrics.md
new file mode 100644
index 0000000..9dedbec
--- /dev/null
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/performance/metrics.md
@@ -0,0 +1,109 @@
+---
+id: metrics
+title: Scheduler Metrics
+keywords:
+ - metrics
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+YuniKorn leverages [Prometheus](https://prometheus.io/) to record metrics. The metrics system keeps track of the
+scheduler's critical execution paths to reveal potential performance bottlenecks. Currently, there are three categories
+of metrics:
+
+- scheduler: generic metrics of the scheduler, such as allocation latency, number of apps, etc.
+- queue: each queue has its own metrics sub-system, tracking queue status.
+- event: records various changes of events in YuniKorn.
+
+All metrics are declared in the `yunikorn` namespace.
+###    Scheduler Metrics
+
+| Metrics Name          | Metrics Type  | Description  | 
+| --------------------- | ------------  | ------------ |
+| containerAllocation   | Counter       | Total number of attempts to allocate containers. State of the attempt includes `allocated`, `rejected`, `error`, `released`. Increase only.  |
+| applicationSubmission | Counter       | Total number of application submissions. State of the attempt includes `accepted` and `rejected`. Increase only. |
+| applicationStatus     | Gauge         | Total number of applications, by status. State of the application includes `running` and `completed`.  | 
+| totalNodeActive       | Gauge         | Total number of active nodes.                          |
+| totalNodeFailed       | Gauge         | Total number of failed nodes.                          |
+| nodeResourceUsage     | Gauge         | Total resource usage of node, by resource name.        |
+| schedulingLatency     | Histogram     | Latency of the main scheduling routine, in seconds.    |
+| nodeSortingLatency    | Histogram     | Latency of all nodes sorting, in seconds.              |
+| appSortingLatency     | Histogram     | Latency of all applications sorting, in seconds.       |
+| queueSortingLatency   | Histogram     | Latency of all queues sorting, in seconds.             |
+| tryNodeLatency        | Histogram     | Latency of node condition checks for container allocations, such as placement constraints, in seconds. |
+
+###    Queue Metrics
+
+| Metrics Name              | Metrics Type  | Description |
+| ------------------------- | ------------- | ----------- |
+| appMetrics                | Counter       | Application metrics, recording the total number of applications. State of the application includes `accepted`, `rejected` and `Completed`.     |
+| usedResourceMetrics       | Gauge         | Queue used resource.     |
+| pendingResourceMetrics    | Gauge         | Queue pending resource.  |
+| availableResourceMetrics  | Gauge         | Queue available resource.    |
+
+###    Event Metrics
+
+| Metrics Name             | Metrics Type  | Description |
+| ------------------------ | ------------  | ----------- |
+| totalEventsCreated       | Gauge         | Total events created.          |
+| totalEventsChanneled     | Gauge         | Total events channeled.        |
+| totalEventsNotChanneled  | Gauge         | Total events not channeled.    |
+| totalEventsProcessed     | Gauge         | Total events processed.        |
+| totalEventsStored        | Gauge         | Total events stored.           |
+| totalEventsNotStored     | Gauge         | Total events not stored.       |
+| totalEventsCollected     | Gauge         | Total events collected.        |
+
+## Access Metrics
+
+YuniKorn metrics are collected through the Prometheus client library and exposed via the scheduler's RESTful service.
+Once the scheduler is started, they can be accessed via the endpoint http://localhost:9080/ws/v1/metrics.
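+
+If the scheduler runs inside the cluster, a quick way to check the endpoint from your workstation is to port-forward the
+scheduler service and curl it (the service name and port are assumptions based on a default Helm install):
+
+```shell script
+kubectl port-forward svc/yunikorn-service 9080:9080 -n yunikorn
+# in another terminal
+curl -s http://localhost:9080/ws/v1/metrics | head
+```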
+
+## Aggregate Metrics to Prometheus
+
+It's simple to set up a Prometheus server to scrape YuniKorn metrics periodically. Follow these steps:
+
+- Setup Prometheus (read more from [Prometheus docs](https://prometheus.io/docs/prometheus/latest/installation/))
+
+- Configure Prometheus; a sample configuration:
+
+```yaml
+global:
+  scrape_interval:     3s
+  evaluation_interval: 15s
+
+scrape_configs:
+  - job_name: 'yunikorn'
+    scrape_interval: 1s
+    metrics_path: '/ws/v1/metrics'
+    static_configs:
+    - targets: ['docker.for.mac.host.internal:9080']
+```
+
+- Start Prometheus
+
+```shell script
+docker pull prom/prometheus:latest
+docker run -p 9090:9090 -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
+```
+
+Use `docker.for.mac.host.internal` instead of `localhost` if you are running Prometheus in a local docker container
+on macOS. Once started, open the Prometheus web UI at http://localhost:9090/graph. You'll see all available metrics from the
+YuniKorn scheduler.
+
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/performance/performance_tutorial.md b/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/performance/performance_tutorial.md
new file mode 100644
index 0000000..8dfa57c
--- /dev/null
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/performance/performance_tutorial.md
@@ -0,0 +1,452 @@
+---
+id: performance_tutorial
+title: Benchmarking Tutorial
+keywords:
+ - performance
+ - tutorial
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Overview
+
+The YuniKorn community continues to optimize the performance of the scheduler, ensuring that YuniKorn satisfies the performance requirements of large-scale batch workloads. Thus, the community has built some useful tools for performance benchmarking that can be reused across releases. This document introduces these tools and the steps to run them.
+
+## Hardware
+
+Be aware that performance results are highly variable depending on the underlying hardware. All results published in this doc can only be used as references. We encourage everyone to run similar tests on their own environments in order to get results based on their own hardware. This doc is for demonstration purposes only.
+
+The list of servers used in this test is below (huge thanks to [National Taichung University of Education](http://www.ntcu.edu.tw/newweb/index.htm), [Kuan-Chou Lai](http://www.ntcu.edu.tw/kclai/) for providing these servers for running tests):
+
+| Machine Type          | CPU | Memory | Download/upload(Mbps) |
+| --------------------- | --- | ------ | --------------------- |
+| HP                    | 16  | 36G    | 525.74/509.86         |
+| HP                    | 16  | 30G    | 564.84/461.82         |
+| HP                    | 16  | 30G    | 431.06/511.69         |
+| HP                    | 24  | 32G    | 577.31/576.21         |
+| IBM blade H22         | 16  | 38G    | 432.11/4.15           |
+| IBM blade H22         | 16  | 36G    | 714.84/4.14           |
+| IBM blade H22         | 16  | 42G    | 458.38/4.13           |
+| IBM blade H22         | 16  | 42G    | 445.42/4.13           |
+| IBM blade H22         | 16  | 32G    | 400.59/4.13           |
+| IBM blade H22         | 16  | 12G    | 499.87/4.13           |
+| IBM blade H23         | 8   | 32G    | 468.51/4.14           |
+| WS660T                | 8   | 16G    | 87.73/86.30           |
+| ASUSPRO D640MB_M640SA | 4   | 8G     | 92.43/93.77           |
+| PRO E500 G6_WS720T    | 16  | 8G     | 90/87.18              |
+| WS E500 G6_WS720T     | 8   | 40G    | 92.61/89.78           |
+| E500 G5               | 8   | 8G     | 91.34/85.84           |
+| WS E500 G5_WS690T     | 12  | 16G    | 92.2/93.76            |
+| WS E500 G5_WS690T     | 8   | 32G    | 91/89.41              |
+| WS E900 G4_SW980T     | 80  | 512G   | 89.24/87.97           |
+
+The following steps are needed on each server, otherwise the large-scale testing may fail due to limits on the number of users/processes/open files.
+
+### 1. Set /etc/sysctl.conf
+```
+kernel.pid_max=400000
+fs.inotify.max_user_instances=50000
+fs.inotify.max_user_watches=52094
+```
+### 2. Set /etc/security/limits.conf
+
+```
+* soft nproc 4000000
+* hard nproc 4000000
+root soft nproc 4000000
+root hard nproc 4000000
+* soft nofile 50000
+* hard nofile 50000
+root soft nofile 50000
+root hard nofile 50000
+```
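+
+As a quick sanity check after editing both files, the kernel settings can typically be reloaded and the limits inspected
+with standard Linux commands (a re-login may be needed before the new limits apply):
+
+```
+sysctl -p
+ulimit -n
+ulimit -u
+```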
+---
+
+## Deploy workflow
+
+Before going into the details, here are the general steps used in our tests:
+
+- [Step 1](#Kubernetes): Properly configure the Kubernetes API server and controller manager, then add worker nodes.
+- [Step 2](#Setup-Kubemark): Deploy hollow pods, which simulate worker nodes and are named hollow nodes. After all hollow nodes are in ready status, we need to cordon all native nodes (the physical nodes in the cluster, not the simulated ones) to avoid allocating test workload pods to the native nodes.
+- [Step 3](#Deploy-YuniKorn): Deploy YuniKorn using the Helm chart on the master node, scale the Deployment down to 0 replicas, and [modify the port](#Setup-Prometheus) in `prometheus.yml` to match the port of the service.
+- [Step 4](#Run-tests): Deploy 50k Nginx pods for testing; the API server will create them, but since the YuniKorn scheduler Deployment has been scaled down to 0 replicas, all Nginx pods will be stuck in the Pending state.
+- [Step 5](../user_guide/trouble_shooting.md#restart-the-scheduler): Scale the YuniKorn Deployment back up to 1 replica, and cordon the master node to avoid YuniKorn allocating Nginx pods there (see the cordon example after this list). In this step, YuniKorn will start collecting the metrics.
+- [Step 6](#Collect-and-Observe-YuniKorn-metrics): Observe the metrics exposed in the Prometheus UI.
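+
+For reference, cordoning nodes (steps 2 and 5 above) is plain `kubectl`; the node names below are placeholders:
+
+```
+kubectl cordon <native-node-name>
+kubectl cordon <master-node-name>
+```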
+---
+
+## Setup Kubemark
+
+[Kubemark](https://github.com/kubernetes/kubernetes/tree/master/test/kubemark) is a performance testing tool which allows users to run experiments on simulated clusters. The primary use case is scalability testing. The basic idea is to run tens or hundreds of fake kubelet nodes on one physical node in order to simulate large-scale clusters. In our tests, we leverage Kubemark to simulate up to a 4K-node cluster on less than 20 physical nodes.
+
+### 1. Build image
+
+##### Clone the Kubernetes repo and build the kubemark binary
+
+```
+git clone https://github.com/kubernetes/kubernetes.git
+```
+```
+cd kubernetes
+```
+```
+KUBE_BUILD_PLATFORMS=linux/amd64 make kubemark GOFLAGS=-v GOGCFLAGS="-N -l"
+```
+
+##### Copy the kubemark binary to the image folder and build the kubemark docker image
+
+```
+cp _output/bin/kubemark cluster/images/kubemark
+```
+```
+IMAGE_TAG=v1.XX.X make build
+```
+After this step, you will have a kubemark image that can simulate cluster nodes. You can upload it to Docker Hub or just deploy it locally.
+
+### 2. Install Kubemark
+
+##### Create kubemark namespace
+
+```
+kubectl create ns kubemark
+```
+
+##### Create configmap
+
+```
+kubectl create configmap node-configmap -n kubemark --from-literal=content.type="test-cluster"
+```
+
+##### Create secret
+
+```
+kubectl create secret generic kubeconfig --type=Opaque --namespace=kubemark --from-file=kubelet.kubeconfig={kubeconfig_file_path} --from-file=kubeproxy.kubeconfig={kubeconfig_file_path}
+```
+### 3. Label node
+
+We need to label all native nodes, otherwise the scheduler might allocate hollow pods to other simulated hollow nodes. We leverage the node selector in the YAML to allocate hollow pods to native nodes.
+
+```
+kubectl label node {node name} tag=tagName
+```
+
+### 4. Deploy Kubemark
+
+The hollow-node.yaml is shown below; there are some parameters we can configure.
+
+```
+apiVersion: v1
+kind: ReplicationController
+metadata:
+  name: hollow-node
+  namespace: kubemark
+spec:
+  replicas: 2000  # the number of nodes you want to simulate
+  selector:
+      name: hollow-node
+  template:
+    metadata:
+      labels:
+        name: hollow-node
+    spec:
+      nodeSelector:  # leverage the label to allocate hollow pods to native nodes
+        tag: tagName  
+      initContainers:
+      - name: init-inotify-limit
+        image: docker.io/busybox:latest
+        imagePullPolicy: IfNotPresent
+        command: ['sysctl', '-w', 'fs.inotify.max_user_instances=200'] # set the same as max_user_instances on the actual node
+        securityContext:
+          privileged: true
+      volumes:
+      - name: kubeconfig-volume
+        secret:
+          secretName: kubeconfig
+      - name: logs-volume
+        hostPath:
+          path: /var/log
+      containers:
+      - name: hollow-kubelet
+        image: 0yukali0/kubemark:1.20.10 # the kubemark image you built
+        imagePullPolicy: IfNotPresent
+        ports:
+        - containerPort: 4194
+        - containerPort: 10250
+        - containerPort: 10255
+        env:
+        - name: NODE_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.name
+        command:
+        - /kubemark
+        args:
+        - --morph=kubelet
+        - --name=$(NODE_NAME)
+        - --kubeconfig=/kubeconfig/kubelet.kubeconfig
+        - --alsologtostderr
+        - --v=2
+        volumeMounts:
+        - name: kubeconfig-volume
+          mountPath: /kubeconfig
+          readOnly: true
+        - name: logs-volume
+          mountPath: /var/log
+        resources:
+          requests:    # the resources of the hollow pod, can be modified
+            cpu: 20m
+            memory: 50M
+        securityContext:
+          privileged: true
+      - name: hollow-proxy
+        image: 0yukali0/kubemark:1.20.10 # the kubemark image you built
+        imagePullPolicy: IfNotPresent
+        env:
+        - name: NODE_NAME
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.name
+        command:
+        - /kubemark
+        args:
+        - --morph=proxy
+        - --name=$(NODE_NAME)
+        - --use-real-proxier=false
+        - --kubeconfig=/kubeconfig/kubeproxy.kubeconfig
+        - --alsologtostderr
+        - --v=2
+        volumeMounts:
+        - name: kubeconfig-volume
+          mountPath: /kubeconfig
+          readOnly: true
+        - name: logs-volume
+          mountPath: /var/log
+        resources:  # the resources of the hollow pod, can be modified
+          requests:
+            cpu: 20m
+            memory: 50M
+      tolerations:
+      - effect: NoExecute
+        key: node.kubernetes.io/unreachable
+        operator: Exists
+      - effect: NoExecute
+        key: node.kubernetes.io/not-ready
+        operator: Exists
+```
+
+Once done editing, apply it to the cluster:
+
+```
+kubectl apply -f hollow-node.yaml
+```
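+
+You can then check that the hollow nodes have registered and are in `Ready` status (the node names are derived from the
+ReplicationController above, so they are prefixed with `hollow-node`):
+
+```
+kubectl get pods -n kubemark
+kubectl get nodes | grep hollow-node
+```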
+
+---
+
+## Deploy YuniKorn
+
+#### Install YuniKorn with helm
+
+We can install YuniKorn with Helm; please refer to this [doc](https://yunikorn.apache.org/docs/#install).
+We need to tune some parameters based on the default configuration. We recommend cloning the [release repo](https://github.com/apache/incubator-yunikorn-release) and modifying the parameters in `values.yaml`.
+
+```
+git clone https://github.com/apache/incubator-yunikorn-release.git
+cd helm-charts/yunikorn
+```
+
+#### Configuration
+
+The modifications in `values.yaml` are:
+
+- increased memory/cpu resources for the scheduler pod
+- disabled the admission controller
+- set the app sorting policy to FAIR
+
+Please see the changes below:
+
+```
+resources:
+  requests:
+    cpu: 14
+    memory: 16Gi
+  limits:
+    cpu: 14
+    memory: 16Gi
+```
+```
+embedAdmissionController: false
+```
+```
+configuration: |
+  partitions:
+    -
+      name: default
+      queues:
+        - name: root
+          submitacl: '*'
+          queues:
+            -
+              name: sandbox
+              properties:
+                application.sort.policy: fair
+```
+
+#### Install YuniKorn with local release repo
+
+```
+helm install yunikorn . --namespace yunikorn
+```
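+
+As described in the deploy workflow above (step 3), scale the scheduler Deployment down to 0 replicas before creating the
+test workload, so that the test pods queue up as Pending (the Deployment name assumes a default install):
+
+```
+kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=0
+```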
+
+---
+
+## Setup Prometheus
+
+YuniKorn exposes its scheduling metrics via Prometheus. Thus, we need to set up a Prometheus server to collect these metrics.
+
+### 1. Download Prometheus release
+
+```
+wget https://github.com/prometheus/prometheus/releases/download/v2.30.3/prometheus-2.30.3.linux-amd64.tar.gz
+```
+```
+tar xvfz prometheus-*.tar.gz
+cd prometheus-*
+```
+
+### 2. Configure prometheus.yml
+
+```
+global:
+  scrape_interval:     3s
+  evaluation_interval: 15s
+
+scrape_configs:
+  - job_name: 'yunikorn'
+    scrape_interval: 1s
+    metrics_path: '/ws/v1/metrics'
+    static_configs:
+    - targets: ['docker.for.mac.host.internal:9080']
+    # 9080 is the scheduler's internal port; port-forward to it or replace 9080 with the service's port
+```
+
+### 3. Launch Prometheus
+```
+./prometheus --config.file=prometheus.yml
+```
+
+---
+## Run tests
+
+Once the environment is set up, you are ready to run workloads and collect results. The YuniKorn community has some useful tools to run workloads and collect metrics; more details will be published here.
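+
+Below is a minimal sketch of the Nginx test workload described in the deploy workflow (step 4). The name, queue, replica
+count and resource requests are illustrative and should be adapted to your cluster:
+
+```
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: nginx-throughput-test
+spec:
+  replicas: 50000
+  selector:
+    matchLabels:
+      app: nginx
+  template:
+    metadata:
+      labels:
+        app: nginx
+        applicationId: "nginx-throughput-test"
+        queue: root.sandbox
+    spec:
+      schedulerName: yunikorn
+      containers:
+        - name: nginx
+          image: nginx:latest
+          imagePullPolicy: IfNotPresent
+          resources:
+            requests:
+              cpu: 5m
+              memory: 20M
+```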
+
+---
+
+## Collect and Observe YuniKorn metrics
+
+After Prometheus is launched, YuniKorn metrics can be easily collected. Here are the [docs](./metrics.md) for YuniKorn metrics. YuniKorn tracks some key scheduling metrics which measure the latency of critical scheduling paths. These metrics include:
+
+ - **scheduling_latency_seconds:** Latency of the main scheduling routine, in seconds.
+ - **app_sorting_latency_seconds**: Latency of all applications sorting, in seconds.
+ - **node_sorting_latency_seconds**: Latency of all nodes sorting, in seconds.
+ - **queue_sorting_latency_seconds**: Latency of all queues sorting, in seconds.
+ - **container_allocation_attempt_total**: Total number of attempts to allocate containers. State of the attempt includes `allocated`, `rejected`, `error`, `released`. Increase only.
+
+You can easily select metrics and generate graphs in the Prometheus UI, such as:
+
+![Prometheus Metrics List](./../assets/prometheus.png)
+
+
+---
+
+## Performance Tuning
+
+### Kubernetes
+
+The default K8s setup limits the number of concurrent requests, which limits the overall throughput of the cluster. In this section, we introduce a few parameters that need to be tuned up in order to increase the overall throughput of the cluster.
+
+#### kubeadm
+
+Set pod-network mask
+
+```
+kubeadm init --pod-network-cidr=10.244.0.0/8
+```
+
+#### CNI
+
+Modify CNI mask and resources.
+
+```
+  net-conf.json: |
+    {
+      "Network": "10.244.0.0/8",
+      "Backend": {
+        "Type": "vxlan"
+      }
+    }
+```
+```
+  resources:
+    requests:
+      cpu: "100m"
+      memory: "200Mi"
+    limits:
+      cpu: "100m"
+      memory: "200Mi"
+```
+
+
+#### Api-Server
+
+In the Kubernetes API server, we need to modify two parameters: `max-mutating-requests-inflight` and `max-requests-inflight`. Those two parameters represent the API request bandwidth. Because we will generate a large number of pod requests, we need to increase those two parameters. Modify `/etc/kubernetes/manifests/kube-apiserver.yaml`:
+
+```
+--max-mutating-requests-inflight=3000
+--max-requests-inflight=3000
+```
+
+#### Controller-Manager
+
+In the Kubernetes controller manager, we need to increase the value of three parameters: `node-cidr-mask-size`, `kube-api-burst` and `kube-api-qps`. `kube-api-burst` and `kube-api-qps` control the request bandwidth from the controller manager to the API server. `node-cidr-mask-size` represents the node CIDR mask size; it needs to be adjusted as well in order to scale up to thousands of nodes.
+
+
+Modify `/etc/kubernetes/manifests/kube-controller-manager.yaml`:
+
+```
+--node-cidr-mask-size=21  # log2(max number of pods in cluster)
+--kube-api-burst=3000
+--kube-api-qps=3000
+```
+
+#### kubelet
+
+On a single worker node, we can run 110 pods by default. To get higher node resource utilization, we need to add some parameters to the kubelet launch command and restart it.
+
+Modify the start arguments in `/etc/systemd/system/kubelet.service.d/10-kubeadm.conf`, add `--max-pods=300` to the start arguments, and restart the kubelet:
+
+```
+systemctl daemon-reload
+systemctl restart kubelet
+```
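+
+On kubeadm-based setups, one common way to pass the flag is the `KUBELET_EXTRA_ARGS` environment variable referenced by the
+drop-in file (this is an assumption; the exact mechanism depends on your distribution), e.g. in `/etc/default/kubelet`:
+
+```
+KUBELET_EXTRA_ARGS=--max-pods=300
+```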
+
+---
+
+## Summary
+
+With Kubemark and Prometheus, we can easily run benchmark tests, collect YuniKorn metrics and analyze the performance. This helps us identify performance bottlenecks in the scheduler and eliminate them. The YuniKorn community will continue to improve these tools in the future, and continue to gain more performance improvements.
\ No newline at end of file
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/user_guide/gang_scheduling.md b/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/user_guide/gang_scheduling.md
new file mode 100644
index 0000000..47b5722
--- /dev/null
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/user_guide/gang_scheduling.md
@@ -0,0 +1,285 @@
+---
+id: gang_scheduling
+title: Gang Scheduling
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## What is Gang Scheduling
+
+When Gang Scheduling is enabled, YuniKorn schedules the app only when
+the app’s minimal resource request can be satisfied. Otherwise, apps
+will wait in the queue. Apps are queued in hierarchical queues;
+with gang scheduling enabled, each resource queue is assigned the
+maximum number of applications that can run concurrently with their minimum resources guaranteed.
+
+![Gang Scheduling](./../assets/gang_scheduling_iintro.png)
+
+## Enable Gang Scheduling
+
+There is no cluster-wide configuration needed to enable Gang Scheduling.
+The scheduler actively monitors the metadata of each app; if the app includes
+a valid taskGroups definition, it will be considered for gang scheduling.
+
+:::info Task Group
+A task group is a “gang” of tasks in an app; these tasks have the same resource profile
+and the same placement constraints. They are considered homogeneous requests that can be
+treated as the same kind by the scheduler.
+:::
+
+### Prerequisite
+
+For queues which run gang-scheduling-enabled applications, the queue sorting policy needs to be set to either
+`FIFO` or `StateAware`. To configure the queue sorting policy, please refer to the doc: [app sorting policies](user_guide/sorting_policies.md#Application_sorting).
+
+:::info Why FIFO based sorting policy?
+When Gang Scheduling is enabled, the scheduler proactively reserves resources
+for each application. If the queue sorting policy is not FIFO based (StateAware is a FIFO-based sorting policy),
+the scheduler might reserve partial resources for each app, causing resource segmentation issues.
+:::
+
+### App Configuration
+
+On Kubernetes, YuniKorn discovers apps by loading metadata from individual pods; the first pod of the app
+is required to carry a full copy of the app metadata. If the app doesn’t have any notion of a first or second pod,
+then all pods are required to carry the same taskGroups info. Gang scheduling requires a taskGroups definition,
+which can be specified via pod annotations. The required fields are:
+
+| Annotation                                     | Value |
+|----------------------------------------------- |---------------------	|
+| yunikorn.apache.org/task-group-name 	         | Task group name, it must be unique within the application |
+| yunikorn.apache.org/task-groups                | A list of task groups, each item containing all the info defined for that task group |
+| yunikorn.apache.org/schedulingPolicyParameters | Optional. Arbitrary key-value pairs to define scheduling policy parameters. Please read the [schedulingPolicyParameters section](#scheduling-policy-parameters) |
+
+#### How many task groups needed?
+
+This depends on how many different types of pods this app requests from K8s. A task group is a “gang” of tasks in an app;
+these tasks have the same resource profile and the same placement constraints. They are considered homogeneous
+requests that can be treated as the same kind by the scheduler. Using Spark as an example, each job needs 2 task groups,
+one for the driver pod and the other one for the executor pods.
+
+#### How to define task groups?
+
+The task group definition is a copy of the app’s real pod definition; values for fields like resources, node-selector
+and tolerations should be the same as for the real pods. This ensures the scheduler can reserve resources with the
+exact correct pod specification.
+
+#### Scheduling Policy Parameters
+
+Scheduling-policy-related configurable parameters. Apply the parameters in the following format in the pod's annotations:
+
+```yaml
+annotations:
+   yunikorn.apache.org/schedulingPolicyParameters: "PARAM1=VALUE1 PARAM2=VALUE2 ..."
+```
+
+Currently, the following parameters are supported:
+
+`placeholderTimeoutInSeconds`
+
+Default value: *15 minutes*.
+This parameter defines the reservation timeout: how long the scheduler should wait before giving up on allocating all the placeholders.
+The timeout timer starts to tick when the scheduler *allocates the first placeholder pod*. This ensures that if the scheduler
+cannot schedule all the placeholder pods, it will eventually give up after a certain amount of time, so that the resources can be
+freed up and used by other apps. If none of the placeholders can be allocated, this timeout won't kick in. To avoid placeholder
+pods getting stuck forever, please refer to [troubleshooting](trouble_shooting.md#gang-scheduling) for solutions.
+
+`gangSchedulingStyle`
+
+Valid values: *Soft*, *Hard*
+
+Default value: *Soft*.
+This parameter defines the fallback mechanism if the app encounters gang issues due to placeholder pod allocation.
+See more details in the [Gang Scheduling styles](#gang-scheduling-styles) section.
+
+More scheduling parameters will be added in order to provide more flexibility when scheduling apps.
+
+#### Example
+
+The following example is a yaml file for a job. This job launches 2 pods and each pod sleeps for 30 seconds.
+The notable addition to the pod spec is *spec.template.metadata.annotations*, where we define `yunikorn.apache.org/task-group-name`
+and `yunikorn.apache.org/task-groups`.
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: gang-scheduling-job-example
+spec:
+  completions: 2
+  parallelism: 2
+  template:
+    metadata:
+      labels:
+        app: sleep
+        applicationId: "gang-scheduling-job-example"
+        queue: root.sandbox
+      annotations:
+        yunikorn.apache.org/task-group-name: task-group-example
+        yunikorn.apache.org/task-groups: |-
+          [{
+              "name": "task-group-example",
+              "minMember": 2,
+              "minResource": {
+                "cpu": "100m",
+                "memory": "50M"
+              },
+              "nodeSelector": {},
+              "tolerations": []
+          }]
+    spec:
+      schedulerName: yunikorn
+      restartPolicy: Never
+      containers:
+        - name: sleep30
+          image: "alpine:latest"
+          command: ["sleep", "30"]
+          resources:
+            requests:
+              cpu: "100m"
+              memory: "50M"
+```
+
+When this job is submitted to Kubernetes, 2 pods will be created using the same template, and they all belong to one taskGroup:
+*“task-group-example”*. YuniKorn will create 2 placeholder pods, each using the resources specified in the taskGroup definition.
+When both placeholders are allocated, the scheduler will bind the 2 real sleep pods using the spots reserved by the placeholders.
+
+You can add more than one taskGroup if necessary; each taskGroup is identified by its name. It is
+required to map each real pod to a pre-defined taskGroup by setting the taskGroup name. Note that
+the task group name only needs to be unique within an application.
+
+### Enable Gang scheduling for Spark jobs
+
+Each Spark job runs 2 types of pods, driver and executor. Hence, we need to define 2 task groups for each job.
+The annotations for the driver pod look like:
+
+```yaml
+annotations:
+  yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=30"
+  yunikorn.apache.org/task-group-name: "spark-driver"
+  yunikorn.apache.org/task-groups: |-
+    [{
+        "name": "spark-driver",
+        "minMember": 1,
+        "minResource": {
+          "cpu": "1",
+          "memory": "2Gi"
+        },
+        "nodeSelector": {},
+        "tolerations": []
+    },
+    {
+        "name": "spark-executor",
+        "minMember": 10,
+        "minResource": {
+          "cpu": "1",
+          "memory": "2Gi"
+        }
+    }]
+```
+
+:::note
+Spark driver and executor pods have memory overhead, which needs to be considered in the taskGroup resources. 
+:::
+
+For all the executor pods,
+
+```yaml
+annotations:
+  # the task group name should match one of the names
+  # defined in the task-groups annotation
+  yunikorn.apache.org/task-group-name: "spark-executor"
+```
+
+Once the job is submitted to the scheduler, the job won’t be scheduled immediately.
+Instead, the scheduler will ensure it gets its minimal resources before actually starting the driver/executors. 
+
+## Gang scheduling Styles
+
+There are 2 gang scheduling styles supported: Soft and Hard. The style can be configured per app to define how the app will behave in case gang scheduling fails.
+
+- `Hard style`: when this style is used, the original behavior is kept; more precisely, if the application cannot be scheduled according to gang scheduling rules and it times out, it will be marked as failed, without retrying to schedule it.
+- `Soft style`: when the app cannot be gang scheduled, it will fall back to normal scheduling, and the non-gang scheduling strategy will be used to achieve best-effort scheduling. When this happens, the app transitions to the Resuming state and all the remaining placeholder pods will be cleaned up.
+
+**Default style used**: `Soft`
+
+**Enable a specific style**: the style can be changed by setting the `gangSchedulingStyle` parameter to Soft or Hard in the application definition.
+
+#### Example
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: gang-app-timeout
+spec:
+  completions: 4
+  parallelism: 4
+  template:
+    metadata:
+      labels:
+        app: sleep
+        applicationId: gang-app-timeout
+        queue: fifo
+      annotations:
+        yunikorn.apache.org/task-group-name: sched-style
+        yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard"
+        yunikorn.apache.org/task-groups: |-
+          [{
+              "name": "sched-style",
+              "minMember": 4,
+              "minResource": {
+                "cpu": "1",
+                "memory": "1000M"
+              },
+              "nodeSelector": {},
+              "tolerations": []
+          }]
+    spec:
+      schedulerName: yunikorn
+      restartPolicy: Never
+      containers:
+        - name: sleep30
+          image: "alpine:latest"
+          imagePullPolicy: "IfNotPresent"
+          command: ["sleep", "30"]
+          resources:
+            requests:
+              cpu: "1"
+              memory: "1000M"
+
+```
+
+## Verify Configuration
+
+To verify if the configuration has been done completely and correctly, check the following things:
+1. When an app is submitted, verify that the expected number of placeholders are created by the scheduler.
+If you define 2 task groups, 1 with minMember 1 and the other with minMember 5, that means we expect 6 placeholders
+to be created once the job is submitted.
+2. Verify the placeholder spec is correct. Each placeholder needs to have the same info as the real pod in the same taskGroup.
+Check fields including: namespace, pod resources, node-selector, and tolerations.
+3. Verify the placeholders are allocated on the correct type of nodes, and verify the real pods are started by replacing the placeholder pods.
+
+## Troubleshooting
+
+Please see the gang scheduling section of the troubleshooting doc [here](trouble_shooting.md#gang-scheduling).
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/user_guide/trouble_shooting.md b/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/user_guide/trouble_shooting.md
new file mode 100644
index 0000000..deada94
--- /dev/null
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/user_guide/trouble_shooting.md
@@ -0,0 +1,192 @@
+---
+id: trouble_shooting
+title: Trouble Shooting
+---
+
+<!--
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ -->
+ 
+## Scheduler logs
+
+### Retrieve scheduler logs
+
+Currently, the scheduler writes its logs to stdout/stderr, and the docker container handles the redirection of these logs to a
+local location on the underlying node; you can read more about this [here](https://docs.docker.com/config/containers/logging/configure/).
+These logs can be retrieved with [kubectl logs](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#logs), such as:
+
+```shell script
+# get the scheduler pod
+kubectl get pod -l component=yunikorn-scheduler -n yunikorn
+NAME                                  READY   STATUS    RESTARTS   AGE
+yunikorn-scheduler-766d7d6cdd-44b82   2/2     Running   0          33h
+
+# retrieve logs
+kubectl logs yunikorn-scheduler-766d7d6cdd-44b82 yunikorn-scheduler-k8s -n yunikorn
+```
+
+In most cases, this command cannot get all logs because the scheduler rolls logs very fast. To retrieve older logs,
+you will need to set up [cluster-level logging](https://kubernetes.io/docs/concepts/cluster-administration/logging/#cluster-level-logging-architectures).
+The recommended setup is to leverage [fluentd](https://www.fluentd.org/) to collect and persist logs on external storage, e.g. S3. 
+
+### Set Logging Level
+
+:::note
+Changing the logging level requires a restart of the scheduler pod.
+:::
+
+Stop the scheduler:
+
+```shell script
+kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=0
+```
+Edit the deployment config:
+
+```shell script
+kubectl edit deployment yunikorn-scheduler -n yunikorn
+```
+
+Add `LOG_LEVEL` to the `env` field of the container template. For example, setting `LOG_LEVEL` to `0` sets the logging
+level to `INFO`.
+
+```yaml
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+ ...
+spec:
+  template: 
+   ...
+    spec:
+      containers:
+      - env:
+        - name: LOG_LEVEL
+          value: '0'
+```
+
+Start the scheduler:
+
+```shell script
+kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=1
+```
+
+Available logging levels:
+
+| Value 	| Logging Level 	|
+|:-----:	|:-------------:	|
+|   -1  	|     DEBUG     	|
+|   0   	|      INFO     	|
+|   1   	|      WARN     	|
+|   2   	|     ERROR     	|
+|   3   	|     DPanic    	|
+|   4   	|     Panic     	|
+|   5   	|     Fatal     	|
+
+## Pods are stuck at Pending state
+
+If some pods are stuck in the Pending state, that means the scheduler could not find a node to allocate them to. There are
+several possible causes:
+
+### 1. None of the nodes satisfy the pod placement requirements
+
+A pod can be configured with placement constraints, such as a [node-selector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector),
+[affinity/anti-affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity),
+or a missing toleration for node [taints](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/), etc.
+To debug such issues, you can describe the pod by:
+
+```shell script
+kubectl describe pod <pod-name> -n <namespace>
+```
+
+The pod events will contain the predicate failures, which explain why nodes are not qualified for allocation.
+
+### 2. The queue is running out of capacity
+
+If the queue is running out of capacity, pods will be pending, waiting for available queue resources. To check whether a queue still
+has enough capacity for the pending pods, there are several approaches:
+
+1) Check the queue usage from the YuniKorn UI
+
+If you do not know how to access the UI, you can refer to the document [here](../get_started/get_started.md#访问-web-ui). Go
+to the `Queues` page and navigate to the queue this job is submitted to. You will be able to see the available capacity
+left for the queue.
+
+2) Check the pod events
+
+Run `kubectl describe pod` to get the pod events. If you see an event like
+`Application <appID> does not fit into <queuePath> queue`, that means the pod could not be allocated because the queue
+is running out of capacity.
+
+The pod will be allocated when some other pods in this queue are completed or removed. If the pod remains pending even
+though the queue has capacity, that may be because it is waiting for the cluster to scale up.
+
+## Restart the scheduler
+
+YuniKorn can recover its state upon a restart. The YuniKorn scheduler pod is deployed as a Deployment; restarting the scheduler
+can be done by scaling the replicas down and up:
+
+```shell script
+kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=0
+kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=1
+```
+
+## Gang Scheduling
+
+### 1. No placeholders created, app's pods are pending
+
+*Reason*: This is usually because the app was rejected by the scheduler, therefore none of the pods are scheduled.
+The common reasons for a rejection are: 1) The taskGroups definition is invalid. The scheduler does a
+sanity check upon app submission to ensure all the taskGroups are defined correctly; if this info is malformed,
+the scheduler rejects the app. 2) The total min resources defined in the taskGroups are bigger than the queue's max
+capacity, so the scheduler rejects the app because it won't fit into the queue's capacity. Check the pod events for relevant messages,
+and you will also be able to find more detailed error messages in the scheduler's log.
+
+*Solution*: Correct the taskGroups definition and retry submitting the app. 
+
+### 2. Not all placeholders can be allocated
+
+*Reason*: The placeholders also consume resources; if not all of them can be allocated, that usually means either the queue
+or the cluster does not have sufficient resources for them. In this case, the placeholders will be cleaned up after a certain
+amount of time, defined by the `placeholderTimeoutInSeconds` scheduling policy parameter.
+
+*Solution*: Note that if the placeholder timeout is reached, currently the app will transition to the failed state and cannot be scheduled
+anymore. You can increase the placeholder timeout value if you are willing to wait longer. In the future, a fallback policy
+might be added to provide a retry instead of failing the app.
+
+### 3. Not all placeholders are swapped
+
+*Reason*: This usually means the app's actual number of pods is less than the `minMember` defined in the taskGroups.
+
+*Solution*: Check the `minMember` in the taskGroup field and ensure it is correctly set. The `minMember` can be less than
+the actual number of pods; setting it bigger than the actual number of pods is invalid.
+
+### 4. Placeholders are not cleaned up when the app terminates
+
+*Reason*: All the placeholders have an [ownerReference](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents)
+set to the first real pod of the app, or to the controller reference. If the placeholders are not cleaned up, that means
+the garbage collection is not working properly. 
+
+*Solution*: check the placeholder `ownerReference` and the garbage collector in Kubernetes.    
+
+
+## Still got questions?
+
+No problem! The Apache YuniKorn community will be happy to help. You can reach out to the community with the following options:
+
+1. Post your questions to dev@yunikorn.apache.org
+2. Join the [YuniKorn slack channel](https://join.slack.com/t/yunikornworkspace/shared_invite/enQtNzAzMjY0OTI4MjYzLTBmMDdkYTAwNDMwNTE3NWVjZWE1OTczMWE4NDI2Yzg3MmEyZjUyYTZlMDE5M2U4ZjZhNmYyNGFmYjY4ZGYyMGE) and post your questions to the `#yunikorn-user` channel.
+3. Join the [community sync up meetings](http://yunikorn.apache.org/community/getInvolved#community-meetings) and directly talk to the community members. 
\ No newline at end of file
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/user_guide/workloads/run_spark.md b/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/user_guide/workloads/run_spark.md
new file mode 100644
index 0000000..9f748eb
--- /dev/null
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/user_guide/workloads/run_spark.md
@@ -0,0 +1,149 @@
+---
+id: run_spark
+title: Run Spark Jobs
+description: How to run Spark jobs with YuniKorn
+keywords:
+ - spark
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+:::note
+This document assumes you have YuniKorn and its admission-controller both installed. Please refer to
+[get started](../../get_started/get_started.md) to see how that is done.
+:::
+
+## Prepare the docker image for Spark
+
+To run Spark on Kubernetes, you'll need the Spark docker images. You can 1) use the docker images provided by the YuniKorn
+team, or 2) build one from scratch. If you want to build your own Spark docker image, you can:
+* Download a Spark version that has Kubernetes support from https://github.com/apache/spark
+* Build Spark with Kubernetes support:
+```shell script
+mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.4 -Phive -Pkubernetes -Phive-thriftserver -DskipTests package
+```
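+
+After the build completes, the Spark distribution ships a helper script to build (and optionally push) the container image;
+run it from the Spark distribution root, and note that the registry and tag below are placeholders:
+
+```shell script
+./bin/docker-image-tool.sh -r docker.io/<your-repo> -t v2.4.4 build
+```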
+
+## Create a namespace for Spark jobs
+
+Create a namespace:
+
+```shell script
+cat <<EOF | kubectl apply -f -
+apiVersion: v1
+kind: Namespace
+metadata:
+  name: spark-test
+EOF
+```
+
+Create service account and cluster role bindings under `spark-test` namespace:
+
+```shell script
+cat <<EOF | kubectl apply -n spark-test -f -
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: spark
+  namespace: spark-test
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: spark-cluster-role
+  namespace: spark-test
+rules:
+- apiGroups: [""]
+  resources: ["pods"]
+  verbs: ["get", "watch", "list", "create", "delete"]
+- apiGroups: [""]
+  resources: ["configmaps"]
+  verbs: ["get", "create", "delete"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: spark-cluster-role-binding
+  namespace: spark-test
+subjects:
+- kind: ServiceAccount
+  name: spark
+  namespace: spark-test
+roleRef:
+  kind: ClusterRole
+  name: spark-cluster-role
+  apiGroup: rbac.authorization.k8s.io
+EOF
+```
+
+:::note
+Do NOT use `ClusterRole` and `ClusterRoleBinding` to run Spark jobs in production, please configure a more fine-grained
+security context for running Spark jobs. See more about how to configure proper RBAC rules [here](https://kubernetes.io/docs/reference/access-authn-authz/rbac/).
+:::
+
+## Submit a Spark job
+
+If you are running this from a local machine, you will need to start the proxy in order to talk to the API server.
+```shell script
+kubectl proxy
+```
+
+Run a simple SparkPi job (this assumes that the Spark binaries are installed in the `/usr/local` directory).
+```shell script
+export SPARK_HOME=/usr/local/spark-2.4.4-bin-hadoop2.7/
+${SPARK_HOME}/bin/spark-submit --master k8s://http://localhost:8001 --deploy-mode cluster --name spark-pi \
+   --class org.apache.spark.examples.SparkPi \
+   --conf spark.executor.instances=1 \
+   --conf spark.kubernetes.namespace=spark-test \
+   --conf spark.kubernetes.executor.request.cores=1 \
+   --conf spark.kubernetes.container.image=apache/yunikorn:spark-2.4.4 \
+   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
+   local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar
+```
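+
+While the job runs, you can watch the driver and executor pods come up in the test namespace:
+
+```shell script
+kubectl get pods -n spark-test -w
+```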
+
+You'll see the Spark driver and executors being created on Kubernetes:
+
+![spark-pods](./../../assets/spark-pods.png)
+
+You can also view the job info from YuniKorn UI. If you do not know how to access the YuniKorn UI, please read the document
+[here](../../get_started/get_started.md#访问-web-ui).
+
+![spark-jobs-on-ui](./../../assets/spark-jobs-on-ui.png)
+
+## What happens behind the scenes?
+
+When the Spark job is submitted to the cluster, the job is submitted to the `spark-test` namespace. The Spark driver pod will
+be created first under this namespace. Since this cluster has the YuniKorn admission-controller enabled, when the driver pod
+gets created, the admission-controller mutates the pod's spec and injects `schedulerName=yunikorn`; by doing this, the
+default K8s scheduler will skip this pod and it will be scheduled by YuniKorn instead. See how this is done by [configuring
+another scheduler in Kubernetes](https://kubernetes.io/docs/tasks/extend-kubernetes/configure-multiple-schedulers/).
+
+The default configuration has the placement rule enabled, which automatically maps the `spark-test` namespace to a YuniKorn
+queue `root.spark-test`. All Spark jobs submitted to this namespace will be automatically submitted to that queue first.
+To see more about how placement rules work, please see the doc [placement-rules](user_guide/placement_rules.md). So far,
+the namespace defines the security context of the pods, and the queue determines how the job and pods will be scheduled,
+taking into account job ordering, queue resource fairness, etc. Note, this is the simplest setup, which doesn't enforce
+queue capacities. The queue is considered as having unlimited capacity.
+
+YuniKorn reuses the Spark application ID set in the `spark-app-selector` label, and this job is submitted
+to YuniKorn and considered as one job. The job is scheduled and runs as long as there are sufficient resources in the cluster.
+YuniKorn allocates the driver pod to a node, binds the pod and starts all the containers. Once the driver pod starts,
+it requests a number of executor pods to run its tasks. Those pods will be created in the same namespace and
+scheduled by YuniKorn as well.
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/user_guide/workloads/run_tensorflow.md b/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/user_guide/workloads/run_tensorflow.md
new file mode 100644
index 0000000..3330aa4
--- /dev/null
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/version-0.12.1/user_guide/workloads/run_tensorflow.md
@@ -0,0 +1,93 @@
+---
+id: run_tf
+title: Run TensorFlow Jobs
+description: How to run TensorFlow jobs with YuniKorn
+keywords:
+ - tensorflow
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+This guide gives an overview of how to set up the [training-operator](https://github.com/kubeflow/training-operator)
+and how to run a TensorFlow job with the YuniKorn scheduler. The training-operator is a unified training operator maintained by
+Kubeflow. It supports not only TensorFlow but also PyTorch, XGBoost, etc.
+
+## Install training-operator
+You can use the following command to install the training operator in the kubeflow namespace (the default). If you have problems with the installation,
+please refer to [this doc](https://github.com/kubeflow/training-operator#installation) for details.
+```
+kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.3.0"
+```
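+
+To confirm the operator is up before submitting jobs, you can check its pod in the kubeflow namespace (the namespace is an
+assumption based on the default installation):
+
+```
+kubectl get pods -n kubeflow
+```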
+
+## Prepare the docker image
+Before you start running a TensorFlow job on Kubernetes, you'll need to build the docker image.
+1. Download the files from [deployment/examples/tfjob](https://github.com/apache/incubator-yunikorn-k8shim/tree/master/deployments/examples/tfjob)
+2. Build the docker image with the following command
+
+```
+docker build -f Dockerfile -t kubeflow/tf-dist-mnist-test:1.0 .
+```
+
+## Run a TensorFlow job
+Here is a TFJob yaml for MNIST [example](https://github.com/apache/incubator-yunikorn-k8shim/blob/master/deployments/examples/tfjob/tf-job-mnist.yaml).
+
+```yaml
+apiVersion: kubeflow.org/v1
+kind: TFJob
+metadata:
+  name: dist-mnist-for-e2e-test
+  namespace: kubeflow
+spec:
+  tfReplicaSpecs:
+    PS:
+      replicas: 2
+      restartPolicy: Never
+      template:
+        metadata:
+          labels:
+            applicationId: "tf_job_20200521_001"
+            queue: root.sandbox
+        spec:
+          schedulerName: yunikorn
+          containers:
+            - name: tensorflow
+              image: kubeflow/tf-dist-mnist-test:1.0
+    Worker:
+      replicas: 4
+      restartPolicy: Never
+      template:
+        metadata:
+          labels:
+            applicationId: "tf_job_20200521_001"
+            queue: root.sandbox
+        spec:
+          schedulerName: yunikorn
+          containers:
+            - name: tensorflow
+              image: kubeflow/tf-dist-mnist-test:1.0
+```
+Create the TFJob
+```
+kubectl create -f deployments/examples/tfjob/tf-job-mnist.yaml
+```
+You can view the job info in the YuniKorn UI. If you do not know how to access the YuniKorn UI,
+please read the document [here](../../get_started/get_started.md#访问-web-ui).
+
+![tf-job-on-ui](../../assets/tf-job-on-ui.png)