Posted to reviews@yunikorn.apache.org by yu...@apache.org on 2022/08/21 17:13:12 UTC

[yunikorn-site] branch master updated: [YUNIKORN-1034] Add Chinese translation for performance documents (#174)

This is an automated email from the ASF dual-hosted git repository.

yuchaoran pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-site.git


The following commit(s) were added to refs/heads/master by this push:
     new bd57ca76d [YUNIKORN-1034] Add Chinese translation for performance documents (#174)
bd57ca76d is described below

commit bd57ca76dd22633cac46e1191cc505f93c7f6142
Author: YuTeng Chen <45...@users.noreply.github.com>
AuthorDate: Mon Aug 22 01:13:08 2022 +0800

    [YUNIKORN-1034] Add Chinese translation for performance documents (#174)
    
    * Translation
    
    * [YUNIKORN-1034] Add Chinese translation for performance documents
---
 .../evaluate_perf_function_with_kubemark.md        | 120 ++++++++++++++++
 .../current/performance/metrics.md                 |  96 ++++++-------
 .../current/performance/performance_tutorial.md    | 159 ++++++++++-----------
 .../current/performance/profiling.md               | 115 +++++++++++++++
 4 files changed, 360 insertions(+), 130 deletions(-)

diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/evaluate_perf_function_with_kubemark.md b/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/evaluate_perf_function_with_kubemark.md
new file mode 100644
index 000000000..44f4c67eb
--- /dev/null
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/evaluate_perf_function_with_kubemark.md
@@ -0,0 +1,120 @@
+---
+id: evaluate_perf_function_with_kubemark
+title: 使用 Kubemark 评估 YuniKorn 的性能
+keywords:
+ - 性能
+ - 吞吐量
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+YuniKorn 社区关注调度程序的性能,并在各个版本中持续对其进行优化。 社区已经开发了一些工具来反复测试和调整性能。
+
+## 环境设置
+
+我们利用[Kubemark](https://github.com/kubernetes/kubernetes/blob/release-1.3/docs/devel/kubemark-guide.md#starting-a-kubemark-cluster)评估调度器的性能。 Kubemark是一个模拟大规模集群的测试工具。 它创建空节点,并在其上运行空kubelet来模拟真实kubelet的行为。 调度到这些空节点上的pod不会真正执行。 它能够创建一个满足我们实验要求的大集群,揭示YuniKorn调度器的性能。 请参阅有关如何设置环境的[详细步骤](performance/performance_tutorial.md)。
+
+## 调度程序吞吐量
+
+我们在模拟的大规模环境中设计了一些简单的基准测试场景,以评估调度器的性能。 我们的工具测量[吞吐量](https://en.wikipedia.org/wiki/Throughput)并使用这些关键指标来评估性能。 简而言之,调度程序吞吐量是指pod从在集群上被发现到被分配给节点的处理速率。
+
+在本实验中,我们使用 [Kubemark](https://github.com/kubernetes/kubernetes/blob/release-1.3/docs/devel/kubemark-guide.md#starting-a-kubemark-cluster) 设置了一个模拟的2000/4000节点集群。然后我们启动10个[部署](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/),每个部署分别设置5000个副本。 这模拟了大规模工作负载同时提交到K8s集群。 我们的工具会定期监控和检查pod状态,随着时间的推移,根据 `podSpec.StartTime` 计算启动的pod数量。 作为对比,我们将相同的实验应用到相同环境中的默认调度程序。 我们看到了YuniKorn相对于默认调度程序的性能优势,如下图所示:
+
+![Scheduler Throughput](./../assets/yunirkonVSdefault.png)
+<p align="center">图 1. Yunikorn 和默认调度器吞吐量 </p>
+
+图表记录了集群上所有 Pod 运行所花费的时间:
+
+|  节点数  | yunikorn        | k8s 默认调度器		| 差异   |
+|------------------	|:--------------:	|:---------------------: |:-----:  |
+| 2000(节点)       | 204(pods/秒)			| 49(pods/秒)			        |   416%  |
+| 4000(节点)       | 115(pods/秒)			| 48(pods/秒)			        |   240%  |
+
+为了使结果标准化,我们已经运行了几轮测试。 如上所示,与默认调度程序相比,YuniKorn实现了`2 倍`~`4 倍`的性能提升。
+
+:::note
+
+与其他性能测试一样,结果因底层硬件而异,例如服务器CPU/内存、网络带宽、I/O速度等。 为了获得适用于您的环境的准确结果,我们鼓励您在接近生产环境的集群上运行这些测试。
+
+:::
+
+## 性能分析
+
+我们从实验中得到的结果是令人鼓舞的。 通过观察更多的YuniKorn内部指标,我们进一步深入分析了性能,并找到了影响性能的几个关键方面。
+
+### K8s 限制
+
+我们发现整体性能实际上受到了K8s主服务的限制,例如api-server、controller-manager和etcd,在我们所有的实验中都没有达到YuniKorn的限制。 如果您查看内部调度指标,您可以看到:
+
+![Allocation latency](./../assets/allocation_4k.png)
+<p align="center">图 2. 4k 节点中的 Yunikorn 指标 </p>
+
+图2是Prometheus的截图,它记录了YuniKorn中的[内部指标](performance/metrics.md) `containerAllocation`。 它们是调度程序分配的 pod 数量,但不一定绑定到节点。 完成调度50k pod大约需要122秒,即410 pod/秒。 实际吞吐量下降到115个 Pod/秒,额外的时间用于绑定不同节点上的Pod。 如果K8s方面能赶上来,我们会看到更好的结果。 实际上,当我们在大规模集群上调整性能时,我们要做的第一件事就是调整API-server、控制器管理器中的一些参数,以提高吞吐量。 在[性能教程文档](performance/performance_tutorial.md)中查看更多信息。
+
+### 节点排序
+
+当集群大小增加时,我们看到YuniKorn的性能明显下降。 这是因为在YuniKorn中,我们对集群节点进行了完整排序,以便为给定的pod找到 **“best-fit”** 节点。 根据所使用的 [节点排序策略](./../user_guide/sorting_policies#node-sorting),这种策略使Pod分布更加优化。 但是,对节点进行排序很昂贵,在调度周期中这样做会产生很多开销。 为了克服这个问题,我们在 [YUNIKORN-807](https://issues.apache.org/jira/browse/YUNIKORN-807) 中改进了我们的节点排序机制,其背后的想法是使用 [B-Tree](https://en.wikipedia.org/wiki/B-tree) 来存储所有节点并在必要时应用增量更新。 这显著改善了延迟,根据我们的基准测试,这在500、1000、2000 和 5000个节点的集群上分别提高了 35 倍、42 倍、51 倍、74 倍。
+
+### 每个节点的前提条件检查
+
+在每个调度周期中,另一个耗时的部分是节点的“前提条件检查”。 在这个阶段,YuniKorn评估所有K8s标准断言(Predicates),例如节点选择器、pod亲和性/反亲和性等,以确定pod是否适合节点。 这些评估成本很高。
+
+我们做了两个实验来比较启用和禁用断言评估的情况。 请参阅以下结果:
+
+![Allocation latency](./../assets/predicateComaparation.png)
+<p align="center">图 3. Yunikorn 中的断言效果比较 </p>
+
+当断言评估被禁用时,吞吐量会提高很多。 我们进一步研究了整个调度周期的延迟分布和断言评估延迟。 并发现:
+
+![YK predicate latency](./../assets/predicate_4k.png)
+<p align="center">图 4. 断言延迟 </p>
+
+![YK scheduling with predicate](./../assets/scheduling_with_predicate_4k_.png)
+<p align="center">图 5. 启用断言的调度时间 </p>
+
+![YK scheduling with no predicate](./../assets/scheduling_no_predicate_4k.png)
+<p align="center">图 6. 不启用断言的调度时间 </p>
+
+总体而言,YuniKorn 调度周期运行得非常快,每个周期的延迟落在 **0.001s - 0.01s** 范围内。 其中大部分时间用于断言评估,是调度周期中其他部分的10倍。
+
+|				| 调度延迟分布(秒)	| 断言-评估延迟分布(秒)	|
+|-----------------------	|:---------------------:		|:---------------------:			|
+| 启用断言		| 0.01 - 0.1				| 0.01-0.1					|
+| 不启用断言	| 0.001 - 0.01				| 无						|
+
+## 为什么 YuniKorn 更快?
+
+默认调度器被创建为面向服务的调度器; 与YuniKorn相比,它在吞吐量方面的敏感性较低。 YuniKorn社区非常努力地保持出色的性能并不断改进。 YuniKorn可以比默认调度器运行得更快的原因是:
+
+* 短调度周期
+
+YuniKorn 保持调度周期短而高效。 YuniKorn 使用所有异步通信协议来确保所有关键路径都是非阻塞调用。 大多数地方只是在进行内存计算,这可能非常高效。 默认调度器利用 [调度框架](https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/),它为扩展调度器提供了很大的灵活性,但代价是性能。 调度周期变成了一条很长的链,因为它需要访问所有这些插件。
+
+* 异步事件处理
+
+YuniKorn利用异步事件处理框架来处理内部状态。 这使得核心调度周期可以快速运行而不会被任何昂贵的调用阻塞。 例如,默认调度程序需要将状态更新、事件写入pod对象,这是在调度周期内完成的。 这涉及将数据持久化到etcd,这可能很慢。 YuniKorn将所有此类事件缓存在一个队列中,并以异步方式写回pod。
+
+* 更快的节点排序
+
+[YUNIKORN-807](https://issues.apache.org/jira/browse/YUNIKORN-807)之后,YuniKorn进行了高效的增量节点排序。 这是建立在所谓的基于“资源权重”的节点评分机制之上的,它也可以通过插件进行扩展。 所有这些一起减少了计算节点分数时的开销。 相比之下,默认调度器提供了一些计算节点分数的扩展点,例如`PreScore`、`Score`和`NormalizeScore`。 这些计算量很大,并且在每个调度周期中都会调用它们。 请参阅[代码行](https://github.com/kubernetes/kubernetes/blob/481459d12dc82ab88e413886e2130c2a5e4a8ec4/pkg/scheduler/framework/runtime/framework.go#L857)中的详细信息。
+
+## 总结
+
+在测试过程中,我们发现YuniKorn的性能非常好,尤其是与默认调度程序相比。 我们已经确定了YuniKorn中可以继续提高性能的主要因素,并解释了为什么YuniKorn的性能优于默认调度程序。 我们还意识到将Kubernetes扩展到数千个节点时的局限性,可以通过使用其他技术(例如联合)来缓解这些局限性。 因此,YuniKorn是一个高效、高吞吐量的调度程序,非常适合在Kubernetes上运行批处理/混合工作负载。
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/metrics.md b/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/metrics.md
index 9dedbec73..7d6fa73e5 100644
--- a/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/metrics.md
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/metrics.md
@@ -1,8 +1,8 @@
 ---
 id: metrics
-title: Scheduler Metrics
+title: 调度程序指标
 keywords:
- - metrics
+ - 指标
 ---
 
 <!--
@@ -24,64 +24,62 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-YuniKorn leverages [Prometheus](https://prometheus.io/) to record metrics. The metrics system keeps tracking of
-scheduler's critical execution paths, to reveal potential performance bottlenecks. Currently, there are three categories
-for these metrics:
+YuniKorn利用[Prometheus](https://prometheus.io/) 记录指标。 度量系统不断跟踪调度程序的关键执行路径,以揭示潜在的性能瓶颈。 目前,这些指标分为三类:
 
-- scheduler: generic metrics of the scheduler, such as allocation latency, num of apps etc.
-- queue: each queue has its own metrics sub-system, tracking queue status.
-- event: record various changes of events in YuniKorn.
+- 调度器:调度器的通用指标,例如分配延迟、应用程序数量等。
+- 队列:每个队列都有自己的指标子系统,跟踪队列状态。
+- 事件:记录YuniKorn中事件的各种变化。
 
-all metrics are declared in `yunikorn` namespace.
-###    Scheduler Metrics
+所有指标都在`yunikorn`命名空间中声明。
+###    调度程序指标
 
-| Metrics Name          | Metrics Type  | Description  | 
+| 指标名称               | 指标类型        | 描述         | 
 | --------------------- | ------------  | ------------ |
-| containerAllocation   | Counter       | Total number of attempts to allocate containers. State of the attempt includes `allocated`, `rejected`, `error`, `released`. Increase only.  |
-| applicationSubmission | Counter       | Total number of application submissions. State of the attempt includes `accepted` and `rejected`. Increase only. |
-| applicationStatus     | Gauge         | Total number of application status. State of the application includes `running` and `completed`.  | 
-| totalNodeActive       | Gauge         | Total number of active nodes.                          |
-| totalNodeFailed       | Gauge         | Total number of failed nodes.                          |
-| nodeResourceUsage     | Gauge         | Total resource usage of node, by resource name.        |
-| schedulingLatency     | Histogram     | Latency of the main scheduling routine, in seconds.    |
-| nodeSortingLatency    | Histogram     | Latency of all nodes sorting, in seconds.              |
-| appSortingLatency     | Histogram     | Latency of all applications sorting, in seconds.       |
-| queueSortingLatency   | Histogram     | Latency of all queues sorting, in seconds.             |
-| tryNodeLatency        | Histogram     | Latency of node condition checks for container allocations, such as placement constraints, in seconds, in seconds. |
-
-###    Queue Metrics
-
-| Metrics Name              | Metrics Type  | Description |
+| containerAllocation   | Counter       | 尝试分配容器的总次数。 尝试状态包括`allocated`, `rejected`, `error`, `released`。 该指标只会增加。  |
+| applicationSubmission | Counter       | 应用程序提交的总数。 尝试的状态包括 `accepted`和`rejected`。 该指标只会增加。 |
+| applicationStatus     | Gauge         | 应用程序状态的总数。 应用程序的状态包括`running`和`completed`。  | 
+| totalNodeActive       | Gauge         | 活动节点总数。                          |
+| totalNodeFailed       | Gauge         | 失败节点的总数。                          |
+| nodeResourceUsage     | Gauge         | 节点的总资源使用情况,按资源名称。        |
+| schedulingLatency     | Histogram     | 主调度例程的延迟,以秒为单位。    |
+| nodeSortingLatency    | Histogram     | 所有节点排序的延迟,以秒为单位。              |
+| appSortingLatency     | Histogram     | 所有应用程序排序的延迟,以秒为单位。      |
+| queueSortingLatency   | Histogram     | 所有队列排序的延迟,以秒为单位。             |
+| tryNodeLatency        | Histogram     | 为容器分配执行节点条件检查(例如放置约束)的延迟,以秒为单位。 |
+
+###    队列指标
+
+| 指标名称                   | 指标类型        | 描述        |
 | ------------------------- | ------------- | ----------- |
-| appMetrics                | Counter       | Application Metrics, record the total number of applications. State of the application includes `accepted`,`rejected` and `Completed`.     |
-| usedResourceMetrics       | Gauge         | Queue used resource.     |
-| pendingResourceMetrics    | Gauge         | Queue pending resource.  |
-| availableResourceMetrics  | Gauge         | Used resource metrics related to queues etc.    |
+| appMetrics                | Counter       | 应用程序指标,记录应用程序总数。 应用程序的状态包括`accepted`、`rejected`和`Completed`。    |
+| usedResourceMetrics       | Gauge         | 队列已使用的资源。     |
+| pendingResourceMetrics    | Gauge         | 队列中待处理的资源。  |
+| availableResourceMetrics  | Gauge         | 与队列等相关的已用资源指标。    |
 
-###    Event Metrics
+###    事件指标
 
-| Metrics Name             | Metrics Type  | Description |
+| 指标名称                   | 指标类型        | 描述        |
 | ------------------------ | ------------  | ----------- |
-| totalEventsCreated       | Gauge         | Total events created.          |
-| totalEventsChanneled     | Gauge         | Total events channeled.        |
-| totalEventsNotChanneled  | Gauge         | Total events not channeled.    |
-| totalEventsProcessed     | Gauge         | Total events processed.        |
-| totalEventsStored        | Gauge         | Total events stored.           |
-| totalEventsNotStored     | Gauge         | Total events not stored.       |
-| totalEventsCollected     | Gauge         | Total events collected.        |
+| totalEventsCreated       | Gauge         | 创建的事件总数。          |
+| totalEventsChanneled     | Gauge         | 引导的事件总数。        |
+| totalEventsNotChanneled  | Gauge         | 未引导的事件总数。    |
+| totalEventsProcessed     | Gauge         | 处理的事件总数。        |
+| totalEventsStored        | Gauge         | 存储的事件总数。           |
+| totalEventsNotStored     | Gauge         | 未存储的事件总数。       |
+| totalEventsCollected     | Gauge         | 收集的事件总数。        |
 
-## Access Metrics
+## 访问指标
 
-YuniKorn metrics are collected through Prometheus client library, and exposed via scheduler restful service.
-Once started, they can be accessed via endpoint http://localhost:9080/ws/v1/metrics.
+YuniKorn指标通过Prometheus客户端库收集,并通过调度程序RESTful服务公开。
+启动后,可以通过端点http://localhost:9080/ws/v1/metrics访问它们。
 
-## Aggregate Metrics to Prometheus
+## Prometheus 的聚合指标
 
-It's simple to setup a Prometheus server to grab YuniKorn metrics periodically. Follow these steps:
+设置 Prometheus 服务器以定期获取 YuniKorn 指标很简单。 请按以下步骤操作:
 
-- Setup Prometheus (read more from [Prometheus docs](https://prometheus.io/docs/prometheus/latest/installation/))
+- 设置Prometheus(从[Prometheus 文档](https://prometheus.io/docs/prometheus/latest/installation/)了解更多信息)
 
-- Configure Prometheus rules: a sample configuration 
+- 配置Prometheus规则:示例配置
 
 ```yaml
 global:
@@ -96,14 +94,12 @@ scrape_configs:
     - targets: ['docker.for.mac.host.internal:9080']
 ```
 
-- start Prometheus
+- 启动 Prometheus
 
 ```shell script
 docker pull prom/prometheus:latest
 docker run -p 9090:9090 -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
 ```
 
-Use `docker.for.mac.host.internal` instead of `localhost` if you are running Prometheus in a local docker container
-on Mac OS. Once started, open Prometheus web UI: http://localhost:9090/graph. You'll see all available metrics from
-YuniKorn scheduler.
+如果您在Mac OS上的本地docker容器中运行Prometheus,请使用`docker.for.mac.host.internal`而不是`localhost`。 启动后,打开Prometheus网页界面:http://localhost:9090/graph。 您将看到来自YuniKorn调度程序的所有可用指标。
 
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/performance_tutorial.md b/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/performance_tutorial.md
index 16293b68c..32e4df7d0 100644
--- a/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/performance_tutorial.md
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/performance_tutorial.md
@@ -1,9 +1,9 @@
 ---
 id: performance_tutorial
-title: Benchmarking Tutorial
+title: 基准测试教程
 keywords:
- - performance
- - tutorial
+ - 性能
+ - 教程
 ---
 
 <!--
@@ -25,17 +25,17 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-## Overview
+## 概述
 
-The YuniKorn community continues to optimize the performance of the scheduler, ensuring that YuniKorn satisfies the performance requirements of large-scale batch workloads. Thus, the community has built some useful tools for performance benchmarking that can be reused across releases. This document introduces all these tools and steps to run them.
+YuniKorn社区不断优化调度器的性能,确保YuniKorn满足大规模批处理工作负载的性能要求。 因此,社区为性能基准测试构建了一些有用的工具,可以跨版本重用。 本文档介绍了所有这些工具和运行它们的步骤。
 
-## Hardware
+## 硬件
 
-Be aware that performance result is highly variable depending on the underlying  hardware. All results published in the doc can only be used as references. We encourage each individual to run similar tests on their own environments in order to get a result based on your own hardware. This doc is just for demonstration purpose.
+请注意,性能结果因底层硬件而异。 文档中发布的所有结果只能作为参考。 我们鼓励每个人在自己的环境中运行类似的测试,以便根据您自己的硬件获得结果。 本文档仅用于演示目的。
 
-A list of servers being used in this test are (Huge thanks to [National Taichung University of Education](http://www.ntcu.edu.tw/newweb/index.htm), [Kuan-Chou Lai](http://www.ntcu.edu.tw/kclai/) for providing these servers for running tests):
+本次测试中使用的服务器列表是(非常感谢[国立台中教育大学](http://www.ntcu.edu.tw/newweb/index.htm), [Kuan-Chou Lai](http://www.ntcu.edu.tw/kclai/) 为运行测试提供这些服务器):
 
-| Manchine Type         | CPU | Memory | Download/upload(Mbps) |
+| 机型                   | CPU |  内存  |   下载/上传(Mbps)       |
 | --------------------- | --- | ------ | --------------------- |
 | HP                    | 16  | 36G    | 525.74/509.86         |
 | HP                    | 16  | 30G    | 564.84/461.82         |
@@ -57,15 +57,15 @@ A list of servers being used in this test are (Huge thanks to [National Taichung
 | WS E500 G5_WS690T     | 8   | 32G    | 91/89.41              |
 | WS E900 G4_SW980T     | 80  | 512G   | 89.24/87.97           |
 
-The following steps are needed for each server, otherwise the large scale testing may fail due to the limited number of users/processes/open-files.
+每个服务器都需要执行以下步骤,否则由于用户/进程/打开文件的数量有限,大规模测试可能会失败。
 
-### 1. Set /etc/sysctl.conf
+### 1. 设置/etc/sysctl.conf
 ```
 kernel.pid_max=400000
 fs.inotify.max_user_instances=50000
 fs.inotify.max_user_watches=52094
 ```
-### 2. Set /etc/security/limits.conf
+### 2. 设置/etc/security/limits.conf
 
 ```
 * soft nproc 4000000
@@ -79,25 +79,25 @@ root hard nofile 50000
 ```
 ---
 
-## Deploy workflow
+## 部署工作流
 
-Before going into the details, here are the general steps used in our tests:
+在进入细节之前,这里是我们测试中使用的一般步骤:
 
-- [Step 1](#Kubernetes): Properly configure Kubernetes API server and controller manager, then add worker nodes.
-- [Step 2](#Setup-Kubemark): Deploy hollow pods,which will simulate worker nodes, name hollow nodes. After all hollow nodes in ready status, we need to cordon all native nodes, which are physical presence in the cluster, not the simulated nodes, to avoid we allocated test workload pod to native nodes.
-- [Step 3](#Deploy-YuniKorn): Deploy YuniKorn using the Helm chart on the master node, and scale down the Deployment to 0 replica, and [modify the port](#Setup-Prometheus) in `prometheus.yml` to match the port of the service.
-- [Step 4](#Run-tests): Deploy 50k Nginx pods for testing, and the API server will create them. But since the YuniKorn scheduler Deployment has been scaled down to 0 replica, all Nginx pods will be stuck in pending.
-- [Step 5](../user_guide/trouble_shooting.md#restart-the-scheduler): Scale up The YuniKorn Deployment back to 1 replica, and cordon the master node to avoid YuniKorn allocating Nginx pods there. In this step, YuniKorn will start collecting the metrics.
-- [Step 6](#Collect-and-Observe-YuniKorn-metrics): Observe the metrics exposed in Prometheus UI.
+- [步骤 1](#Kubernetes): 正确配置Kubernetes API服务器和控制器管理器,然后添加工作节点。
+- [步骤 2](#Setup-Kubemark): 部署空pod,它们将模拟工作节点,称为空节点。 在所有空节点都处于就绪状态后,我们需要封锁(cordon)所有原生节点,即集群中实际存在的物理节点,而不是模拟节点,以避免将测试工作负载 pod 分配给原生节点。
+- [步骤 3](#Deploy-YuniKorn): 在主节点上使用Helm chart部署YuniKorn,并将 Deployment 缩减为 0 副本,并在`prometheus.yml`中 [修改端口](#Setup-Prometheus) 以匹配服务的端口。
+- [步骤 4](#Run-tests): 部署50k Nginx pod进行测试,API服务器将创建它们。 但是由于YuniKorn调度程序Deployment已经被缩减到0个副本,所有的Nginx pod都将停留在等待状态。
+- [步骤 5](../user_guide/trouble_shooting.md#restart-the-scheduler): 将YuniKorn部署扩展回1个副本,并封锁主节点以避免YuniKorn 在那里分配Nginx pod。 在这一步中,YuniKorn将开始收集指标。
+- [步骤 6](#Collect-and-Observe-YuniKorn-metrics): 观察Prometheus UI中公开的指标。
 ---
 
-## Setup Kubemark
+## 设置 Kubemark
 
-[Kubemark](https://github.com/kubernetes/kubernetes/tree/master/test/kubemark) is a performance testing tool which allows users to run experiments on simulated clusters. The primary use case is the scalability testing. The basic idea is to run tens or hundreds of fake kubelet nodes on one physical node in order to simulate large scale clusters. In our tests, we leverage Kubemark to simulate up to a 4K-node cluster on less than 20 physical nodes.
+[Kubemark](https://github.com/kubernetes/kubernetes/tree/master/test/kubemark)是一个性能测试工具,允许用户在模拟集群上运行实验。 主要用例是可扩展性测试。 基本思想是在一个物理节点上运行数十或数百个假kubelet节点,以模拟大规模集群。 在我们的测试中,我们利用 Kubemark 在少于20个物理节点上模拟多达4K节点的集群。
 
-### 1. Build image
+### 1. 构建镜像
 
-##### Clone kubernetes repo, and build kubemark binary file
+##### 克隆kubernetes仓库,并构建kubemark二进制文件
 
 ```
 git clone https://github.com/kubernetes/kubernetes.git
@@ -109,7 +109,7 @@ cd kubernetes
 KUBE_BUILD_PLATFORMS=linux/amd64 make kubemark GOFLAGS=-v GOGCFLAGS="-N -l"
 ```
 
-##### Copy kubemark binary file to the image folder and build kubemark docker image
+##### 将kubemark二进制文件复制到镜像文件夹并构建kubemark docker镜像
 
 ```
 cp _output/bin/kubemark cluster/images/kubemark
@@ -117,38 +117,38 @@ cp _output/bin/kubemark cluster/images/kubemark
 ```
 IMAGE_TAG=v1.XX.X make build
 ```
-After this step, you can get the kubemark image which can simulate cluster node. You can upload it to Docker-Hub or just deploy it locally.
+完成此步骤后,您可以获得可以模拟集群节点的kubemark镜像。 您可以将其上传到Docker-Hub或仅在本地部署。
 
-### 2. Install Kubermark
+### 2. 安装Kubemark
 
-##### Create kubemark namespace
+##### 创建kubemark命名空间
 
 ```
 kubectl create ns kubemark
 ```
 
-##### Create configmap
+##### 创建configmap
 
 ```
 kubectl create configmap node-configmap -n kubemark --from-literal=content.type="test-cluster"
 ```
 
-##### Create secret
+##### 创建secret
 
 ```
 kubectl create secret generic kubeconfig --type=Opaque --namespace=kubemark --from-file=kubelet.kubeconfig={kubeconfig_file_path} --from-file=kubeproxy.kubeconfig={kubeconfig_file_path}
 ```
-### 3. Label node
+### 3. 标签节点
 
-We need to label all native nodes, otherwise the scheduler might allocate hollow pods to other simulated hollow nodes. We can leverage Node selector in yaml to allocate hollow pods to native nodes.
+我们需要给所有的原生节点打上标签,否则调度器可能会将空pod分配给其他模拟的空节点。 我们可以利用yaml中的节点选择器将空pod分配给原生节点。
 
 ```
 kubectl label node {node name} tag=tagName
 ```
 
-### 4. Deploy Kubemark
+### 4. 部署Kubemark
 
-The hollow-node.yaml is down below, there are some parameters we can configure.
+hollow-node.yaml如下所示,我们可以配置一些参数。
 
 ```
 apiVersion: v1
@@ -157,7 +157,7 @@ metadata:
   name: hollow-node
   namespace: kubemark
 spec:
-  replicas: 2000  # the node number you want to simulate
+  replicas: 2000  # 要模拟的节点数
   selector:
       name: hollow-node
   template:
@@ -165,13 +165,13 @@ spec:
       labels:
         name: hollow-node
     spec:
-      nodeSelector:  # leverage label to allocate to native node
+      nodeSelector:  # 利用标签分配给原生节点
         tag: tagName  
       initContainers:
       - name: init-inotify-limit
         image: docker.io/busybox:latest
         imagePullPolicy: IfNotPresent
-        command: ['sysctl', '-w', 'fs.inotify.max_user_instances=200'] # set as same as max_user_instance in actual node 
+        command: ['sysctl', '-w', 'fs.inotify.max_user_instances=200'] # 设置为与实际节点中的max_user_instance相同
         securityContext:
           privileged: true
       volumes:
@@ -183,7 +183,7 @@ spec:
           path: /var/log
       containers:
       - name: hollow-kubelet
-        image: 0yukali0/kubemark:1.20.10 # the kubemark image you build 
+        image: 0yukali0/kubemark:1.20.10 # 您构建的kubemark映像 
         imagePullPolicy: IfNotPresent
         ports:
         - containerPort: 4194
@@ -209,13 +209,13 @@ spec:
         - name: logs-volume
           mountPath: /var/log
         resources:
-          requests:    # the resource of hollow pod, can modify it.
+          requests:    # 空pod的资源,可以修改。
             cpu: 20m
             memory: 50M
         securityContext:
           privileged: true
       - name: hollow-proxy
-        image: 0yukali0/kubemark:1.20.10 # the kubemark image you build 
+        image: 0yukali0/kubemark:1.20.10 # 您构建的kubemark镜像 
         imagePullPolicy: IfNotPresent
         env:
         - name: NODE_NAME
@@ -237,7 +237,7 @@ spec:
           readOnly: true
         - name: logs-volume
           mountPath: /var/log
-        resources:  # the resource of hollow pod, can modify it.
+        resources:  # 空pod的资源,可以修改。
           requests:
             cpu: 20m
             memory: 50M
@@ -250,7 +250,7 @@ spec:
         operator: Exists
 ```
 
-once done editing, apply it to the cluster:
+完成编辑后,将其应用于集群:
 
 ```
 kubectl apply -f hollow-node.yaml
@@ -258,27 +258,26 @@ kubectl apply -f hollow-node.yaml
 
 ---
 
-## Deploy YuniKorn
+## 部署 YuniKorn
 
-#### Install YuniKorn with helm
+#### 使用helm安装YuniKorn
 
-We can install YuniKorn with Helm, please refer to this [doc](https://yunikorn.apache.org/docs/#install).
-We need to tune some parameters based on the default configuration. We recommend to clone the [release repo](https://github.com/apache/yunikorn-release) and modify the parameters in `value.yaml`.
+我们可以用 Helm 安装 YuniKorn,请参考这个[文档](https://yunikorn.apache.org/docs/#install)。 我们需要根据默认配置调整一些参数。 我们建议克隆[发布仓库](https://github.com/apache/yunikorn-release)并修改`value.yaml`中的参数。
 
 ```
 git clone https://github.com/apache/yunikorn-release.git
 cd helm-charts/yunikorn
 ```
 
-#### Configuration
+#### 配置
 
-The modifications in the `value.yaml` are:
+`value.yaml`中的修改是:
 
-- increased memory/cpu resources for the scheduler pod
-- disabled the admission controller
-- set the app sorting policy to FAIR
+- 增加调度程序 pod 的内存/cpu 资源
+- 禁用 admission controller
+- 将应用排序策略设置为 FAIR
 
-please see the changes below:
+请参阅以下更改:
 
 ```
 resources:
@@ -307,7 +306,7 @@ configuration: |
                 application.sort.policy: fair
 ```
 
-#### Install YuniKorn with local release repo
+#### 使用本地版本库安装YuniKorn
 
 ```
 Helm install yunikorn . --namespace yunikorn
@@ -315,11 +314,11 @@ Helm install yunikorn . --namespace yunikorn
 
 ---
 
-## Setup Prometheus
+## 设置Prometheus
 
-YuniKorn exposes its scheduling metrics via Prometheus. Thus, we need to set up a Prometheus server to collect these metrics.
+YuniKorn通过Prometheus公开其调度指标。 因此,我们需要设置一个Prometheus服务器来收集这些指标。
 
-### 1. Download Prometheus release
+### 1. 下载Prometheus版本
 
 ```
 wget https://github.com/prometheus/prometheus/releases/download/v2.30.3/prometheus-2.30.3.linux-amd64.tar.gz
@@ -329,7 +328,7 @@ tar xvfz prometheus-*.tar.gz
 cd prometheus-*
 ```
 
-### 2. Configure prometheus.yml
+### 2. 配置prometheus.yml
 
 ```
 global:
@@ -342,47 +341,47 @@ scrape_configs:
     metrics_path: '/ws/v1/metrics'
     static_configs:
     - targets: ['docker.for.mac.host.internal:9080'] 
-    # 9080 is internal port, need port forward or modify 9080 to service's port
+    # 9080为内部端口,需要端口转发或修改9080为服务端口
 ```
 
-### 3. Launch Prometheus
+### 3. 启动Prometheus
 ```
 ./prometheus --config.file=prometheus.yml
 ```
 
 ---
-## Run tests
+## 运行测试
 
-Once the environment is setup, you are good to run workloads and collect results. YuniKorn community has some useful tools to run workloads and collect metrics, more details will be published here.
+设置环境后,您就可以运行工作负载并收集结果了。 YuniKorn社区有一些有用的工具来运行工作负载和收集指标,更多详细信息将在此处发布。
 
 ---
 
-## Collect and Observe YuniKorn metrics
+## 收集和观察YuniKorn指标
 
-After Prometheus is launched, YuniKorn metrics can be easily collected. Here is the [docs](metrics.md) of YuniKorn metrics. YuniKorn tracks some key scheduling metrics which measure the latency of some critical scheduling paths. These metrics include:
+Prometheus 启动后,可以轻松收集 YuniKorn 指标。 这是 YuniKorn 指标的[文档](metrics.md)。 YuniKorn 跟踪一些关键调度指标,这些指标衡量一些关键调度路径的延迟。 这些指标包括:
 
- - **scheduling_latency_seconds:** Latency of the main scheduling routine, in seconds.
- - **app_sorting_latency_seconds**: Latency of all applications sorting, in seconds.
- - **node_sorting_latency_seconds**: Latency of all nodes sorting, in seconds.
- - **queue_sorting_latency_seconds**: Latency of all queues sorting, in seconds.
- - **container_allocation_attempt_total**: Total number of attempts to allocate containers. State of the attempt includes `allocated`, `rejected`, `error`, `released`. Increase only.
+ - **scheduling_latency_seconds:** 主调度例程的延迟,以秒为单位。
+ - **app_sorting_latency_seconds**: 所有应用程序排序的延迟,以秒为单位。
+ - **node_sorting_latency_seconds**: 所有节点排序的延迟,以秒为单位。
+ - **queue_sorting_latency_seconds**: 所有队列排序的延迟,以秒为单位。
+ - **container_allocation_attempt_total**: 尝试分配容器的总次数。 尝试状态包括 `allocated`、`rejected`、`error`、`released`。 该指标仅增加。
 
-you can select and generate graph on Prometheus UI easily, such as:
+您可以在Prometheus UI上轻松选择和生成图形,例如:
 
 ![Prometheus Metrics List](./../assets/prometheus.png)
 
 
 ---
 
-## Performance Tuning
+## 性能调优
 
 ### Kubernetes
 
-The default K8s setup has limited concurrent requests which limits the overall throughput of the cluster. In this section, we introduced a few parameters that need to be tuned up in order to increase the overall throughput of the cluster.
+默认的 K8s 设置限制了并发请求,这限制了集群的整体吞吐量。 在本节中,我们介绍了一些需要调整的参数,以提高集群的整体吞吐量。
 
 #### kubeadm
 
-Set pod-network mask
+设置pod网络掩码
 
 ```
 kubeadm init --pod-network-cidr=10.244.0.0/8
@@ -390,7 +389,7 @@ kubeadm init --pod-network-cidr=10.244.0.0/8
 
 #### CNI
 
-Modify CNI mask and resources.
+修改CNI掩码和资源。
 
 ```
   net-conf.json: |
@@ -414,7 +413,7 @@ Modify CNI mask and resources.
 
 #### Api-Server
 
-In the Kubernetes API server, we need to modify two parameters: `max-mutating-requests-inflight` and `max-requests-inflight`. Those two parameters represent the API request bandwidth. Because we will generate a large amount of pod request, we need to increase those two parameters. Modify `/etc/kubernetes/manifest/kube-apiserver.yaml`:
+在 Kubernetes API 服务器中,我们需要修改两个参数:`max-mutating-requests-inflight`和`max-requests-inflight`。 这两个参数代表API请求带宽。 因为我们会产生大量的Pod请求,所以我们需要增加这两个参数。修改`/etc/kubernetes/manifest/kube-apiserver.yaml`:
 
 ```
 --max-mutating-requests-inflight=3000
@@ -423,22 +422,22 @@ In the Kubernetes API server, we need to modify two parameters: `max-mutating-re
 
 #### Controller-Manager
 
-In the Kubernetes controller manager, we need to increase the value of three parameters: `node-cidr-mask-size`, `kube-api-burst` and `kube-api-qps`. `kube-api-burst` and `kube-api-qps` control the server side request bandwidth. `node-cidr-mask-size` represents the node CIDR. it needs to be increased as well in order to scale up to thousands of nodes. 
+在Kubernetes控制器管理器中,我们需要增加三个参数的值:`node-cidr-mask-size`、`kube-api-burst`和`kube-api-qps`。 `kube-api-burst`和`kube-api-qps`控制服务器端请求带宽。 `node-cidr-mask-size`表示节点 CIDR。 为了扩展到数千个节点,它也需要增加。
 
 
 Modify `/etc/kubernetes/manifest/kube-controller-manager.yaml`:
 
 ```
---node-cidr-mask-size=21 //log2(max number of pods in cluster)
+--node-cidr-mask-size=21 //log2(集群中的最大pod数)
 --kube-api-burst=3000
 --kube-api-qps=3000
 ```
 
 #### kubelet
 
-In single worker node, we can run 110 pods as default. But to get higher node resource utilization, we need to add some parameters in Kubelet launch command, and restart it.
+在单个工作节点中,我们可以默认运行110个pod。 但是为了获得更高的节点资源利用率,我们需要在Kubelet启动命令中添加一些参数,然后重启它。
 
-Modify start arg in `/etc/systemd/system/kubelet.service.d/10-kubeadm.conf`, add `--max-Pods=300` behind the start arg and restart
+修改`/etc/systemd/system/kubelet.service.d/10-kubeadm.conf`中的启动参数,在启动参数后面添加`--max-pods=300`并重启。
 
 ```
 systemctl daemon-reload
@@ -447,6 +446,6 @@ systemctl restart kubelet
 
 ---
 
-## Summary
+## 总结
 
-With Kubemark and Prometheus, we can easily run benchmark testing, collect YuniKorn metrics and analyze the performance. This helps us to identify the performance bottleneck in the scheduler and further eliminate them. The YuniKorn community will continue to improve these tools in the future, and continue to gain more performance improvements.
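+With Kubemark and Prometheus, we can easily run benchmark tests, collect YuniKorn metrics and analyze the performance. This helps us identify performance bottlenecks in the scheduler and eliminate them. The YuniKorn community will continue to improve these tools and to pursue further performance gains.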
diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/profiling.md b/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/profiling.md
new file mode 100644
index 000000000..eb2ae7442
--- /dev/null
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/current/performance/profiling.md
@@ -0,0 +1,115 @@
+---
+id: profiling
+title: Profiling
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
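+Using [pprof](https://github.com/google/pprof) for CPU and memory profiling helps you understand the runtime status of the YuniKorn scheduler. Profiling instruments have been added to the YuniKorn REST service, so profiles can easily be retrieved from HTTP endpoints and analyzed.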
+
+## CPU profiling
+
+At this step, make sure YuniKorn is already running, either locally via the `make run` command or deployed as a pod inside K8s. Then run
+
+```
+go tool pprof http://localhost:9080/debug/pprof/profile
+```
+
+The profile data is saved on the local file system; once that is done, pprof enters interactive mode. Now you can run profiling commands, such as
+
+```
+(pprof) top
+Showing nodes accounting for 14380ms, 44.85% of 32060ms total
+Dropped 145 nodes (cum <= 160.30ms)
+Showing top 10 nodes out of 106
+      flat  flat%   sum%        cum   cum%
+    2130ms  6.64%  6.64%     2130ms  6.64%  __tsan_read
+    1950ms  6.08% 12.73%     1950ms  6.08%  __tsan::MetaMap::FreeRange
+    1920ms  5.99% 18.71%     1920ms  5.99%  __tsan::MetaMap::GetAndLock
+    1900ms  5.93% 24.64%     1900ms  5.93%  racecall
+    1290ms  4.02% 28.67%     1290ms  4.02%  __tsan_write
+    1090ms  3.40% 32.06%     3270ms 10.20%  runtime.mallocgc
+    1080ms  3.37% 35.43%     1080ms  3.37%  __tsan_func_enter
+    1020ms  3.18% 38.62%     1120ms  3.49%  runtime.scanobject
+    1010ms  3.15% 41.77%     1010ms  3.15%  runtime.nanotime
+     990ms  3.09% 44.85%      990ms  3.09%  __tsan::DenseSlabAlloc::Refill
+```
+
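+You can type commands such as `web` or `gif` to get a graph that helps you better
+understand the overall performance of the critical code paths. The result looks
+like the following: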
+
+![CPU Profiling](./../assets/cpu_profile.jpg)
+
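+Note that in order to use these options, you need to install the visualization tool `graphviz` first. If you are using a Mac, simply run `brew install graphviz`. For more information, please refer [here](https://graphviz.gitlab.io/).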
+
+## Memory profiling
+
+Similarly, you can run
+
+```
+go tool pprof http://localhost:9080/debug/pprof/heap
+```
+
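+This returns a snapshot of the current heap, which lets us inspect memory usage. Once in interactive mode, you can run some useful commands. For example, `top` lists the objects that consume the most memory.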
+```
+(pprof) top
+Showing nodes accounting for 83.58MB, 98.82% of 84.58MB total
+Showing top 10 nodes out of 86
+      flat  flat%   sum%        cum   cum%
+      32MB 37.84% 37.84%       32MB 37.84%  github.com/apache/yunikorn-core/pkg/cache.NewClusterInfo
+      16MB 18.92% 56.75%       16MB 18.92%  github.com/apache/yunikorn-core/pkg/rmproxy.NewRMProxy
+      16MB 18.92% 75.67%       16MB 18.92%  github.com/apache/yunikorn-core/pkg/scheduler.NewScheduler
+      16MB 18.92% 94.59%       16MB 18.92%  github.com/apache/yunikorn-k8shim/pkg/dispatcher.init.0.func1
+    1.04MB  1.23% 95.81%     1.04MB  1.23%  k8s.io/apimachinery/pkg/runtime.(*Scheme).AddKnownTypeWithName
+    0.52MB  0.61% 96.43%     0.52MB  0.61%  github.com/gogo/protobuf/proto.RegisterType
+    0.51MB  0.61% 97.04%     0.51MB  0.61%  sync.(*Map).Store
+    0.50MB   0.6% 97.63%     0.50MB   0.6%  regexp.onePassCopy
+    0.50MB  0.59% 98.23%     0.50MB  0.59%  github.com/json-iterator/go.(*Iterator).ReadString
+    0.50MB  0.59% 98.82%     0.50MB  0.59%  text/template/parse.(*Tree).newText
+```
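
The heap snapshot served by the endpoint is the same kind of data a Go program can write for itself with `runtime/pprof`. The sketch below is a generic illustration, not YuniKorn code. The helper names and the output file name `heap.pb.gz` are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"path/filepath"
	"runtime"
	"runtime/pprof"
)

// heapSnapshot returns the current heap profile in the gzip-compressed
// protobuf format that `go tool pprof` reads.
func heapSnapshot() ([]byte, error) {
	runtime.GC() // refresh allocation statistics before taking the snapshot
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// isGzip reports whether data starts with the gzip magic bytes.
func isGzip(data []byte) bool {
	return len(data) >= 2 && data[0] == 0x1f && data[1] == 0x8b
}

func main() {
	data, err := heapSnapshot()
	if err != nil {
		panic(err)
	}

	// Hypothetical output location, chosen only for this sketch.
	path := filepath.Join(os.TempDir(), "heap.pb.gz")
	if err := os.WriteFile(path, data, 0o644); err != nil {
		panic(err)
	}
	fmt.Printf("wrote %d bytes to %s (gzip=%v)\n", len(data), path, isGzip(data))
}
```

A file written this way can be opened with `go tool pprof <path>`, in the same way as a profile fetched from the HTTP endpoint.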
+
+You can also run the `web`, `pdf` or `gif` command to get a graph of the heap.
+
+## Download profiling samples and analyze them locally
+
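+We have included the essential go/go-tool binaries in the scheduler docker image, so you should be able to do some basic profiling
+analysis inside the docker container. However, if you want to dig into an issue, it may be better to do the analysis
+locally. In that case, first copy the sample file to your local environment. The command to copy files is as follows: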
+
+```
+kubectl cp ${SCHEDULER_POD_NAME}:${SAMPLE_PATH_IN_DOCKER_CONTAINER} ${LOCAL_COPY_PATH}
+```
+
+For example:
+
+```
+kubectl cp yunikorn-scheduler-cf8f8dd8-6szh5:/root/pprof/pprof.k8s_yunikorn_scheduler.samples.cpu.001.pb.gz /Users/wyang/Downloads/pprof.k8s_yunikorn_scheduler.samples.cpu.001.pb.gz
+```
+
+Once the file is in your local environment, you can run the `pprof` command for analysis.
+
+```
+go tool pprof /Users/wyang/Downloads/pprof.k8s_yunikorn_scheduler.samples.cpu.001.pb.gz
+```
+
+## Resources
+
+* pprof documentation: https://github.com/google/pprof/tree/master/doc