Posted to reviews@yunikorn.apache.org by yu...@apache.org on 2022/12/30 08:26:48 UTC

[yunikorn-site] branch master updated: [YUNIKORN-1503] Chinese translation of troubleshooting (#244)

This is an automated email from the ASF dual-hosted git repository.

yuchaoran pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-site.git


The following commit(s) were added to refs/heads/master by this push:
     new 1cc016cbf [YUNIKORN-1503] Chinese translation of troubleshooting (#244)
1cc016cbf is described below

commit 1cc016cbfc8598829f31a9ac53e15760c10e8e2b
Author: wusamzong <48...@users.noreply.github.com>
AuthorDate: Fri Dec 30 16:26:43 2022 +0800

    [YUNIKORN-1503] Chinese translation of troubleshooting (#244)
---
 .../current/user_guide/troubleshooting.md          | 183 ++++++++++-----------
 1 file changed, 90 insertions(+), 93 deletions(-)

diff --git a/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/troubleshooting.md b/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/troubleshooting.md
index 6bd02835e..4b47d6b0a 100644
--- a/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/troubleshooting.md
+++ b/i18n/zh-cn/docusaurus-plugin-content-docs/current/user_guide/troubleshooting.md
@@ -1,6 +1,6 @@
 ---
 id: troubleshooting
-title: Troubleshooting
+title: Troubleshooting
 ---
 
 <!--
@@ -20,73 +20,67 @@ title: Troubleshooting
  * See the License for the specific language governing permissions and
  * limitations under the License.
  -->
- 
-## Scheduler logs
 
-### Retrieve scheduler logs
+## Scheduler logs
 
-Currently, the scheduler writes its logs to stdout/stderr, docker container handles the redirection of these logs to a
-local location on the underneath node, you can read more document [here](https://docs.docker.com/config/containers/logging/configure/).
-These logs can be retrieved by [kubectl logs](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#logs). Such as:
+### Retrieve scheduler logs
+
+The scheduler writes its logs to stdout/stderr, and the docker container redirects these logs to a local location on the underlying node; you can read more in the documentation [here](https://docs.docker.com/config/containers/logging/configure/). These logs can be retrieved with [kubectl logs](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#logs). For example:
 
 ```shell script
-// get the scheduler pod
+// get the scheduler pod
 kubectl get pod -l component=yunikorn-scheduler -n yunikorn
-NAME                                  READY   STATUS    RESTARTS   AGE
-yunikorn-scheduler-766d7d6cdd-44b82   2/2     Running   0          33h
+NAME                                  READY   STATUS    RESTARTS   AGE
+yunikorn-scheduler-766d7d6cdd-44b82   2/2     Running   0          33h
 
-// retrieve logs
+// retrieve logs
 kubectl logs yunikorn-scheduler-766d7d6cdd-44b82 yunikorn-scheduler-k8s -n yunikorn
 ```
 
-In most cases, this command cannot get all logs because the scheduler is rolling logs very fast. To retrieve more logs in
-the past, you will need to setup the [cluster level logging](https://kubernetes.io/docs/concepts/cluster-administration/logging/#cluster-level-logging-architectures).
-The recommended setup is to leverage [fluentd](https://www.fluentd.org/) to collect and persistent logs on an external storage, e.g s3. 
+In most cases, this command cannot retrieve all of the logs because the scheduler produces a large volume of them; you will need to set up [cluster-level logging](https://kubernetes.io/docs/concepts/cluster-administration/logging/#cluster-level-logging-architectures). The recommended setup is to use [fluentd](https://www.fluentd.org/) to collect and persist the logs on external storage, e.g. s3.
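+
+Until such a log collection pipeline is in place, a few standard `kubectl logs` flags can still help pull a bit more history; a minimal sketch, reusing the example pod and container names from above:
+
+```shell script
+# stream new scheduler log lines as they are written
+kubectl logs -f yunikorn-scheduler-766d7d6cdd-44b82 yunikorn-scheduler-k8s -n yunikorn
+
+# limit the output to the most recent lines
+kubectl logs yunikorn-scheduler-766d7d6cdd-44b82 yunikorn-scheduler-k8s -n yunikorn --tail=1000
+
+# read the logs of the previous container instance after a scheduler restart
+kubectl logs yunikorn-scheduler-766d7d6cdd-44b82 yunikorn-scheduler-k8s -n yunikorn --previous
+```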
 
-### Set Logging Level
+### Set Logging Level
+
 :::note
-Changing the logging level requires a restart of the scheduler pod.
+We recommend adjusting the log level through the REST API, so the scheduler pod does not need to be restarted every time the level changes. Setting the log level by editing the deployment config requires a restart of the scheduler pod and is therefore strongly discouraged.
 :::
 
-Stop the scheduler:
-
+Stop the scheduler:
 ```shell script
 kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=0
 ```
-edit the deployment config in vim:
 
+Edit the deployment config in vim:
 ```shell script
 kubectl edit deployment yunikorn-scheduler -n yunikorn
 ```
 
-add `LOG_LEVEL` to the `env` field of the container template. For example setting `LOG_LEVEL` to `0` sets the logging
-level to `INFO`.
+Add `LOG_LEVEL` to the `env` field of the container template. For example, setting `LOG_LEVEL` to `0` sets the logging level to `INFO`.
 
 ```yaml
 apiVersion: extensions/v1beta1
 kind: Deployment
 metadata:
- ...
+ ...
 spec:
-  template: 
-   ...
-    spec:
-      containers:
-      - env:
-        - name: LOG_LEVEL
-          value: '0'
+  template: 
+   ...
+    spec:
+      containers:
+      - env:
+        - name: LOG_LEVEL
+          value: '0'
 ```
 
-Start the scheduler:
-
+Start the scheduler:
 ```shell script
 kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=1
 ```
 
-Available logging levels:
+Available logging levels:
 
-| Value 	| Logging Level 	|
+| Value 	| Logging Level 	|
 |:-----:	|:-------------:	|
 |   -1  	|     DEBUG     	|
 |   0   	|      INFO     	|
@@ -96,97 +90,100 @@ Available logging levels:
 |   4   	|     Panic     	|
 |   5   	|     Fatal     	|
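+
+If you do choose the deployment route, the edit-and-restart sequence above can also be collapsed into a single `kubectl set env` command, which patches the environment variable and rolls the scheduler pod in one step; a sketch, assuming the same deployment name and namespace as above:
+
+```shell script
+# set the scheduler log level to DEBUG (-1); this triggers a rolling restart of the pod
+kubectl set env deployment/yunikorn-scheduler LOG_LEVEL=-1 -n yunikorn
+```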
 
-## Pods are stuck at Pending state
+## Pods are stuck in the `Pending` state
 
-If some pods are stuck at Pending state, that means the scheduler could not find a node to allocate the pod. There are
-several possibilities to cause this:
+If pods are stuck in the Pending state, it means the scheduler could not find a node to allocate them to. There are several possible causes:
 
-### 1. Non of the nodes satisfy pod placement requirement
+### 1. None of the nodes satisfy the pod's placement requirements
 
-A pod can be configured with some placement constraints, such as [node-selector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector),
-[affinity/anti-affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity),
-do not have certain toleration for node [taints](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/), etc.
-To debug such issues, you can describe the pod by:
+A pod can be configured with placement constraints, such as a [node-selector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector), [affinity/anti-affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity), or missing tolerations for node [taints](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/). To debug such issues, describe the pod with:
 
 ```shell script
-kubectl describe pod <pod-name> -n <namespace>
+kubectl describe pod <pod-name> -n <namespace>
 ```
 
-the pod events will contain the predicate failures and that explains why nodes are not qualified for allocation.
+The pod events will contain the predicate failures, which explain why the nodes are not qualified for allocation.
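+
+Besides `kubectl describe pod`, the scheduling-related events of a single pod can also be listed directly; a small sketch with placeholder names:
+
+```shell script
+# show only the events that reference this pod (<pod-name> and <namespace> are placeholders)
+kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
+```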
 
-### 2. The queue is running out of capacity
+### 2. The queue is running out of capacity
 
-If the queue is running out of capacity, pods will be pending for available queue resources. To check if a queue is still
-having enough capacity for the pending pods, there are several approaches:
+If the queue is running out of capacity, pods will wait for queue resources to become available. There are several ways to check whether the queue still has enough capacity for the pending pods:
 
-1) check the queue usage from yunikorn UI
+1) Check the queue usage from the YuniKorn UI
 
-If you do not know how to access the UI, you can refer the document [here](../get_started/get_started.md#访问-web-ui). Go
-to the `Queues` page, navigate to the queue where this job is submitted to. You will be able to see the available capacity
-left for the queue.
+If you do not know how to access the UI, refer to the documentation [here](../get_started/get_started.md#访问-web-ui). On the `Queues` page, navigate to the queue the job was submitted to. You will be able to see the remaining available capacity of the queue.
 
-2) check the pod events
+2) Check the pod events
 
-Run the `kubectl describe pod` to get the pod events. If you see some event like:
-`Application <appID> does not fit into <queuePath> queue`. That means the pod could not get allocated because the queue
-is running out of capacity.
+Run `kubectl describe pod` to get the pod events. If you see an event like `Application <appID> does not fit into <queuePath> queue`, it means the pod could not be allocated because the queue is running out of capacity.
 
-The pod will be allocated if some other pods in this queue is completed or removed. If the pod remains pending even
-the queue has capacity, that may because it is waiting for the cluster to scale up.
+The pod will be allocated once other pods in the queue complete or are removed. If the pod remains pending even though the queue has enough capacity left, it may be waiting for the cluster to scale up.
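+
+The queue capacities and current usage can also be inspected without the UI, through the scheduler REST API; a sketch that assumes the default partition and the partition queues endpoint described in the scheduler REST API documentation (api/scheduler.md), so verify the exact path for your version:
+
+```shell script
+# dump the queue hierarchy of the default partition, including capacities and current usage
+curl -s http://localhost:9889/ws/v1/partition/default/queues
+```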
 
-## Restart the scheduler
+## Retrieve the full state dump
 
-YuniKorn can recover its state upon a restart. YuniKorn scheduler pod is deployed as a deployment, restart the scheduler
-can be done by scale down and up the replica:
+The YuniKorn state store contains the state of every object in each process. Retrieving it through the endpoint gives us a lot of information that is useful for troubleshooting, for example: the list of partitions, the list of applications (including details of running, completed and historical applications), the number of nodes, node utilization, generic cluster information, cluster utilization details, container history and queue information.
 
-```shell script
-kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=0
-kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=1
-```
+The state dump is a valuable resource that YuniKorn provides for troubleshooting.
+
+There are several ways to retrieve the full state dump:
+
+### 1. Scheduler URL
+
+Steps:
+* Open the YuniKorn UI in your browser and edit the URL:
+* Replace `/#/dashboard` with `/ws/v1/fullstatedump` (for example, `http://localhost:9889/ws/v1/fullstatedump`)
+* Press enter.
 
-## Gang Scheduling
+This simple approach gives a live view of the complete current state.
 
-### 1. No placeholders created, app's pods are pending
+### 2. Scheduler REST API
 
-*Reason*: This is usually because the app is rejected by the scheduler, therefore non of the pods are scheduled.
-The common reasons caused the rejection are: 1) The taskGroups definition is invalid. The scheduler does the
-sanity check upon app submission, to ensure all the taskGroups are defined correctly, if these info are malformed,
-the scheduler rejects the app; 2) The total min resources defined in the taskGroups is bigger than the queues' max
-capacity, scheduler rejects the app because it won't fit into the queue's capacity. Check the pod event for relevant messages,
-and you will also be able to find more detail error messages from the schedulers' log.
+Using the following scheduler REST API call shows the full state of YuniKorn.
 
-*Solution*: Correct the taskGroups definition and retry submitting the app. 
+`curl -X 'GET' http://localhost:9889/ws/v1/fullstatedump -H 'accept: application/json'`
+
+For more information about the state dump, see the documentation on [retrieving the full state dump](api/scheduler.md#retrieve-full-state-dump).
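+
+Because the full state dump can be large, it is often easier to save it to a file and inspect it offline; a minimal sketch, assuming `jq` is installed for pretty-printing:
+
+```shell script
+# save a pretty-printed copy of the full state dump for later analysis
+curl -s http://localhost:9889/ws/v1/fullstatedump | jq . > yunikorn-fullstatedump.json
+```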
+
+## Restart the scheduler
+:::note
+As a best practice, restarting the scheduler should be the last resort when nothing else has worked; it should never be done before all logs and state dumps have been collected.
+:::
+
+YuniKorn can recover its state after a restart. The YuniKorn scheduler pod is deployed as a deployment, so the scheduler can be restarted by scaling the replica count down and back up, as follows:
+
+```shell script
+kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=0
+kubectl scale deployment yunikorn-scheduler -n yunikorn --replicas=1
+```
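+
+To confirm the scheduler came back cleanly after scaling up, the standard rollout and pod checks can be used; a small sketch reusing the deployment name and label from above:
+
+```shell script
+# wait until the deployment reports the new replica as available
+kubectl rollout status deployment/yunikorn-scheduler -n yunikorn
+
+# verify the scheduler pod is Running again
+kubectl get pod -l component=yunikorn-scheduler -n yunikorn
+```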
 
-### 2. Not all placeholders can be allocated
+## Gang Scheduling
+### 1. No placeholders are created and the app's pods are pending
+*Reason*: this is usually because the application has been rejected by the scheduler, so none of its pods are scheduled.
 
-*Reason*: The placeholders also consume resources, if not all of them can be allocated, that usually means either the queue
-or the cluster has no sufficient resources for them. In this case, the placeholders will be cleaned up after a certain
-amount of time, defined by the `placeholderTimeoutInSeconds` scheduling policy parameter.
+Common reasons for the rejection are:
 
-*Solution*: Note, if the placeholder timeout reaches, currently the app will transit to failed state and can not be scheduled
-anymore. You can increase the placeholder timeout value if you are willing to wait for a longer time. In the future, a fallback policy
-might be added to provide some retry other than failing the app.
+1) The taskGroups definition is invalid. The scheduler performs a sanity check when the application is submitted to ensure all task groups are defined correctly; if this information is malformed, the scheduler rejects the application.
 
-### 3. Not all placeholders are swapped
+2) The total minimum resources defined in the taskGroups are larger than the queue's maximum capacity, so the scheduler rejects the application because it does not fit into the queue. Check the pod events for relevant messages, or the scheduler logs for more detailed error messages.
 
-*Reason*: This usually means the actual app's pods are less than the minMembers defined in the taskGroups.
+*Solution*: correct the taskGroups definition and resubmit the application.
 
-*Solution*: Check the `minMember` in the taskGroup field and ensure it is correctly set. The `minMember` can be less than
-the actual pods, setting it to bigger than the actual number of pods is invalid.
+### 2. Not all placeholders can be allocated
+*Reason*: the placeholders also consume resources; if not all of them can be allocated, it usually means either the queue or the cluster does not have sufficient resources for them. In this case, the placeholders will be cleaned up after a certain amount of time, defined by the `placeholderTimeoutInSeconds` scheduling policy parameter.
 
-### 4.Placeholders are not cleaned up when the app terminated
+*Solution*: note that when the placeholder timeout is reached, the app currently transitions to the failed state and can no longer be scheduled. You can increase the placeholder timeout value if you are willing to wait longer. In the future, a fallback policy may be added to retry instead of failing the app.
 
-*Reason*: All the placeholders are set an [ownerReference](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents)
-to the first real pod of the app, or the controller reference. If the placeholder could not be cleaned up, that means
-the garbage collection is not working properly. 
+### 3. Not all placeholders are swapped
+*Reason*: this usually means the app's actual pods are fewer than the minimum members (`minMember`) defined in the taskGroups.
 
-*Solution*: check the placeholder `ownerReference` and the garbage collector in Kubernetes.    
+*Solution*: check the `minMember` in the taskGroup field and ensure it is set correctly. The `minMember` can be smaller than the actual number of pods; setting it larger than the actual number of pods is invalid.
 
+### 4. Placeholders are not cleaned up when the app terminates
+*Reason*: every placeholder has an [ownerReference](https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/#owners-and-dependents) set to the first real pod of the app, or to the controller reference. If the placeholders cannot be cleaned up, it means garbage collection is not working properly.
 
-## Still got questions?
+*Solution*: check the placeholder's `ownerReference` and the garbage collector in Kubernetes.
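+
+A quick way to inspect a leftover placeholder's owner is a `jsonpath` query; a sketch with placeholder names:
+
+```shell script
+# print the ownerReferences of a stuck placeholder pod (<placeholder-pod> and <namespace> are placeholders)
+kubectl get pod <placeholder-pod> -n <namespace> -o jsonpath='{.metadata.ownerReferences}'
+```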
 
-No problem! The Apache YuniKorn community will be happy to help. You can reach out to the community with the following options:
+## Still got questions?
+No problem! The Apache YuniKorn community will be happy to help. You can reach out to the community through the following options:
 
-1. Post your questions to dev@yunikorn.apache.org
-2. Join the [YuniKorn slack channel](https://join.slack.com/t/yunikornworkspace/shared_invite/enQtNzAzMjY0OTI4MjYzLTBmMDdkYTAwNDMwNTE3NWVjZWE1OTczMWE4NDI2Yzg3MmEyZjUyYTZlMDE5M2U4ZjZhNmYyNGFmYjY4ZGYyMGE) and post your questions to the `#yunikorn-user` channel.
-3. Join the [community sync up meetings](http://yunikorn.apache.org/community/getInvolved#community-meetings) and directly talk to the community members. 
\ No newline at end of file
+1. Post your questions to dev@yunikorn.apache.org.
+2. Join the [YuniKorn slack](https://join.slack.com/t/yunikornworkspace/shared_invite/enQtNzAzMjY0OTI4MjYzLTBmMDdkYTAwNDMwNTE3NWVjZWE1OTczMWE4NDI2Yzg3MmEyZjUyYTZlMDE5M2U4ZjZhNmYyNGFmYjY4ZGYyMGE) and post your questions in the `#yunikorn-user` channel.
+3. Join the [community sync up meetings](http://yunikorn.apache.org/community/get_involved#community-meetings) and talk to the community members directly.
\ No newline at end of file