Posted to reviews@yunikorn.apache.org by wi...@apache.org on 2022/12/19 05:06:43 UTC
[yunikorn-site] branch master updated: [YUNIKORN-1418] Troubleshooting page update (#231)

This is an automated email from the ASF dual-hosted git repository.

wilfreds pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-site.git


The following commit(s) were added to refs/heads/master by this push:
     new 07ed65931 [YUNIKORN-1418] Troubleshooting page update (#231)
07ed65931 is described below

commit 07ed6593134b8856c0c4f7315a774e49772ff5ba
Author: Jagadeesan A S <ja...@gmail.com>
AuthorDate: Mon Dec 19 16:03:11 2022 +1100

    [YUNIKORN-1418] Troubleshooting page update (#231)
    
    Add full state dump to troubleshooting page. Use the REST API to change
    log level to remove the need for a restart of the scheduler.
    Rename "Trouble Shooting" to Troubleshooting
    Fix multiple page URLs that were broken on the website
    
    Fixed link
    
    Closes: #231
    
    Signed-off-by: Wilfred Spiegelenburg <wi...@apache.org>
---
 docs/performance/performance_tutorial.md           |  2 +-
 docs/user_guide/gang_scheduling.md                 |  4 +--
 .../{trouble_shooting.md => troubleshooting.md}    | 37 +++++++++++++++++++---
 sidebars.js                                        |  2 +-
 4 files changed, 37 insertions(+), 8 deletions(-)

diff --git a/docs/performance/performance_tutorial.md b/docs/performance/performance_tutorial.md
index b17923838..51c607faa 100644
--- a/docs/performance/performance_tutorial.md
+++ b/docs/performance/performance_tutorial.md
@@ -87,7 +87,7 @@ Before going into the details, here are the general steps used in our tests:
 - [Step 2](#Setup-Kubemark): Deploy hollow pods,which will simulate worker nodes, name hollow nodes. After all hollow nodes in ready status, we need to cordon all native nodes, which are physical presence in the cluster, not the simulated nodes, to avoid we allocated test workload pod to native nodes.
 - [Step 3](#Deploy-YuniKorn): Deploy YuniKorn using the Helm chart on the master node, and scale down the Deployment to 0 replica, and [modify the port](#Setup-Prometheus) in `prometheus.yml` to match the port of the service.
 - [Step 4](#Run-tests): Deploy 50k Nginx pods for testing, and the API server will create them. But since the YuniKorn scheduler Deployment has been scaled down to 0 replica, all Nginx pods will be stuck in pending.
-- [Step 5](../user_guide/trouble_shooting.md#restart-the-scheduler): Scale up The YuniKorn Deployment back to 1 replica, and cordon the master node to avoid YuniKorn allocating Nginx pods there. In this step, YuniKorn will start collecting the metrics.
+- [Step 5](../user_guide/troubleshooting.md#restart-the-scheduler): Scale up The YuniKorn Deployment back to 1 replica, and cordon the master node to avoid YuniKorn allocating Nginx pods there. In this step, YuniKorn will start collecting the metrics.
 - [Step 6](#Collect-and-Observe-YuniKorn-metrics): Observe the metrics exposed in Prometheus UI.
 ---
 
diff --git a/docs/user_guide/gang_scheduling.md b/docs/user_guide/gang_scheduling.md
index f7593a573..678aec0a1 100644
--- a/docs/user_guide/gang_scheduling.md
+++ b/docs/user_guide/gang_scheduling.md
@@ -99,7 +99,7 @@ This parameter defines the reservation timeout for how long the scheduler should
 The timeout timer starts to tick when the scheduler *allocates the first placeholder pod*. This ensures if the scheduler
 could not schedule all the placeholder pods, it will eventually give up after a certain amount of time. So that the resources can be
 freed up and used by other apps. If non of the placeholders can be allocated, this timeout won't kick-in. To avoid the placeholder
-pods stuck forever, please refer to [troubleshooting](trouble_shooting.md#gang-scheduling) for solutions.
+pods stuck forever, please refer to [troubleshooting](troubleshooting.md#gang-scheduling) for solutions.
 
 ` gangSchedulingStyle`
 
@@ -285,4 +285,4 @@ Check field including: namespace, pod resources, node-selector, toleration and a
 
 ## Troubleshooting
 
-Please see the troubleshooting doc when gang scheduling is enabled [here](trouble_shooting.md#gang-scheduling).
+Please see the troubleshooting doc when gang scheduling is enabled [here](troubleshooting.md#gang-scheduling).
diff --git a/docs/user_guide/trouble_shooting.md b/docs/user_guide/troubleshooting.md
similarity index 80%
rename from docs/user_guide/trouble_shooting.md
rename to docs/user_guide/troubleshooting.md
index 549d5e0e3..9da841852 100644
--- a/docs/user_guide/trouble_shooting.md
+++ b/docs/user_guide/troubleshooting.md
@@ -1,6 +1,6 @@
 ---
-id: trouble_shooting
-title: Trouble Shooting
+id: troubleshooting
+title: Troubleshooting
 ---
 
 <!--
@@ -46,7 +46,7 @@ The recommended setup is to leverage [fluentd](https://www.fluentd.org/) to coll
 ### Set Logging Level
 
 :::note
-Changing the logging level requires a restart of the scheduler pod.
+We recommend changing the log level through a REST API call, because this does not require a restart of the scheduler pod. Changing the logging level by editing the deployment config requires a restart of the scheduler pod and is not the recommended approach.
 :::
 
 Stop the scheduler:
@@ -134,8 +134,37 @@ is running out of capacity.
 The pod will be allocated if some other pods in this queue is completed or removed. If the pod remains pending even
 the queue has capacity, that may because it is waiting for the cluster to scale up.
 
+## Obtain full state dump
+
+A YuniKorn state dump contains every state object for each process that is dumped. The state dump endpoint returns a lot of information that is useful for troubleshooting in a single response, for example: the list of partitions, the list of applications (running, completed and historical application details), the number of nodes, node utilization, generic cluster information, cluster utilization details, container history and queue information.
+
+The state dump is a valuable resource that YuniKorn offers for use while troubleshooting.
+
+There are a few ways to obtain the full state dump.
+
+### 1. Scheduler URL
+
+STEPS:
+* Open the scheduler URL in a browser window or tab.
+* Replace `/#/dashboard` in the URL with `/ws/v1/fullstatedump` (for example, `http://localhost:9889/ws/v1/fullstatedump`).
+* Press Enter.
+
+The browser then displays the live full state dump, which is an easy way to view it.
+
+### 2. Scheduler REST API  
+
+The scheduler REST API call below returns the full state dump of the YuniKorn scheduler.
+
+`curl -X 'GET' http://localhost:9889/ws/v1/fullstatedump -H 'accept: application/json'`
+
+For more details on the content of the state dump, please refer to the documentation on [retrieve-full-state-dump](api/scheduler.md#retrieve-full-state-dump).
+
 ## Restart the scheduler
 
+:::note
+In accordance with best practices for troubleshooting, restarting the scheduler should only be done as a last resort to get everything back up and running. It should never be done before gathering all logs and state dumps.
+:::
+
 YuniKorn can recover its state upon a restart. YuniKorn scheduler pod is deployed as a deployment, restart the scheduler
 can be done by scale down and up the replica:
 
@@ -189,4 +218,4 @@ No problem! The Apache YuniKorn community will be happy to help. You can reach o
 
 1. Post your questions to dev@yunikorn.apache.org
 2. Join the [YuniKorn slack channel](https://join.slack.com/t/yunikornworkspace/shared_invite/enQtNzAzMjY0OTI4MjYzLTBmMDdkYTAwNDMwNTE3NWVjZWE1OTczMWE4NDI2Yzg3MmEyZjUyYTZlMDE5M2U4ZjZhNmYyNGFmYjY4ZGYyMGE) and post your questions to the `#yunikorn-user` channel.
-3. Join the [community sync up meetings](http://yunikorn.apache.org/community/getInvolved#community-meetings) and directly talk to the community members. 
\ No newline at end of file
+3. Join the [community sync up meetings](http://yunikorn.apache.org/community/get_involved#community-meetings) and directly talk to the community members. 
diff --git a/sidebars.js b/sidebars.js
index e68867f9e..c34ae6738 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -53,7 +53,7 @@ module.exports = {
                     'api/system'
                 ]
             },
-            'user_guide/trouble_shooting'
+            'user_guide/troubleshooting'
         ],
         'Developer Guide': [
             'developer_guide/env_setup',
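
The "Obtain full state dump" section added by this commit retrieves the dump with a browser or `curl`. As a rough illustration only, the same call could be scripted so that the dump is saved before any restart. This is a minimal sketch and not part of the commit: it assumes the scheduler REST service is reachable at the example address `http://localhost:9889` used in the page, and the helper names (`fetch_state_dump`, `summarize`, `save_state_dump`) are hypothetical. `summarize` makes no assumption about the section names inside the dump; it only reports whatever top-level keys the response contains.

```python
import json
import urllib.request

# Assumption: same example address as in the troubleshooting page.
STATE_DUMP_URL = "http://localhost:9889/ws/v1/fullstatedump"


def fetch_state_dump(url=STATE_DUMP_URL, timeout=30):
    """Download the full state dump and return it as parsed JSON."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def summarize(dump):
    """Map each top-level key to its entry count (list/dict) or its type name."""
    summary = {}
    for key, value in dump.items():
        if isinstance(value, (list, dict)):
            summary[key] = len(value)
        else:
            summary[key] = type(value).__name__
    return summary


def save_state_dump(dump, path):
    """Write the dump to a file so it can be attached to a bug report."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(dump, handle, indent=2)
```

A typical use would be `dump = fetch_state_dump()`, then `save_state_dump(dump, "statedump.json")` (the file name is arbitrary), and finally printing `summarize(dump)` to get a quick overview of how large each section is.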