Posted to reviews@yunikorn.apache.org by ww...@apache.org on 2021/06/29 23:05:24 UTC
[incubator-yunikorn-site] branch master updated: [YUNIKORN-728] Document Soft/Hard scheduling styles (#61)

This is an automated email from the ASF dual-hosted git repository.

wwei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-yunikorn-site.git


The following commit(s) were added to refs/heads/master by this push:
     new 53bbc15  [YUNIKORN-728] Document Soft/Hard scheduling styles (#61)
53bbc15 is described below

commit 53bbc15259c2c683b3c2f9ec83053f7a8017f0da
Author: Kinga Marton <km...@apache.org>
AuthorDate: Wed Jun 30 01:05:16 2021 +0200

    [YUNIKORN-728] Document Soft/Hard scheduling styles (#61)
---
 docs/user_guide/gang_scheduling.md | 66 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/docs/user_guide/gang_scheduling.md b/docs/user_guide/gang_scheduling.md
index abdfe7b..47b5722 100644
--- a/docs/user_guide/gang_scheduling.md
+++ b/docs/user_guide/gang_scheduling.md
@@ -101,6 +101,14 @@ could not schedule all the placeholder pods, it will eventually give up after a
 freed up and used by other apps. If none of the placeholders can be allocated, this timeout won't kick in. To avoid the placeholder
 pods getting stuck forever, please refer to [troubleshooting](trouble_shooting.md#gang-scheduling) for solutions.
 
+`gangSchedulingStyle`
+
+Valid values: *Soft*, *Hard*
+
+Default value: *Soft*.
+This parameter defines the fallback behavior when the app runs into gang scheduling issues caused by placeholder pod allocation.
+See the [Gang Scheduling styles](#gang-scheduling-styles) section for more details.
+
 More scheduling parameters will be added in order to provide more flexibility while scheduling apps.
 
 #### Example
@@ -206,6 +214,62 @@ Annotations:
 Once the job is submitted to the scheduler, the job won’t be scheduled immediately.
 Instead, the scheduler will ensure it gets its minimal resources before actually starting the driver/executors. 
 
+## Gang Scheduling Styles
+
+There are two gang scheduling styles supported: Soft and Hard. The style is configured per app and defines how the app behaves if gang scheduling fails.
+
+- `Hard style`: this style keeps the initial behavior: if the application cannot be scheduled according to gang scheduling rules and it times out, it is marked as failed, with no further attempt to schedule it.
+- `Soft style`: when the app cannot be gang scheduled, it falls back to normal scheduling, and the non-gang scheduling strategy is used to achieve best-effort scheduling. When this happens, the app transitions to the Resuming state and all remaining placeholder pods are cleaned up.
+
+**Default style used**: `Soft`
+
+**Enable a specific style**: the style can be changed by setting the `gangSchedulingStyle` parameter to Soft or Hard in the application definition.
+
+#### Example
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: gang-app-timeout
+spec:
+  completions: 4
+  parallelism: 4
+  template:
+    metadata:
+      labels:
+        app: sleep
+        applicationId: gang-app-timeout
+        queue: fifo
+      annotations:
+        yunikorn.apache.org/task-group-name: sched-style
+        yunikorn.apache.org/schedulingPolicyParameters: "placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard"
+        yunikorn.apache.org/task-groups: |-
+          [{
+              "name": "sched-style",
+              "minMember": 4,
+              "minResource": {
+                "cpu": "1",
+                "memory": "1000M"
+              },
+              "nodeSelector": {},
+              "tolerations": []
+          }]
+    spec:
+      schedulerName: yunikorn
+      restartPolicy: Never
+      containers:
+        - name: sleep30
+          image: "alpine:latest"
+          imagePullPolicy: "IfNotPresent"
+          command: ["sleep", "30"]
+          resources:
+            requests:
+              cpu: "1"
+              memory: "1000M"
+
+```
+
 ## Verify Configuration
 
 To verify if the configuration has been done completely and correctly, check the following things:
@@ -218,4 +282,4 @@ Check field including: namespace, pod resources, node-selector, and toleration.
 
 ## Troubleshooting
 
-Please see the troubleshooting doc when gang scheduling is enabled [here](trouble_shooting.md#gang-scheduling).
\ No newline at end of file
+Please see the troubleshooting doc when gang scheduling is enabled [here](trouble_shooting.md#gang-scheduling).
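
The `yunikorn.apache.org/schedulingPolicyParameters` annotation documented in this commit is a space-separated list of `key=value` pairs. The following is a minimal Python sketch of how such a string could be read and how the two styles differ on timeout. It is not YuniKorn's implementation: the names `GangParams`, `parse_policy_params` and `outcome_on_gang_timeout` are hypothetical, and raising an error on an invalid style is an assumption, since the documentation above does not say how the scheduler treats invalid values.

```python
from dataclasses import dataclass
from typing import Optional

# Valid values and the default are taken from the documentation above.
VALID_STYLES = ("Soft", "Hard")
DEFAULT_STYLE = "Soft"


@dataclass
class GangParams:
    """Hypothetical container for the parsed annotation values."""
    placeholder_timeout_seconds: Optional[int] = None  # None: not set
    gang_scheduling_style: str = DEFAULT_STYLE


def parse_policy_params(annotation: str) -> GangParams:
    """Parse a space-separated 'key=value' parameter string."""
    params = GangParams()
    for token in annotation.split():
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {token!r}")
        if key == "placeholderTimeoutInSeconds":
            params.placeholder_timeout_seconds = int(value)
        elif key == "gangSchedulingStyle":
            # Assumption: reject unknown styles instead of guessing.
            if value not in VALID_STYLES:
                raise ValueError(f"invalid gangSchedulingStyle: {value!r}")
            params.gang_scheduling_style = value
        # Other keys are ignored in this sketch.
    return params


def outcome_on_gang_timeout(style: str) -> str:
    """Return the app state described above once gang scheduling times out."""
    if style == "Hard":
        return "Failed"    # marked as failed, no further scheduling attempt
    return "Resuming"      # Soft: fall back to normal scheduling


if __name__ == "__main__":
    parsed = parse_policy_params(
        "placeholderTimeoutInSeconds=60 gangSchedulingStyle=Hard"
    )
    print(parsed)
    print(outcome_on_gang_timeout(parsed.gang_scheduling_style))
```

Omitting `gangSchedulingStyle` from the string leaves the style at the documented default, `Soft`.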