Posted to dev@yunikorn.apache.org by "Wilfred Spiegelenburg (Jira)" <ji...@apache.org> on 2021/01/27 02:57:00 UTC
[jira] [Resolved] (YUNIKORN-516) Yunikorn scheduler seems to be in deadlock state
[ https://issues.apache.org/jira/browse/YUNIKORN-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wilfred Spiegelenburg resolved YUNIKORN-516.
--------------------------------------------
Resolution: Duplicate
> Yunikorn scheduler seems to be in deadlock state
> ------------------------------------------------
>
> Key: YUNIKORN-516
> URL: https://issues.apache.org/jira/browse/YUNIKORN-516
> Project: Apache YuniKorn
> Issue Type: Bug
> Components: core - scheduler
> Reporter: Ayub Pathan
> Assignee: Wilfred Spiegelenburg
> Priority: Blocker
> Attachments: metrics, stack, yk.log
>
>
> Apply the job templates below to reproduce the issue.
> 1. First application with gang scheduling annotations
>
> {noformat}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: batch-sleep-job-1
> spec:
>   completions: 2
>   parallelism: 2
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "batch-sleep-job-1"
>         queue: root.sandbox
>       annotations:
>         yunikorn.apache.org/task-group-name: tg1
>         yunikorn.apache.org/task-groups: |-
>           [{
>             "name": "tg1",
>             "minMember": 2,
>             "minResource": {
>               "cpu": "100m",
>               "memory": "500M"
>             },
>             "nodeSelector": {},
>             "tolerations": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep300
>           image: "alpine:latest"
>           command: ["sleep", "300"]
>           resources:
>             requests:
>               cpu: "100m"
>               memory: "500M"
> {noformat}
>
> 2. Second application to the same task group
> {noformat}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: batch-sleep-job-2
> spec:
>   completions: 4
>   parallelism: 4
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "batch-sleep-job-2"
>         queue: root.sandbox
>       annotations:
>         yunikorn.apache.org/task-group-name: tg1
>         yunikorn.apache.org/task-groups: |-
>           [{
>             "name": "tg1",
>             "minMember": 2,
>             "minResource": {
>               "cpu": "100m",
>               "memory": "500M"
>             },
>             "nodeSelector": {},
>             "tolerations": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep300
>           image: "alpine:latest"
>           command: ["sleep", "300"]
>           resources:
>             requests:
>               cpu: "100m"
>               memory: "500M"
> {noformat}
>
> 3. Third application to the same task group
> {noformat}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: batch-sleep-job-3
> spec:
>   completions: 10
>   parallelism: 10
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "batch-sleep-job-3"
>         queue: root.sandbox
>       annotations:
>         yunikorn.apache.org/task-group-name: tg1
>         yunikorn.apache.org/task-groups: |-
>           [{
>             "name": "tg1",
>             "minMember": 3,
>             "minResource": {
>               "cpu": "100m",
>               "memory": "500M"
>             },
>             "nodeSelector": {},
>             "tolerations": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep300
>           image: "alpine:latest"
>           command: ["sleep", "300"]
>           resources:
>             requests:
>               cpu: "100m"
>               memory: "500M"
> {noformat}
> The 3rd application now remains in Pending state even though the placeholder apps were created and terminated.
> {noformat}
> NAME↑ READY STATUS RS CPU MEM %CPU/R %MEM/R %CPU/L %MEM/L IP NODE QOS AGE │
> │ batch-sleep-job-1-7lrd5 0/1 Completed 0 n/a n/a n/a n/a n/a n/a 100.100.142.208 ip-10-192-143-108.ca-central-1.compute.internal BU 18m │
> │ batch-sleep-job-1-lw4t9 0/1 Completed 0 n/a n/a n/a n/a n/a n/a 100.100.134.213 ip-10-192-136-201.ca-central-1.compute.internal BU 18m │
> │ batch-sleep-job-2-c95dg 0/1 Completed 0 n/a n/a n/a n/a n/a n/a 100.100.142.210 ip-10-192-143-108.ca-central-1.compute.internal BU 17m │
> │ batch-sleep-job-2-vnfjb 0/1 Completed 0 n/a n/a n/a n/a n/a n/a 100.100.142.211 ip-10-192-143-108.ca-central-1.compute.internal BU 17m │
> │ batch-sleep-job-2-x4mcz 0/1 Completed 0 n/a n/a n/a n/a n/a n/a 100.100.134.216 ip-10-192-136-201.ca-central-1.compute.internal BU 17m │
> │ batch-sleep-job-2-ztnfq 0/1 Completed 0 n/a n/a n/a n/a n/a n/a 100.100.134.217 ip-10-192-136-201.ca-central-1.compute.internal BU 17m │
> │ batch-sleep-job-3-7tp5t 0/0 Pending 0 n/a n/a n/a n/a n/a n/a n/a n/a BU 16m │
> │ batch-sleep-job-3-59mnj 0/0 Pending 0 n/a n/a n/a n/a n/a n/a n/a n/a BU 16m │
> │ batch-sleep-job-3-bm4fd 0/0 Pending 0 n/a n/a n/a n/a n/a n/a n/a n/a BU 16m │
> │ batch-sleep-job-3-c4mxg 0/0 Pending 0 n/a n/a n/a n/a n/a n/a n/a n/a BU 16m │
> │ batch-sleep-job-3-cljfj 0/0 Pending 0 n/a n/a n/a n/a n/a n/a n/a n/a BU 16m │
> │ batch-sleep-job-3-gcvnp 0/0 Pending 0 n/a n/a n/a n/a n/a n/a n/a n/a BU 16m │
> │ batch-sleep-job-3-gwgnn 0/0 Pending 0 n/a n/a n/a n/a n/a n/a n/a n/a BU 16m │
> │ batch-sleep-job-3-kj88t 0/0 Pending 0 n/a n/a n/a n/a n/a n/a n/a n/a BU 16m │
> │ batch-sleep-job-3-p8c7w 0/0 Pending 0 n/a n/a n/a n/a n/a n/a n/a n/a BU 16m │
> │ batch-sleep-job-3-td575 0/0 Pending 0 n/a n/a n/a n/a n/a n/a n/a n/a BU 16m{noformat}
> Attaching the [^stack] trace, [^yk.log] and the [^metrics] API response for reference. This is observed with the v0.10 build.
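
When reading the templates above, it can help to work out how much resource a task group asks for as a gang, that is minMember times minResource. The following is only a rough sketch for that arithmetic and is not YuniKorn code: the function names (cpu_millis, mem_bytes, gang_minimum) are hypothetical, and the quantity parsing handles just a small subset of the Kubernetes suffixes. The annotation value is copied from the batch-sleep-job-3 template.

```python
import json

# Annotation value copied from the batch-sleep-job-3 template in the report.
TASK_GROUPS = """
[{
  "name": "tg1",
  "minMember": 3,
  "minResource": {"cpu": "100m", "memory": "500M"},
  "nodeSelector": {},
  "tolerations": []
}]
"""

# Assumption: only these memory suffixes are handled. Two-letter binary
# suffixes come first so that "Mi" is matched before "M".
_MEM_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30,
    "k": 10**3, "M": 10**6, "G": 10**9,
}


def cpu_millis(quantity):
    """Convert a CPU quantity such as "100m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def mem_bytes(quantity):
    """Convert a memory quantity such as "500M" or "1Gi" to bytes."""
    for suffix, factor in _MEM_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain number of bytes


def gang_minimum(annotation):
    """Return, per task group, minMember times minResource."""
    totals = {}
    for group in json.loads(annotation):
        members = group["minMember"]
        resource = group["minResource"]
        totals[group["name"]] = {
            "members": members,
            "cpu_millis": members * cpu_millis(resource["cpu"]),
            "memory_bytes": members * mem_bytes(resource["memory"]),
        }
    return totals


if __name__ == "__main__":
    for name, total in gang_minimum(TASK_GROUPS).items():
        print(name, total)
```

For the third job this gives a gang minimum of 3 members, while the job itself runs 10 pods in parallel, so the task group covers only part of what the job requests.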
--
This message was sent by Atlassian Jira
(v8.3.4#803005)