Posted to dev@yunikorn.apache.org by "Wilfred Spiegelenburg (Jira)" <ji...@apache.org> on 2021/01/27 02:57:00 UTC

[jira] [Resolved] (YUNIKORN-516) Yunikorn scheduler seems to be in deadlock state

     [ https://issues.apache.org/jira/browse/YUNIKORN-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilfred Spiegelenburg resolved YUNIKORN-516.
--------------------------------------------
    Resolution: Duplicate

> Yunikorn scheduler seems to be in deadlock state
> ------------------------------------------------
>
>                 Key: YUNIKORN-516
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-516
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Ayub Pathan
>            Assignee: Wilfred Spiegelenburg
>            Priority: Blocker
>         Attachments: metrics, stack, yk.log
>
>
> Apply the job templates below to reproduce the issue.
> 1. First application with gang scheduling annotations
>   
> {noformat}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: batch-sleep-job-1
> spec:
>   completions: 2
>   parallelism: 2
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "batch-sleep-job-1"
>         queue: root.sandbox
>       annotations:
>         yunikorn.apache.org/task-group-name: tg1
>         yunikorn.apache.org/task-groups: |-
>           [{
>               "name": "tg1",
>               "minMember": 2,
>               "minResource": {
>                 "cpu": "100m",
>                 "memory": "500M"
>               },
>               "nodeSelector": {},
>               "tolerations": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep300
>           image: "alpine:latest"
>           command: ["sleep", "300"]
>           resources:
>             requests:
>               cpu: "100m"
>               memory: "500M" {noformat}
>  
> 2. Second application to the same task group
> {noformat}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: batch-sleep-job-2
> spec:
>   completions: 4
>   parallelism: 4
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "batch-sleep-job-2"
>         queue: root.sandbox
>       annotations:
>         yunikorn.apache.org/task-group-name: tg1
>         yunikorn.apache.org/task-groups: |-
>           [{
>               "name": "tg1",
>               "minMember": 2,
>               "minResource": {
>                 "cpu": "100m",
>                 "memory": "500M"
>               },
>               "nodeSelector": {},
>               "tolerations": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep300
>           image: "alpine:latest"
>           command: ["sleep", "300"]
>           resources:
>             requests:
>               cpu: "100m"
>               memory: "500M"{noformat}
>  
> 3. Third application to the same task group
> {noformat}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: batch-sleep-job-3
> spec:
>   completions: 10
>   parallelism: 10
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "batch-sleep-job-3"
>         queue: root.sandbox
>       annotations:
>         yunikorn.apache.org/task-group-name: tg1
>         yunikorn.apache.org/task-groups: |-
>           [{
>               "name": "tg1",
>               "minMember": 3,
>               "minResource": {
>                 "cpu": "100m",
>                 "memory": "500M"
>               },
>               "nodeSelector": {},
>               "tolerations": []
>           }]
>     spec:
>       schedulerName: yunikorn
>       restartPolicy: Never
>       containers:
>         - name: sleep300
>           image: "alpine:latest"
>           command: ["sleep", "300"]
>           resources:
>             requests:
>               cpu: "100m"
>               memory: "500M" {noformat}
> The third application now remains in the Pending state even though the placeholders were created and terminated.
> {noformat}
> NAME                     READY STATUS     RS CPU MEM %CPU/R  %MEM/R  %CPU/L  %MEM/L IP                NODE                                              QOS  AGE
> batch-sleep-job-1-7lrd5  0/1   Completed   0 n/a n/a    n/a     n/a     n/a     n/a 100.100.142.208   ip-10-192-143-108.ca-central-1.compute.internal   BU   18m
> batch-sleep-job-1-lw4t9  0/1   Completed   0 n/a n/a    n/a     n/a     n/a     n/a 100.100.134.213   ip-10-192-136-201.ca-central-1.compute.internal   BU   18m
> batch-sleep-job-2-c95dg  0/1   Completed   0 n/a n/a    n/a     n/a     n/a     n/a 100.100.142.210   ip-10-192-143-108.ca-central-1.compute.internal   BU   17m
> batch-sleep-job-2-vnfjb  0/1   Completed   0 n/a n/a    n/a     n/a     n/a     n/a 100.100.142.211   ip-10-192-143-108.ca-central-1.compute.internal   BU   17m
> batch-sleep-job-2-x4mcz  0/1   Completed   0 n/a n/a    n/a     n/a     n/a     n/a 100.100.134.216   ip-10-192-136-201.ca-central-1.compute.internal   BU   17m
> batch-sleep-job-2-ztnfq  0/1   Completed   0 n/a n/a    n/a     n/a     n/a     n/a 100.100.134.217   ip-10-192-136-201.ca-central-1.compute.internal   BU   17m
> batch-sleep-job-3-7tp5t  0/0   Pending     0 n/a n/a    n/a     n/a     n/a     n/a n/a               n/a                                               BU   16m
> batch-sleep-job-3-59mnj  0/0   Pending     0 n/a n/a    n/a     n/a     n/a     n/a n/a               n/a                                               BU   16m
> batch-sleep-job-3-bm4fd  0/0   Pending     0 n/a n/a    n/a     n/a     n/a     n/a n/a               n/a                                               BU   16m
> batch-sleep-job-3-c4mxg  0/0   Pending     0 n/a n/a    n/a     n/a     n/a     n/a n/a               n/a                                               BU   16m
> batch-sleep-job-3-cljfj  0/0   Pending     0 n/a n/a    n/a     n/a     n/a     n/a n/a               n/a                                               BU   16m
> batch-sleep-job-3-gcvnp  0/0   Pending     0 n/a n/a    n/a     n/a     n/a     n/a n/a               n/a                                               BU   16m
> batch-sleep-job-3-gwgnn  0/0   Pending     0 n/a n/a    n/a     n/a     n/a     n/a n/a               n/a                                               BU   16m
> batch-sleep-job-3-kj88t  0/0   Pending     0 n/a n/a    n/a     n/a     n/a     n/a n/a               n/a                                               BU   16m
> batch-sleep-job-3-p8c7w  0/0   Pending     0 n/a n/a    n/a     n/a     n/a     n/a n/a               n/a                                               BU   16m
> batch-sleep-job-3-td575  0/0   Pending     0 n/a n/a    n/a     n/a     n/a     n/a n/a               n/a                                               BU   16m{noformat}
> Attaching the [^stack] trace, [^yk.log], and the [^metrics] API response for reference. This was observed with the v0.10 build.
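
Editorial note, not part of the original report: when comparing the three templates it can help to see how much each task-groups annotation asks the scheduler to reserve. The following is a minimal Python sketch, assuming each task group reserves minMember placeholders of size minResource. The helper names (parse_cpu, parse_memory, placeholder_demand) are hypothetical and are not part of YuniKorn; the annotation values are copied from the templates, with the empty nodeSelector and tolerations fields left out for brevity.

```python
import json

# Task-group annotation values copied from the three job templates
# (empty "nodeSelector" and "tolerations" fields omitted).
ANNOTATIONS = {
    "batch-sleep-job-1": '[{"name": "tg1", "minMember": 2, '
                         '"minResource": {"cpu": "100m", "memory": "500M"}}]',
    "batch-sleep-job-2": '[{"name": "tg1", "minMember": 2, '
                         '"minResource": {"cpu": "100m", "memory": "500M"}}]',
    "batch-sleep-job-3": '[{"name": "tg1", "minMember": 3, '
                         '"minResource": {"cpu": "100m", "memory": "500M"}}]',
}

# Subset of Kubernetes quantity suffixes; enough for this illustration.
MEMORY_UNITS = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3,
    "k": 1000, "M": 1000**2, "G": 1000**3,
}


def parse_cpu(value: str) -> int:
    """Return a CPU quantity in millicores ("100m" -> 100, "2" -> 2000)."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(float(value) * 1000)


def parse_memory(value: str) -> int:
    """Return a memory quantity in bytes ("500M" -> 500000000)."""
    for suffix, factor in MEMORY_UNITS.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)


def placeholder_demand(annotation: str) -> dict:
    """Sum minMember * minResource over every task group in one annotation."""
    total = {"placeholders": 0, "cpu_millis": 0, "memory_bytes": 0}
    for group in json.loads(annotation):
        members = group["minMember"]
        resource = group["minResource"]
        total["placeholders"] += members
        total["cpu_millis"] += members * parse_cpu(resource["cpu"])
        total["memory_bytes"] += members * parse_memory(resource["memory"])
    return total


if __name__ == "__main__":
    for app_id, annotation in ANNOTATIONS.items():
        print(app_id, placeholder_demand(annotation))
```

Under that assumption the third application reserves three placeholders while its Job runs ten pods in parallel, so most of its pods would not be backed by a placeholder.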



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@yunikorn.apache.org
For additional commands, e-mail: dev-help@yunikorn.apache.org