Posted to dev@yunikorn.apache.org by "PoAn Yang (Jira)" <ji...@apache.org> on 2023/08/14 09:34:00 UTC

[jira] [Created] (YUNIKORN-1919) runningApps is not correct when app state from starting to completing

PoAn Yang created YUNIKORN-1919:
-----------------------------------

             Summary: runningApps is not correct when app state from starting to completing
                 Key: YUNIKORN-1919
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1919
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: core - scheduler
            Reporter: PoAn Yang
            Assignee: PoAn Yang
             Fix For: 1.4.0


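We increment runningApps when an app enters the Starting state [1], and we decrement it when the app leaves the Running state [2]. However, in some cases an app never enters the Running state, for example when it moves directly from Starting to Completing. In that case runningApps is never decremented and the count stays too high. As a result, a queue with a maxapplications limit cannot allocate another app [3].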
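
The counting problem can be illustrated with a small standalone model. This is only a sketch and not the yunikorn-core code: the State, Queue, Transition, CanRun and Simulate names are hypothetical, and the limit check is assumed to be a simple comparison of runningApps against the configured maximum.

```go
package main

import "fmt"

// State is a simplified, hypothetical application state (not the real yunikorn-core type).
type State int

const (
	Accepted State = iota
	Starting
	Running
	Completing
)

// Queue is a hypothetical stand-in for a scheduler queue with a maxapplications limit.
type Queue struct {
	maxApps     int
	runningApps int
}

// CanRun reports whether the queue may start another application
// (assumed check: the running count must be below the limit).
func (q *Queue) CanRun() bool {
	return q.runningApps < q.maxApps
}

// Transition applies the counting rule described in this report:
// increment when entering Starting, decrement only when leaving Running.
func (q *Queue) Transition(from, to State) {
	if to == Starting {
		q.runningApps++
	}
	if from == Running {
		q.runningApps--
	}
}

// Simulate replays a sequence of states for one app and returns the resulting count.
func Simulate(q *Queue, path []State) int {
	for i := 1; i < len(path); i++ {
		q.Transition(path[i-1], path[i])
	}
	return q.runningApps
}

func main() {
	q := &Queue{maxApps: 1}
	// The app skips Running, so the decrement never happens.
	Simulate(q, []State{Accepted, Starting, Completing})
	fmt.Println("runningApps:", q.runningApps, "canRun:", q.CanRun())
}
```

In this model an app that follows Accepted, Starting, Running, Completing leaves the count at 0, while an app that skips Running leaves it at 1, so a queue limited to one application rejects the next app.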

 

Steps to reproduce:

1. Set the queue config.

 
{noformat}
data:
  queues.yaml: |
    partitions:
    - name: default
      nodesortpolicy:
        type: fair
      queues:
      - name: root
        parent: true
        queues:
        - name: default # default queue for applications that don't specify a queue
          submitacl: '*'
        - name: sandbox1
          submitacl: '*'
          maxapplications: 1
{noformat}

2. Apply a deployment.

{noformat}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep-deployment
  labels:
    app: sleep-deployment
    applicationId: "sleep-deployment"
    queue: "root.sandbox1"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep-deployment
      applicationId: "sleep-deployment"
      queue: "root.sandbox1"
  template:
    metadata:
      labels:
        app: sleep-deployment
        applicationId: "sleep-deployment"
        queue: "root.sandbox1"
    spec:
      containers:
      - name: sleep-30s
        image: alpine:latest
        command: ["sleep", "30"]
{noformat}

3. Apply a job.

{noformat}
apiVersion: batch/v1
kind: Job
metadata:
  name: sleep-job
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: sleep-job
        applicationId: "sleep-job"
        queue: "root.sandbox1"
    spec:
      containers:
      - name: sleep-job
        image: alpine:latest
        command: ["sleep",  "30"]
      restartPolicy: Never
{noformat}

4. Delete the deployment.

5. The job's pod cannot start.

 

[1] [https://github.com/apache/yunikorn-core/blob/9abd5bff0b0340935f1a4467f433a941ad5f476f/pkg/scheduler/objects/application_state.go#L152]

[2] [https://github.com/apache/yunikorn-core/blob/9abd5bff0b0340935f1a4467f433a941ad5f476f/pkg/scheduler/objects/application_state.go#L188]

[3] [https://github.com/apache/yunikorn-core/blob/9abd5bff0b0340935f1a4467f433a941ad5f476f/pkg/scheduler/objects/queue.go#L1300-L1302]

 


