Posted to dev@yunikorn.apache.org by "PoAn Yang (Jira)" <ji...@apache.org> on 2023/08/14 09:34:00 UTC
[jira] [Created] (YUNIKORN-1919) runningApps is not correct when app state from starting to completing
PoAn Yang created YUNIKORN-1919:
-----------------------------------
Summary: runningApps is not correct when app state from starting to completing
Key: YUNIKORN-1919
URL: https://issues.apache.org/jira/browse/YUNIKORN-1919
Project: Apache YuniKorn
Issue Type: Bug
Components: core - scheduler
Reporter: PoAn Yang
Assignee: PoAn Yang
Fix For: 1.4.0
We increment runningApps when an app enters the Starting state [1], and we decrement it when the app leaves the Running state [2]. However, in some cases the app never enters the Running state (for example, it goes from Starting straight to Completing), so runningApps is never decremented and the count stays too high. As a result, the queue cannot allocate another app [3].
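The counting rule described above can be illustrated with a minimal Go sketch. This is not the yunikorn-core code: the State constants, the Queue type, and the Transition and CanRun methods are hypothetical names invented for this illustration, and treating a limit of 0 as "unlimited" is an assumption of the sketch.

```go
package main

import "fmt"

// State is a simplified, hypothetical stand-in for an application state.
// The real state machine has more states and transitions.
type State int

const (
	Accepted State = iota
	Starting
	Running
	Completing
)

// Queue is a hypothetical model of a queue with a maxapplications limit.
type Queue struct {
	maxApps     int // assumption of this sketch: 0 means no limit
	runningApps int
}

// CanRun models the limit check: no new app once the counter hits the limit.
func (q *Queue) CanRun() bool {
	return q.maxApps == 0 || q.runningApps < q.maxApps
}

// Transition applies the rule from the report:
// increment on entering Starting, decrement on leaving Running.
func (q *Queue) Transition(from, to State) {
	if to == Starting {
		q.runningApps++
	}
	if from == Running {
		q.runningApps--
	}
}

func main() {
	q := &Queue{maxApps: 1}

	// The app is deleted while still Starting, so it never reaches Running.
	q.Transition(Accepted, Starting)
	q.Transition(Starting, Completing)

	// The counter was incremented but never decremented.
	fmt.Println(q.runningApps, q.CanRun()) // prints "1 false"
}
```

In this model, an app that passes through Running (Starting, Running, Completing) leaves the counter at 0, while the Starting-to-Completing path leaves it at 1, which matches the blocked queue seen in the reproduction steps.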
Reproduce steps:
1. Set queue config.
{noformat}
data:
  queues.yaml: |
    partitions:
      - name: default
        nodesortpolicy:
          type: fair
        queues:
          - name: root
            parent: true
            queues:
              - name: default # default queue for applications that don't specify a queue
                submitacl: '*'
              - name: sandbox1
                submitacl: '*'
                maxapplications: 1
{noformat}
2. Apply a deployment.
{noformat}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sleep-deployment
  labels:
    app: sleep-deployment
    applicationId: "sleep-deployment"
    queue: "root.sandbox1"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sleep-deployment
      applicationId: "sleep-deployment"
      queue: "root.sandbox1"
  template:
    metadata:
      labels:
        app: sleep-deployment
        applicationId: "sleep-deployment"
        queue: "root.sandbox1"
    spec:
      containers:
        - name: sleep-30s
          image: alpine:latest
          command: ["sleep", "30"]
{noformat}
3. Apply a job.
{noformat}
apiVersion: batch/v1
kind: Job
metadata:
  name: sleep-job
spec:
  parallelism: 1
  template:
    metadata:
      labels:
        app: sleep-job
        applicationId: "sleep-job"
        queue: "root.sandbox1"
    spec:
      containers:
        - name: sleep-job
          image: alpine:latest
          command: ["sleep", "30"]
      restartPolicy: Never
{noformat}
4. Delete the deployment.
5. The job's pod never gets started, because the queue still counts the deleted deployment's app as running.
[1] [https://github.com/apache/yunikorn-core/blob/9abd5bff0b0340935f1a4467f433a941ad5f476f/pkg/scheduler/objects/application_state.go#L152]
[2] [https://github.com/apache/yunikorn-core/blob/9abd5bff0b0340935f1a4467f433a941ad5f476f/pkg/scheduler/objects/application_state.go#L188]
[3] [https://github.com/apache/yunikorn-core/blob/9abd5bff0b0340935f1a4467f433a941ad5f476f/pkg/scheduler/objects/queue.go#L1300-L1302]