Posted to issues@yunikorn.apache.org by "Peter Bacsko (Jira)" <ji...@apache.org> on 2022/03/30 13:36:00 UTC

[jira] [Comment Edited] (YUNIKORN-1161) Pods not linked to placeholders are stuck in Running state if YK is restarted

    [ https://issues.apache.org/jira/browse/YUNIKORN-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514705#comment-17514705 ] 

Peter Bacsko edited comment on YUNIKORN-1161 at 3/30/22, 1:35 PM:
------------------------------------------------------------------

I managed to find the root cause.

State transitions when everything works properly:
{{New}} -> {{Accepted}} -> {{Started}}

When YK restarts, the app ends up in {{Resuming}} state:
{{New}} -> {{Resuming}} -> {{Accepted}}
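For illustration, the two paths can be sketched as a tiny transition table (a simplified sketch only; the scheduler core implements this as a proper state machine, and the event names other than {{runApplication}} below are hypothetical):
{noformat}
// Simplified sketch of the two transition paths above - not the real
// application state machine, just an illustration of the table it encodes.
type transition struct {
	from, event string
}

var transitions = map[transition]string{
	// normal path: New -> Accepted -> Started
	{"New", "runApplication"}:      "Accepted",
	{"Accepted", "runApplication"}: "Started", // assumed event name
	// restart path: New -> Resuming -> Accepted
	{"New", "recoverApplication"}:  "Resuming", // hypothetical event name
	{"Resuming", "runApplication"}: "Accepted", // matches the log excerpt below
}

func nextState(from, event string) (string, bool) {
	to, ok := transitions[transition{from, event}]
	return to, ok
}
{noformat}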

When the placeholder times out, we don't hit the first branch; instead, we fall through to the {{default}} case:
{noformat}
func (sa *Application) timeoutPlaceholderProcessing() {
	sa.Lock()
	defer sa.Unlock()

	switch {
	// Case 1: if all app's placeholders are allocated, only part of them gets replaced, just delete the remaining placeholders
	case (sa.IsRunning() || sa.IsStarting() || sa.IsCompleting()) && !resources.IsZero(sa.allocatedPlaceholder):
		// ... code skipped - the releases are collected here
		sa.notifyRMAllocationReleased(sa.rmID, toRelease, si.TerminationType_TIMEOUT, "releasing allocated placeholders on placeholder timeout")
	default:
		log.Logger().Info("Placeholder timeout, releasing asks and placeholders",
			zap.String("AppID", sa.ApplicationID),
			zap.Int("releasing placeholders", len(sa.allocations)),
			zap.Int("releasing asks", len(sa.requests)),
			zap.String("gang scheduling style", sa.gangSchedulingStyle))
		// change the status of the app to Failing. Once all the placeholders are cleaned up, it will be changed to Failed
		event := ResumeApplication
		if sa.gangSchedulingStyle == Hard {
			event = FailApplication
		}
{noformat}
Since "soft" gang scheduling was used in my example, a {{ResumeApplication}} event is sent, which puts the application back to {{Accepted}}:
{noformat}
cache/task.go:563	releasing allocations	{"numOfAsksToRelease": 0, "numOfAllocationsToRelease": 1}
2022-03-28T13:56:55.797Z	INFO	scheduler/partition.go:1295	removing allocation from application	{"appID": "batch-sleep-job-3", "allocationId": "9c21b7b6-83d0-449e-805a-dbda2a5e0dd5", "terminationType": "TIMEOUT"}
2022-03-28T13:56:55.797Z	INFO	objects/application_state.go:128	Application state transition	{"appID": "batch-sleep-job-3", "source": "Resuming", "destination": "Accepted", "event": "runApplication"}
{noformat}
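To spell out the decision: a resumed app is in {{Resuming}}, so none of the {{IsRunning()}}/{{IsStarting()}}/{{IsCompleting()}} guards match, the {{default}} branch releases the asks and placeholders, and the event is chosen purely from the gang scheduling style. A condensed sketch of that logic (simplified from the switch above, with a hypothetical label for the first case):
{noformat}
// Condensed sketch of the placeholder-timeout decision shown above
// (illustration only, not the full timeoutPlaceholderProcessing method).
func placeholderTimeoutEvent(state string, hasAllocatedPlaceholders bool, style string) string {
	inReplaceableState := state == "Running" || state == "Starting" || state == "Completing"
	if inReplaceableState && hasAllocatedPlaceholders {
		return "ReleaseRemainingPlaceholders" // hypothetical label for case 1
	}
	// default branch: asks and placeholders are released, then resume or fail
	if style == "Hard" {
		return "FailApplication"
	}
	return "ResumeApplication" // soft style: the app goes back to Accepted
}
{noformat}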


> Pods not linked to placeholders are stuck in Running state if YK is restarted
> -----------------------------------------------------------------------------
>
>                 Key: YUNIKORN-1161
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1161
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: shim - kubernetes
>            Reporter: Peter Bacsko
>            Assignee: Peter Bacsko
>            Priority: Major
>         Attachments: logs-from-yunikorn-scheduler-k8s-in-yunikorn-scheduler-after_restart_nomatchingtaskgroupname.txt, logs-from-yunikorn-scheduler-k8s-in-yunikorn-scheduler-before_restart_nomatchingtaskgroupname.txt, pods_nomatchingtaskgroupname.txt
>
>
> If we create pods where the name of the task group does not match the {{task-group-name}} annotation, then the real pods will not transition to the {{Running}} state when the placeholder pods expire and YuniKorn has been restarted in the meantime.
> For example, extend the sleep batch job like this:
> {noformat}
> apiVersion: batch/v1
> kind: Job
> metadata:
>   name: batch-sleep-job-9
> spec:
>   completions: 5
>   parallelism: 5
>   template:
>     metadata:
>       labels:
>         app: sleep
>         applicationId: "batch-sleep-job-9"
>         queue: root.sandbox
>       annotations:
>         yunikorn.apache.org/task-group-name: sleep-groupxxx
>         yunikorn.apache.org/task-groups: |-
>           [{
>               "name": "sleep-group",
>               "minMember": 5,
>               "minResource": {
>                 "cpu": "100m",
>                 "memory": "2000M"
>               },
>               "nodeSelector": {},
>               "tolerations": []
>           }]
> ...
> {noformat}
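> The key detail is that the {{task-group-name}} annotation ({{sleep-groupxxx}}) does not match any entry in the {{task-groups}} list ({{sleep-group}}), so the real pods cannot be linked to any placeholder. A minimal sketch of that lookup (illustrative only, not the shim's actual code; the type and function names are made up):
> {noformat}
> // Illustrative lookup: a pod whose task-group-name annotation is not
> // present in the task-groups list has no matching group, so no
> // placeholder can ever be linked to it.
> type TaskGroup struct {
> 	Name      string
> 	MinMember int32
> }
> 
> func findTaskGroup(annotation string, groups []TaskGroup) (*TaskGroup, bool) {
> 	for i := range groups {
> 		if groups[i].Name == annotation {
> 			return &groups[i], true
> 		}
> 	}
> 	return nil, false // "sleep-groupxxx" vs "sleep-group" ends up here
> }
> {noformat}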
> Submit the job and restart YuniKorn while the placeholders are already running.
> This results in "batch-sleep-job-9-nnnnn" pods that never transition to {{Running}} and have to be terminated manually.
> {noformat}
> $ kubectl get pods -A | grep -E "(batch-sleep-job-9|yunikorn)"
> default                batch-sleep-job-9-hgxxl                          0/1     Pending     0          20m
> default                batch-sleep-job-9-j6twt                          0/1     Pending     0          20m
> default                batch-sleep-job-9-l4jhm                          0/1     Pending     0          20m
> default                batch-sleep-job-9-swlm4                          0/1     Pending     0          20m
> default                batch-sleep-job-9-z6wqx                          0/1     Pending     0          20m
> default                yunikorn-admission-controller-78c775cfd9-6pp8d   1/1     Running     4          3d22h
> default                yunikorn-scheduler-77dd7c665b-f8kkn              2/2     Running     0          18m
> {noformat}
> Note that without a YK restart, they are deallocated and removed properly.


