Posted to issues@yunikorn.apache.org by "Peter Bacsko (Jira)" <ji...@apache.org> on 2022/03/25 14:22:00 UTC

[jira] [Comment Edited] (YUNIKORN-560) Yunikorn recovery deletes existing placeholders

    [ https://issues.apache.org/jira/browse/YUNIKORN-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512406#comment-17512406 ] 

Peter Bacsko edited comment on YUNIKORN-560 at 3/25/22, 2:21 PM:
-----------------------------------------------------------------

I tried to repro this problem with Minikube version 1.22.0.

I used {{kubectl scale deployment}} and {{kubectl delete pod}} to restart YK, but I could not reproduce the issue: no placeholder pods were deleted.

What I did see, though, is that after the restart, the placeholders do not time out and just keep running:
{noformat}
default                batch-sleep-job-3-5drfs                          0/1     Pending     0          37m
default                batch-sleep-job-3-cfl7c                          0/1     Pending     0          37m
default                batch-sleep-job-3-fvddw                          0/1     Pending     0          37m
default                batch-sleep-job-3-jqhnb                          0/1     Pending     0          37m
default                batch-sleep-job-3-v5qz4                          0/1     Pending     0          37m
default                tg-sleep-group-batch-sleep-job-3-0               1/1     Running     0          37m
default                tg-sleep-group-batch-sleep-job-3-1               1/1     Running     0          37m
default                tg-sleep-group-batch-sleep-job-3-2               1/1     Running     0          37m
default                tg-sleep-group-batch-sleep-job-3-3               1/1     Running     0          37m
default                tg-sleep-group-batch-sleep-job-3-4               1/1     Running     0          37m
default                tg-sleep-group-batch-sleep-job-3-5               1/1     Running     0          37m
default                yunikorn-admission-controller-78c775cfd9-6pp8d   1/1     Running     2          158m
default                yunikorn-scheduler-77dd7c665b-jcmt6              2/2     Running     0          11m
{noformat}
Eventually, the {{tg-sleep}} placeholder pods disappeared, but the {{batch-sleep-job}} pods did not transition into the Running state:
{noformat}
default                batch-sleep-job-3-5drfs                          0/1     Pending     0          57m
default                batch-sleep-job-3-cfl7c                          0/1     Pending     0          57m
default                batch-sleep-job-3-fvddw                          0/1     Pending     0          57m
default                batch-sleep-job-3-jqhnb                          0/1     Pending     0          57m
default                batch-sleep-job-3-v5qz4                          0/1     Pending     0          57m
default                yunikorn-admission-controller-78c775cfd9-6pp8d   1/1     Running     2          178m
default                yunikorn-scheduler-77dd7c665b-jcmt6              2/2     Running     0          31m
{noformat}
As a "bonus", we still have the total container count problem:
{noformat}
2022-03-25T13:59:31.633Z	WARN	metrics/metrics_collector.go:85	Could not calculate the totalContainersRunning.	{"allocatedContainers": 0, "releasedContainers": 6}
2022-03-25T13:59:35.018Z	INFO	shim/scheduler.go:356	No outstanding apps found for a while	{"timeout": "2m0s"}
2022-03-25T14:00:31.633Z	WARN	metrics/metrics_collector.go:85	Could not calculate the totalContainersRunning.	{"allocatedContainers": 0, "releasedContainers": 6}
2022-03-25T14:01:31.634Z	WARN	metrics/metrics_collector.go:85	Could not calculate the totalContainersRunning.	{"allocatedContainers": 0, "releasedContainers": 6}
2022-03-25T14:01:35.021Z	INFO	shim/scheduler.go:356	No outstanding apps found for a while	{"timeout": "2m0s"}
2022-03-25T14:02:31.638Z	WARN	metrics/metrics_collector.go:85	Could not calculate the totalContainersRunning.	{"allocatedContainers": 0, "releasedContainers": 6}
2022-03-25T14:03:31.634Z	WARN	metrics/metrics_collector.go:85	Could not calculate the totalContainersRunning.	{"allocatedContainers": 0, "releasedContainers": 6}
2022-03-25T14:03:35.022Z	INFO	shim/scheduler.go:356	No outstanding apps found for a while	{"timeout": "2m0s"}
{noformat}
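The warning is consistent with the collector deriving the running count as allocated minus released: after recovery the allocated counter starts from zero again while the released counter keeps growing, so the subtraction would go negative. A minimal sketch of that arithmetic (illustrative names only, this is not the actual {{metrics_collector.go}} code):
{code:go}
package main

import "fmt"

// totalContainersRunning is a hypothetical re-creation of what the collector
// appears to compute: running = allocated - released. When released exceeds
// allocated (allocated reset to 0 on recovery, released still at 6), the
// result would be negative, so only a warning can be logged.
func totalContainersRunning(allocated, released int) (int, error) {
	if released > allocated {
		return 0, fmt.Errorf("could not calculate totalContainersRunning: allocatedContainers=%d, releasedContainers=%d", allocated, released)
	}
	return allocated - released, nil
}

func main() {
	// Values taken from the log above: 0 allocated, 6 released.
	if _, err := totalContainersRunning(0, 6); err != nil {
		fmt.Println(err)
	}
}
{code}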
To me it looks like we have three problems:
1) Placeholder timers are restarted with the full, fixed timeout after recovery; we don't account for the time the pod has already spent in the {{Running}} state (see the sketch after this list).
2) Applications are stuck in {{Pending}}.
3) Metrics are reset to zero after recovery, so the allocation counters are not set properly.
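For 1), a minimal sketch of how the placeholder timer could be armed after recovery, assuming the pod's start time is available; {{remainingTimeout}} is a hypothetical helper, not existing YuniKorn code:
{code:go}
package main

import (
	"fmt"
	"time"
)

// remainingTimeout arms the placeholder timer with only the time left from the
// configured timeout: subtract what the pod has already spent running and clamp
// at zero so an already-expired placeholder fires immediately.
func remainingTimeout(configured time.Duration, podStart, now time.Time) time.Duration {
	elapsed := now.Sub(podStart)
	if elapsed >= configured {
		return 0
	}
	return configured - elapsed
}

func main() {
	configured := 15 * time.Minute
	podStart := time.Now().Add(-37 * time.Minute) // placeholder has been Running for 37m, as in the listing above
	fmt.Println(remainingTimeout(configured, podStart, time.Now())) // 0s: should time out right away
}
{code}
With the placeholders already {{Running}} for 37 minutes, any reasonable timeout has long expired, so the timer should fire immediately instead of waiting for another full period.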



> Yunikorn recovery deletes existing placeholders
> -----------------------------------------------
>
>                 Key: YUNIKORN-560
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-560
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: shim - kubernetes
>            Reporter: Kinga Marton
>            Assignee: Peter Bacsko
>            Priority: Major
>              Labels: recovery
>
> On recovery, Yunikorn may intermittently delete placeholder pods. To reproduce, submit a gang job with minMembers > job parallelism (to guarantee that some placeholders are running), then delete the YuniKorn scheduler pod.
> After recovery, there may not be any placeholder pods remaining.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@yunikorn.apache.org
For additional commands, e-mail: issues-help@yunikorn.apache.org