Posted to issues@yunikorn.apache.org by "Peter Bacsko (Jira)" <ji...@apache.org> on 2022/03/25 14:22:00 UTC
[jira] [Comment Edited] (YUNIKORN-560) Yunikorn recovery deletes existing placeholders
[ https://issues.apache.org/jira/browse/YUNIKORN-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512406#comment-17512406 ]
Peter Bacsko edited comment on YUNIKORN-560 at 3/25/22, 2:21 PM:
-----------------------------------------------------------------
I tried to repro this problem with Minikube version 1.22.0.
I used {{kubectl scale deployment}} and {{kubectl delete pod}} to restart YK, but I could not reproduce the issue: no placeholder pods were deleted.
What I did see, though, is that after a restart, placeholders do not time out and just keep running:
{noformat}
default batch-sleep-job-3-5drfs 0/1 Pending 0 37m
default batch-sleep-job-3-cfl7c 0/1 Pending 0 37m
default batch-sleep-job-3-fvddw 0/1 Pending 0 37m
default batch-sleep-job-3-jqhnb 0/1 Pending 0 37m
default batch-sleep-job-3-v5qz4 0/1 Pending 0 37m
default tg-sleep-group-batch-sleep-job-3-0 1/1 Running 0 37m
default tg-sleep-group-batch-sleep-job-3-1 1/1 Running 0 37m
default tg-sleep-group-batch-sleep-job-3-2 1/1 Running 0 37m
default tg-sleep-group-batch-sleep-job-3-3 1/1 Running 0 37m
default tg-sleep-group-batch-sleep-job-3-4 1/1 Running 0 37m
default tg-sleep-group-batch-sleep-job-3-5 1/1 Running 0 37m
default yunikorn-admission-controller-78c775cfd9-6pp8d 1/1 Running 2 158m
default yunikorn-scheduler-77dd7c665b-jcmt6 2/2 Running 0 11m
{noformat}
Eventually the {{tg-sleep}} placeholder pods disappeared, but the {{batch-sleep-job}} pods never transitioned to the {{Running}} state:
{noformat}
default batch-sleep-job-3-5drfs 0/1 Pending 0 57m
default batch-sleep-job-3-cfl7c 0/1 Pending 0 57m
default batch-sleep-job-3-fvddw 0/1 Pending 0 57m
default batch-sleep-job-3-jqhnb 0/1 Pending 0 57m
default batch-sleep-job-3-v5qz4 0/1 Pending 0 57m
default yunikorn-admission-controller-78c775cfd9-6pp8d 1/1 Running 2 178m
default yunikorn-scheduler-77dd7c665b-jcmt6 2/2 Running 0 31m
{noformat}
As a "bonus", we still have the total container count problem:
{noformat}
2022-03-25T13:59:31.633Z WARN metrics/metrics_collector.go:85 Could not calculate the totalContainersRunning. {"allocatedContainers": 0, "releasedContainers": 6}
2022-03-25T13:59:35.018Z INFO shim/scheduler.go:356 No outstanding apps found for a while {"timeout": "2m0s"}
2022-03-25T14:00:31.633Z WARN metrics/metrics_collector.go:85 Could not calculate the totalContainersRunning. {"allocatedContainers": 0, "releasedContainers": 6}
2022-03-25T14:01:31.634Z WARN metrics/metrics_collector.go:85 Could not calculate the totalContainersRunning. {"allocatedContainers": 0, "releasedContainers": 6}
2022-03-25T14:01:35.021Z INFO shim/scheduler.go:356 No outstanding apps found for a while {"timeout": "2m0s"}
2022-03-25T14:02:31.638Z WARN metrics/metrics_collector.go:85 Could not calculate the totalContainersRunning. {"allocatedContainers": 0, "releasedContainers": 6}
2022-03-25T14:03:31.634Z WARN metrics/metrics_collector.go:85 Could not calculate the totalContainersRunning. {"allocatedContainers": 0, "releasedContainers": 6}
2022-03-25T14:03:35.022Z INFO shim/scheduler.go:356 No outstanding apps found for a while {"timeout": "2m0s"}
{noformat}
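The warning in the log above suggests the running-container count is derived as allocated minus released, which becomes unreportable when the allocation counter is reset on restart while releases keep accumulating. A minimal sketch of that presumed arithmetic (the function name and error handling here are assumptions, not the actual YuniKorn metrics code):

```go
package main

import "fmt"

// totalContainersRunning mimics the arithmetic behind the warning above:
// the running count is derived as allocated - released, and a negative
// result cannot be reported. After a scheduler restart the allocated
// counter starts from zero while releases keep being counted, so
// allocatedContainers=0 and releasedContainers=6 makes the calculation
// fail. This is a sketch of the presumed logic, not the real code.
func totalContainersRunning(allocated, released int) (int, error) {
	if released > allocated {
		return 0, fmt.Errorf("could not calculate totalContainersRunning: allocated=%d, released=%d", allocated, released)
	}
	return allocated - released, nil
}

func main() {
	// Reproduces the situation in the log: 0 allocated, 6 released.
	if _, err := totalContainersRunning(0, 6); err != nil {
		fmt.Println("WARN", err)
	}
}
```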
To me it looks like we have three problems:
1) Placeholder timers are restarted with the full, fixed timeout. We don't account for the time the pod has already spent in the {{Running}} state before the restart.
2) Applications are stuck in {{Pending}}.
3) Metrics are reset to zero on restart, so the allocation counters are not set properly.
> Yunikorn recovery deletes existing placeholders
> -----------------------------------------------
>
> Key: YUNIKORN-560
> URL: https://issues.apache.org/jira/browse/YUNIKORN-560
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: shim - kubernetes
> Reporter: Kinga Marton
> Assignee: Peter Bacsko
> Priority: Major
> Labels: recovery
>
> On recovery, Yunikorn may intermittently delete placeholder pods. To reproduce, submit a gang job with minMembers > job parallelism (to guarantee that there are some placeholders running) and then delete yunikorn scheduler pod.
> After recovery, there may not be any placeholder pods remaining.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@yunikorn.apache.org
For additional commands, e-mail: issues-help@yunikorn.apache.org