Posted to issues@yunikorn.apache.org by "Wilfred Spiegelenburg (Jira)" <ji...@apache.org> on 2023/03/21 03:31:00 UTC

[jira] [Commented] (YUNIKORN-1642) Scheduler recovery failed due to listing operation timeout

    [ https://issues.apache.org/jira/browse/YUNIKORN-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703012#comment-17703012 ] 

Wilfred Spiegelenburg commented on YUNIKORN-1642:
-------------------------------------------------

In YUNIKORN-1609 we have made the timeout configurable for YuniKorn 1.3 and later.
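
For illustration, a configurable informer sync timeout looks roughly like the sketch below. The function and parameter names are assumptions for this sketch, not the actual YUNIKORN-1609 change; the timeout value is assumed to be read from the scheduler configuration instead of being hard-coded.

{noformat}
package client

import (
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

// syncInformers waits for all informer caches to sync, giving up after the
// given timeout. The timeout is assumed to come from a (hypothetical)
// configuration setting rather than a hard-coded constant.
func syncInformers(timeout time.Duration, hasSynced ...cache.InformerSynced) error {
	stopCh := make(chan struct{})
	// Close the stop channel when the timeout expires; WaitForCacheSync
	// then returns false instead of blocking forever.
	timer := time.AfterFunc(timeout, func() { close(stopCh) })
	defer timer.Stop()

	if !cache.WaitForCacheSync(stopCh, hasSynced...) {
		return fmt.Errorf("timeout waiting for informer caches to sync after %v", timeout)
	}
	return nil
}
{noformat}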

I agree that just logging a WARN level message and proceeding is not right. Crashing the scheduler and hoping the next instance does better is also wrong.

We need to better understand how recovery time is affected by large clusters. We need to recover to make sure we have the right state. There have already been changes to "save" the events that come in while we recover and process them as soon as recovery is done.
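
Roughly, that "save and replay" idea looks like the sketch below; the types and names are made up for illustration and are not the actual shim code.

{noformat}
package client

import "sync"

type Event interface{}

// eventBuffer queues events that arrive while recovery is in progress and
// replays them once recovery completes. All names here are illustrative.
type eventBuffer struct {
	sync.Mutex
	recovering bool
	pending    []Event
	handle     func(Event) // the real handler used outside of recovery
}

// onEvent queues the event while recovering, otherwise handles it directly.
func (b *eventBuffer) onEvent(e Event) {
	b.Lock()
	if b.recovering {
		b.pending = append(b.pending, e)
		b.Unlock()
		return
	}
	b.Unlock()
	b.handle(e)
}

// recoveryDone flushes everything that arrived while recovery was running.
func (b *eventBuffer) recoveryDone() {
	b.Lock()
	b.recovering = false
	queued := b.pending
	b.pending = nil
	b.Unlock()
	for _, e := range queued {
		b.handle(e)
	}
}
{noformat}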

Do we have some general numbers on the cluster size: nodes, pods, etc.? YUNIKORN-1609 was logged after recovery failed on a 3000 node cluster. Better to get some details, and maybe even remove the timeout or make it much larger...

> Scheduler recovery failed due to listing operation timeout
> ----------------------------------------------------------
>
>                 Key: YUNIKORN-1642
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1642
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Major
>              Labels: pull-request-available
>
> The listing operation in the recovery phase (https://github.com/apache/yunikorn-k8shim/blob/c25ac60ffbc175c4966f917da21d184f34dea7b4/pkg/client/apifactory.go#L225) can sometimes fail on large clusters, because the response time from the API server is not guaranteed. We then see logs like this:
> {noformat}
> 2023-03-16T07:00:46.181Z	WARN	client/apifactory.go:218	Failed to sync informers	{"error": "timeout waiting for condition"}
> 2023-03-16T07:00:46.182Z	INFO	general/general.go:344	Pod list retrieved from api server	{"nr of pods": 0}
> 2023-03-16T07:00:46.182Z	INFO	general/general.go:365	Application recovery statistics	{"nr of recoverable apps": 0, "nr of total pods": 0, "nr of pods without application metadata": 0, "nr of pods to be recovered": 0}
> I0316 07:00:51.319100       1 trace.go:205] Trace[140954425]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.20.11/tools/cache/reflector.go:167 (16-Mar-2023 07:00:16.168) (total time: 35150ms):
> {noformat}
> Since it is only a WARN, the scheduler continues even though the informers did not return anything. This leads the scheduler to believe that nothing needs to be recovered, and it goes ahead with scheduling. This causes subsequent scheduler failures, and eventually nothing can be scheduled anymore.
> This should be a FATAL error, so the scheduler can be restarted to retry recovery.
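
For context, what the description asks for amounts to aborting instead of warning when the informer sync times out, along the lines of this sketch. This is illustrative only; the real code lives in pkg/client/apifactory.go and uses its own helpers.

{noformat}
package client

import (
	"time"

	"go.uber.org/zap"
	"k8s.io/client-go/tools/cache"
)

// mustSyncInformers aborts the process when the informer caches fail to sync
// within the timeout, so the pod restart retries recovery instead of the
// scheduler continuing with empty caches. Names here are illustrative.
func mustSyncInformers(log *zap.Logger, timeout time.Duration, synced ...cache.InformerSynced) {
	stopCh := make(chan struct{})
	timer := time.AfterFunc(timeout, func() { close(stopCh) })
	defer timer.Stop()

	if !cache.WaitForCacheSync(stopCh, synced...) {
		// Fatal logs and exits with a non-zero status; Kubernetes then
		// restarts the scheduler pod, which retries the full recovery.
		log.Fatal("failed to sync informers during recovery",
			zap.Duration("timeout", timeout))
	}
}
{noformat}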



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@yunikorn.apache.org
For additional commands, e-mail: issues-help@yunikorn.apache.org