Posted to dev@yunikorn.apache.org by "Weiwei Yang (Jira)" <ji...@apache.org> on 2023/03/20 22:31:00 UTC

[jira] [Created] (YUNIKORN-1642) Scheduler recovery failed due to listing operation timeout

Weiwei Yang created YUNIKORN-1642:
-------------------------------------

             Summary: Scheduler recovery failed due to listing operation timeout
                 Key: YUNIKORN-1642
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1642
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: shim - kubernetes
            Reporter: Weiwei Yang


The listing operation in the recovery phase (https://github.com/apache/yunikorn-k8shim/blob/c25ac60ffbc175c4966f917da21d184f34dea7b4/pkg/client/apifactory.go#L225) can sometimes fail on large clusters, because the response time from the API server is not guaranteed. When that happens, we see logs like this:

{noformat}
2023-03-16T07:00:46.181Z	WARN	client/apifactory.go:218	Failed to sync informers	{"error": "timeout waiting for condition"}
2023-03-16T07:00:46.182Z	INFO	general/general.go:344	Pod list retrieved from api server	{"nr of pods": 0}
2023-03-16T07:00:46.182Z	INFO	general/general.go:365	Application recovery statistics	{"nr of recoverable apps": 0, "nr of total pods": 0, "nr of pods without application metadata": 0, "nr of pods to be recovered": 0}
I0316 07:00:51.319100       1 trace.go:205] Trace[140954425]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.20.11/tools/cache/reflector.go:167 (16-Mar-2023 07:00:16.168) (total time: 35150ms):
{noformat}

Since the failure is only logged as a WARN, the scheduler continues even though the informers did not return anything. The scheduler then wrongly concludes that nothing needs to be recovered and goes ahead with scheduling. This causes subsequent scheduler failures, and eventually nothing can be scheduled anymore.

This should be a FATAL error, so that the scheduler can be restarted to retry the recovery.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)