Posted to issues@yunikorn.apache.org by "Weiwei Yang (Jira)" <ji...@apache.org> on 2021/07/16 04:40:00 UTC

[jira] [Commented] (YUNIKORN-703) During recovery no nodes are added to the scheduler cache

    [ https://issues.apache.org/jira/browse/YUNIKORN-703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17381754#comment-17381754 ] 

Weiwei Yang commented on YUNIKORN-703:
--------------------------------------

See our discussion in [https://yunikornworkspace.slack.com/archives/CL9CRJ1KM/p1626406603067200].

Basically, a queue was removed while jobs were still running in it. The queue was removed via a direct update to the configmap. Since the queue doesn't exist anymore, YK rejects the app, and during recovery the pods belonging to that app could not be recovered because the app was rejected. Recovery therefore failed, which leads to what we are seeing here. The log is attached: the scheduler was waiting for recovery, which has a 3-minute timeout, and before the timeout was reached the scheduler was aborted from outside.

Most things are working as expected, except the following issues:

1. The following message should be logged at ERROR level instead of INFO:

{code}

2021-07-16T03:21:46.199Z INFO scheduler/context.go:460 Failed to add application to partition (placement rejected) {"applicationID": "yunikorn-remote-shuffle-service-autogen", "partitionName": "[mycluster]default", "error": "application 'yunikorn-remote-shuffle-service-autogen' rejected, cannot create queue 'root.spark' without placement rules"}

{code}

2. During recovery, when some pods cannot be recovered because their app was rejected, we should fail fast instead of waiting for the 3-minute timeout. We also need to log a detailed message explaining why recovery failed.

> During recovery no nodes are added to the scheduler cache
> ---------------------------------------------------------
>
>                 Key: YUNIKORN-703
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-703
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Chaoran Yu
>            Priority: Major
>             Fix For: 1.0.0
>
>
> When the scheduler is installed or restarted, sometimes (about 1 in 3 times) no nodes are added to the cache during the initial recovery phase. The nodes REST endpoint (/ws/v1/nodes) shows an empty list. The issue goes away if the restart is attempted multiple times.
> I'll add some logs later to this Jira once I get some


