Posted to issues@yunikorn.apache.org by "Craig Condit (Jira)" <ji...@apache.org> on 2024/01/04 18:14:00 UTC

[jira] [Resolved] (YUNIKORN-2099) [Umbrella] State initialisation simplification (phase 2)

     [ https://issues.apache.org/jira/browse/YUNIKORN-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Condit resolved YUNIKORN-2099.
------------------------------------
    Fix Version/s: 1.5.0
       Resolution: Fixed

Resolving as all subtasks are complete.

> [Umbrella] State initialisation simplification (phase 2)
> --------------------------------------------------------
>
>                 Key: YUNIKORN-2099
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2099
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>          Components: core - scheduler, shim - kubernetes
>            Reporter: Craig Condit
>            Assignee: Craig Condit
>            Priority: Major
>             Fix For: 1.5.0
>
>
> At startup, the full state of the cluster is rebuilt. This is called recovery, but the name is misleading: nothing is recovered, the current state is simply loaded. State initialisation is a better term.
> The current recovery code ties the loading of applications and tasks (pods) to node loading. This makes the recovery code complex and therefore fragile. In a worst-case scenario, it could lead to a pod not being recovered correctly.
> Recovery should be a step-by-step process with clearly bounded steps:
>  * load nodes
>  ** register nodes with the core
>  * load pods
>  ** create applications in core
>  ** register running pods as allocations with the core
>  ** register pending pods as asks with the core
>  * process changes for nodes and pods
>  * start scheduling
> No nodes, applications, or asks on existing applications should be rejected. Even if the queue does not exist, a running application must be added and handled. The current behaviour of rejecting an application that cannot be placed in a queue is incorrect.
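
The ordered steps in the description above can be illustrated with a minimal Go sketch. It is not YuniKorn code: the Core and Pod types, the InitialiseState function, and the "root.fallback" queue name are all hypothetical, and placing an application in a fallback queue is only one possible way to avoid rejecting it when its queue does not exist.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified view of a Kubernetes pod.
type Pod struct {
	Name    string
	AppID   string
	Queue   string
	Running bool // true: already running on a node; false: still pending
}

// Core is a hypothetical stand-in for the scheduler core, not the real API.
type Core struct {
	queues      map[string]bool
	Nodes       []string
	Apps        map[string]string // application ID -> queue it was placed in
	Allocations []string
	Asks        []string
	Scheduling  bool
}

// fallbackQueue is an assumed name, used only for this illustration.
const fallbackQueue = "root.fallback"

// NewCore creates a core that knows about the given queues.
func NewCore(queues ...string) *Core {
	c := &Core{queues: map[string]bool{}, Apps: map[string]string{}}
	for _, q := range queues {
		c.queues[q] = true
	}
	return c
}

// InitialiseState runs the steps in a fixed order and never rejects input.
func InitialiseState(core *Core, nodes []string, pods []Pod) {
	// Step 1: load nodes and register them with the core.
	core.Nodes = append(core.Nodes, nodes...)

	// Step 2: load pods.
	for _, p := range pods {
		// Create the application if it is not known yet. An unknown queue
		// does not cause a rejection; the application goes to the fallback.
		if _, known := core.Apps[p.AppID]; !known {
			queue := p.Queue
			if !core.queues[queue] {
				queue = fallbackQueue
			}
			core.Apps[p.AppID] = queue
		}
		if p.Running {
			// Running pods become allocations.
			core.Allocations = append(core.Allocations, p.Name)
		} else {
			// Pending pods become asks.
			core.Asks = append(core.Asks, p.Name)
		}
	}

	// Step 3: process node and pod changes seen during loading
	// (left out of this sketch).

	// Step 4: only now start scheduling.
	core.Scheduling = true
}

func main() {
	core := NewCore("root.default")
	InitialiseState(core, []string{"node-1"}, []Pod{
		{Name: "pod-a", AppID: "app-1", Queue: "root.default", Running: true},
		{Name: "pod-b", AppID: "app-2", Queue: "root.missing", Running: true},
		{Name: "pod-c", AppID: "app-1", Queue: "root.default", Running: false},
	})
	fmt.Println(core.Apps["app-2"], len(core.Allocations), len(core.Asks), core.Scheduling)
}
```

In this sketch, node loading and pod loading are separate phases, so a failure while handling one pod cannot affect node registration, and scheduling starts only after every step has completed.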



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@yunikorn.apache.org
For additional commands, e-mail: issues-help@yunikorn.apache.org