Posted to dev@yunikorn.apache.org by "Peter Bacsko (Jira)" <ji...@apache.org> on 2023/02/23 12:42:00 UTC

[jira] [Created] (YUNIKORN-1597) Gang scheduling: application might not transition to Running after recovery

Peter Bacsko created YUNIKORN-1597:
--------------------------------------

             Summary: Gang scheduling: application might not transition to Running after recovery
                 Key: YUNIKORN-1597
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1597
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: shim - kubernetes
            Reporter: Peter Bacsko
            Assignee: Peter Bacsko


Pods get stuck in a certain recovery scenario which involves gang scheduling.

High-level overview:
1. All placeholders are running and allocated
2. The real pod is in Pending state
3. YuniKorn crashes and recovers

In this case, the real pod never transitions to Running, because:
1. Upon recovery, the state of each recovered task is set to "Allocated", not "Bound".
2. If the placeholder tasks are already running and allocated, {{postTaskBound()}} is never called.
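
The two points above can be illustrated with a minimal Go sketch. This is a toy model, not the shim's actual code: the types App and Task, the functions Bind and Recover, and the Ready check are all hypothetical names introduced for illustration. Only the state names "Allocated" and "Bound" and the hook name postTaskBound() come from the report. The sketch assumes the hook only runs as part of the transition to Bound, so tasks that are restored directly into Allocated never trigger it.

```go
package main

import "fmt"

// TaskState is a simplified, hypothetical stand-in for the shim's task states.
type TaskState string

const (
	Allocated TaskState = "Allocated"
	Bound     TaskState = "Bound"
)

// App is a toy application that counts how many placeholders reported "bound".
type App struct {
	placeholders int // number of placeholders in the gang
	boundSeen    int // how many times the bound hook fired
}

// Ready reports whether every placeholder has gone through the bound hook.
func (a *App) Ready() bool { return a.boundSeen >= a.placeholders }

// Task is a toy placeholder task.
type Task struct {
	app   *App
	state TaskState
}

// postTaskBound stands in for the hook named in the report (assumption:
// it only runs as part of the transition to Bound).
func (t *Task) postTaskBound() { t.app.boundSeen++ }

// Bind performs the normal Allocated -> Bound transition and fires the hook.
func (t *Task) Bind() {
	t.state = Bound
	t.postTaskBound()
}

// Recover models recovery as described in the report: the task is put
// directly into Allocated and the hook is not called.
func Recover(app *App) *Task {
	return &Task{app: app, state: Allocated}
}

func main() {
	// Normal run: each placeholder is bound, so the hook fires every time.
	normal := &App{placeholders: 2}
	for i := 0; i < 2; i++ {
		t := &Task{app: normal, state: Allocated}
		t.Bind()
	}
	fmt.Println("normal run ready:", normal.Ready())

	// After a crash: placeholders are restored as Allocated, no hook fires.
	recovered := &App{placeholders: 2}
	for i := 0; i < 2; i++ {
		Recover(recovered)
	}
	fmt.Println("after recovery ready:", recovered.Ready())
}
```

In this model the recovered application never becomes ready, because nothing replays the bound notification for placeholders that were already running before the crash.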




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@yunikorn.apache.org
For additional commands, e-mail: dev-help@yunikorn.apache.org