You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@yunikorn.apache.org by "wilfred-s (via GitHub)" <gi...@apache.org> on 2023/02/27 07:05:14 UTC

[GitHub] [yunikorn-k8shim] wilfred-s commented on pull request #541: [YUNIKORN-1597] Gang scheduling: application might not transition to Running after recovery

wilfred-s commented on PR #541:
URL: https://github.com/apache/yunikorn-k8shim/pull/541#issuecomment-1445809081

   The comment in `postAppAccepted()` says that we skip reserving and directly go through to running under certain circumstances. It is the only point we generate a `TryReserve` event which could have us end up in `onReserving()` . 
   In the case we saw we had all placeholder pods and the originating Spark driver pod as part of the recovery. 
   
   In `OnReserving()` we create the placeholders. We thus would try to create all the placeholders from scratch. We would just log "placeholder exists" for most or not all of them. Since we could have crashed half way through creating the placeholders I think that it is OK to retry the creation part. Placeholder timeout should kick in correctly and cleanup anyway. 
   
   Back to our issue: the originator pod (driver) and 1 placeholder were not allocated. The change adds an extra call to make sure the application progresses. That only works if all placeholders are Bound to a node when the transition is made. I think that call is correct, but does not work by itself to fix the case we saw.
   The allocation of that last placeholder did not trigger the application change as expected. After 15 minutes all placeholders timed out and we left the originating driver just sitting there.
   The recovered tasks, all the placeholders, were initialised with the state `Allocated`. Unless they moved to a state `Bound` as part of the recovery the check in `onReservationStateChange()` would have failed. The change you stated as "more sound" is the main fix I think.
   
   Not sure if we can separately test that one change...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@yunikorn.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org