You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@yunikorn.apache.org by GitBox <gi...@apache.org> on 2020/06/24 22:05:42 UTC

[GitHub] [incubator-yunikorn-k8shim] yangwwei opened a new pull request #140: [YUNIKORN-251] Post recovery release a pod may disrupt running pods

yangwwei opened a new pull request #140:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/140


   I found this issue while testing recovery. It can be reproduced with the following steps:
   
   1. Create an application, it launches multiple pods, keeps them running
   2. Restart the scheduler, the scheduler will recover the allocations based on allocated pods
   3. App gets recovered, so as its pods
   4. Kill one of the pod
   
   Expectation: only one pod gets released and removed from this app. But I saw: all existing allocations are released.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] wilfred-s closed pull request #140: [YUNIKORN-251] Post recovery release a pod may disrupt running pods

Posted by GitBox <gi...@apache.org>.
wilfred-s closed pull request #140:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/140


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] wilfred-s commented on pull request #140: [YUNIKORN-251] Post recovery release a pod may disrupt running pods

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on pull request #140:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/140#issuecomment-649234974


   I think you have far more generic issue here: you should never set the UUID on a `si.allocation` in the shim. You cannot reuse the UUID. The core owns the value and it is never reused, not even on recovery.
   
   When the recovered allocation is returned as confirmed it will come back with a new UUID. Looking at the core code I do not see that the recovery handles communicating to the shim that the allocations have been recovered successfully.
   
   In the end that new UUID should be set as the allocation UUID. If you want to remove that allocation you need to use that new UUID. The old UUID most likely will not exist and the core will just ignore the request and say that it cannot find the requested allocation.
   
   Based on this analysis I do not think the provided solution is sufficient.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] wangdatan commented on pull request #140: [YUNIKORN-251] Post recovery release a pod may disrupt running pods

Posted by GitBox <gi...@apache.org>.
wangdatan commented on pull request #140:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/140#issuecomment-649100869


   @yangwwei , this looks bad. Is it possible to add a unit test to ensure this won't break in the future? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] yangwwei commented on pull request #140: [YUNIKORN-251] Post recovery release a pod may disrupt running pods

Posted by GitBox <gi...@apache.org>.
yangwwei commented on pull request #140:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/140#issuecomment-649149303


   > @yangwwei , this looks bad. Is it possible to add a unit test to ensure this won't break in the future?
   
   Yes. I've added the UT to cover this. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] wilfred-s commented on pull request #140: [YUNIKORN-251] Post recovery release a pod may disrupt running pods

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on pull request #140:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/140#issuecomment-649946220


   The lint checks fail locally when I run the checks, filed YUNIKORN-258 to fix that
   Can you check with the change from #144  and fix the lint issues?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] wilfred-s commented on pull request #140: [YUNIKORN-251] Post recovery release a pod may disrupt running pods

Posted by GitBox <gi...@apache.org>.
wilfred-s commented on pull request #140:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/140#issuecomment-649923800


   After discussion off-line the analysis above turned out to be correct and a new change is needed on the core side to support this fix. apache/incubator-yunikorn-core#179 is needed to fix the core side to allow this to properly work.
   The changes on the shim side are still needed to fix the difference in behaviour around the `AllocationKey`. 
   
   The fact that `pod.name` can be empty means we need to be really careful using that in any communication between shim and core


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-yunikorn-k8shim] yangwwei commented on pull request #140: [YUNIKORN-251] Post recovery release a pod may disrupt running pods

Posted by GitBox <gi...@apache.org>.
yangwwei commented on pull request #140:
URL: https://github.com/apache/incubator-yunikorn-k8shim/pull/140#issuecomment-649279353


   Hi @wilfred-s 
   
   Please see the workflow for recovery http://yunikorn.apache.org/docs/next/design/resilience.
   
   During recovery, `si.Allocation` will be sent by the shim, the `UUID` will be set the same as the podUID. This is because the core doesn't need to really schedule this request again, it assumes the pod already scheduled onto a node, so it won't generate another `UUID`. 
   
   After recovery, the shim will use the same `UUID` to release the request from the core. This is the same `UUID` sent by shim's `UpdateRequest->NewNodeInfo->ExistingAllocations->Allocation->UUID`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org