Posted to issues@yunikorn.apache.org by "Weiwei Yang (Jira)" <ji...@apache.org> on 2021/07/07 21:34:00 UTC

[jira] [Assigned] (YUNIKORN-741) Regression: occupied resources miscalculated sometimes for yunikorn pods

     [ https://issues.apache.org/jira/browse/YUNIKORN-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weiwei Yang reassigned YUNIKORN-741:
------------------------------------

    Assignee: Weiwei Yang

> Regression: occupied resources miscalculated sometimes for yunikorn pods
> ------------------------------------------------------------------------
>
>                 Key: YUNIKORN-741
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-741
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: shim - kubernetes
>            Reporter: Weiwei Yang
>            Assignee: Weiwei Yang
>            Priority: Major
>
> This is a regression caused by YUNIKORN-677. 
> YUNIKORN-677 changed how we decide whether a pod needs recovery: the check is now based on whether the pod has been allocated to a node (that is, pod.Spec.NodeName is set). Occupied resources follow a similar rule. However, the fix in YUNIKORN-677 changed the condition only for occupied resource recovery and left the node coordinator code (where we handle pod updates) using the old condition. This causes the following issue:
>  * During recovery, the scheduler sees that the scheduler pod is already allocated (pod.Spec.NodeName is set), so its occupied resources are reported to the core. Code: [https://github.com/apache/incubator-yunikorn-k8shim/blob/5658ce32f630d5ea75cea2772522a76ced30250a/pkg/cache/context_recovery.go#L113-L128].
>  * Once the scheduler has recovered, the pod informers are started and the node coordinator starts to run. In some cases, the node informer then notifies us of phase changes (from Pending to Running) for the scheduler pod and the admission-controller pod, and this triggers another occupied resource update for pods that were already counted during recovery. Code: [https://github.com/apache/incubator-yunikorn-k8shim/blob/5658ce32f630d5ea75cea2772522a76ced30250a/pkg/cache/node_coordinator.go#L74-L101].
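
The two steps above can be illustrated with a small, self-contained Go sketch. It is not the k8shim code: the Pod struct, the usage map, and the functions recoverOccupied, onUpdateOld and onUpdateAligned are hypothetical stand-ins that only model the two conditions described in this issue. The "aligned" variant shows one possible way to keep the two checks consistent and is not necessarily the fix that was adopted.

```go
package main

import "fmt"

// Pod is a hypothetical, trimmed-down stand-in for the pod fields
// this issue talks about. It is not the Kubernetes v1.Pod type.
type Pod struct {
	Name     string
	NodeName string // models pod.Spec.NodeName
	Phase    string // models the pod phase, e.g. "Pending" or "Running"
	MilliCPU int64  // resource request, CPU only to keep the sketch short
}

// recoverOccupied models recovery after YUNIKORN-677: every pod that is
// already assigned to a node (NodeName set) is counted as occupied.
func recoverOccupied(pods []Pod, usage map[string]int64) {
	for _, p := range pods {
		if p.NodeName != "" {
			usage[p.NodeName] += p.MilliCPU
		}
	}
}

// onUpdateOld models the old node coordinator condition: a phase change
// from Pending to Running adds the pod to the occupied resources.
func onUpdateOld(oldPod, newPod Pod, usage map[string]int64) {
	if oldPod.Phase == "Pending" && newPod.Phase == "Running" {
		usage[newPod.NodeName] += newPod.MilliCPU
	}
}

// onUpdateAligned is an assumed alternative that uses the same notion as
// recovery: count a pod only at the moment it becomes assigned to a node.
func onUpdateAligned(oldPod, newPod Pod, usage map[string]int64) {
	if oldPod.NodeName == "" && newPod.NodeName != "" {
		usage[newPod.NodeName] += newPod.MilliCPU
	}
}

func main() {
	// The scheduler pod is already assigned to a node but still Pending
	// when recovery runs.
	pending := Pod{Name: "yunikorn-scheduler", NodeName: "node-1", Phase: "Pending", MilliCPU: 1000}
	running := pending
	running.Phase = "Running"

	// Recovery counts the pod, then the old update check counts it again.
	oldStyle := map[string]int64{}
	recoverOccupied([]Pod{pending}, oldStyle)
	onUpdateOld(pending, running, oldStyle)
	fmt.Println("old update check:", oldStyle["node-1"]) // prints 2000

	// With a consistent condition the phase change is ignored.
	aligned := map[string]int64{}
	recoverOccupied([]Pod{pending}, aligned)
	onUpdateAligned(pending, running, aligned)
	fmt.Println("aligned update check:", aligned["node-1"]) // prints 1000
}
```

In this model the pod requests 1000 millicores, but the mismatched conditions report 2000 as occupied on the node, which matches the miscalculation described in the issue title.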



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@yunikorn.apache.org
For additional commands, e-mail: issues-help@yunikorn.apache.org