Posted to issues@yunikorn.apache.org by "Craig Condit (Jira)" <ji...@apache.org> on 2024/01/19 18:55:00 UTC

[jira] [Resolved] (YUNIKORN-2327) Race condition during update Occupied Resource from Shim to Core

     [ https://issues.apache.org/jira/browse/YUNIKORN-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Condit resolved YUNIKORN-2327.
------------------------------------
     Fix Version/s: 1.5.0
    Target Version: 1.5.0
        Resolution: Delivered

> Race condition during update Occupied Resource from Shim to Core
> ----------------------------------------------------------------
>
>                 Key: YUNIKORN-2327
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-2327
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - common, shim - kubernetes
>            Reporter: Yu-Lin Chen
>            Assignee: Yu-Lin Chen
>            Priority: Major
>             Fix For: 1.5.0
>
>         Attachments: Verify_basic_preemption_k8sClusterInfo.txt, Verify_basic_preemption_ykContainerLog.txt, Verify_basic_preemption_ykFullStateDump.json
>
>
> When YuniKorn initializes, existing non-YuniKorn pods (ForeignPods) are counted as the node's occupied resources. A SchedulerAPI.UpdateNode(request) call is triggered asynchronously to update the node's occupied resources in the core. However, a race condition occurs on the core side during this asynchronous update: updates can be applied in a different order from the one in which the shim sent them, so a stale value can overwrite a newer one.
> {*}How to reproduce{*}:
>  - Add a 2-second delay to the update for the first pod. After YuniKorn restarts, the final occupied resource equals the first pod's resource size. ([example|https://github.com/apache/yunikorn-core/compare/master...chenyulin0719:yunikorn-core:YUNIKORN-2313-ADD-2-SECOND-DELAY#diff-3bd07740ee12121844b14ddafec10a36332fe2cd80421174110edf042f780e23R397-R399])
>  - This issue is the root cause of YUNIKORN-2313.
> {*}Error Logs{*}: ([E2E test link-v1.29.0|https://github.com/chenyulin0719/yunikorn-k8shim/actions/runs/7530262471])
> Shim logs: (yk8s-worker)
> {code:json}
> 2024-01-15T14:39:20.135Z Shim trigger SchedulerAPI() request: occupied: resources:{key:"pods" value:{value:1}}
> 2024-01-15T14:39:20.135Z Shim trigger SchedulerAPI() request: occupied: resources:{key:"memory" value:{value:52428800}} resources:{key:"pods" value:{value:2}} resources:{key:"vcore" value:{value:100}}
> 2024-01-15T14:39:20.136Z Shim trigger SchedulerAPI() request: occupied: resources:{key:"memory" value:{value:576716800}} resources:{key:"pods" value:{value:3}} resources:{key:"vcore" value:{value:200}}
> {code}
> Core logs: (yk8s-worker)
> {code:json}
> 2024-01-15T14:39:20.137Z set occupiedResource: map[memory:52428800 pods:2 vcore:100]
> 2024-01-15T14:39:20.137Z set occupiedResource: map[memory:576716800 pods:3 vcore:200]
> 2024-01-15T14:39:22.136Z set occupiedResource: map[pods:1]
> {code}
> {*}Final occupied resource in state dump{*}:
> {code:json}
> ...
>         {
>           "nodeID": "yk8s-worker",
>           "attributes": {
>             "ready": "true",
>             "si.io/hostname": "yk8s-worker",
>             "si.io/rackname": "/rack-default",
>             "si/node-partition": "[mycluster]default"
>           },
>            ...
>           "occupied": {
>             "pods": 1
>           }...
>         }
> {code}
> Key code for this issue:
> * (Go routine) https://github.com/apache/yunikorn-core/blob/master/pkg/rmproxy/rmproxy.go#L378-L389 
> * (Set function) https://github.com/apache/yunikorn-core/blob/master/pkg/scheduler/objects/node.go#L195-L197 


