You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@yunikorn.apache.org by "Yu-Lin Chen (Jira)" <ji...@apache.org> on 2023/09/25 12:49:00 UTC

[jira] [Created] (YUNIKORN-1998) Stale AdmissionControllerConf was used in e2e test

Yu-Lin Chen created YUNIKORN-1998:
-------------------------------------

             Summary: Stale AdmissionControllerConf was used in e2e test
                 Key: YUNIKORN-1998
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1998
             Project: Apache YuniKorn
          Issue Type: Bug
          Components: test - e2e
            Reporter: Yu-Lin Chen


In e2e test for "user_group_limit", we updated Yunikorn ConfigMap before submit sleep pods.  [https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/user_group_limit/user_group_limit_test.go#L67-L70]

However, the AdmissionControllerConf in Admission controller doesn't immediately reflect the changes. The time gap leads following error in CI flow:  [https://github.com/apache/yunikorn-k8shim/actions/runs/6282144385/job/17062987755?pr=677#step:5:2346]


We need to find a way to ensure AdmissionControllerConf has been updated before e2e test submits a new pods.


This issue can be reproduced if we introduce 1 second when the admission controller updates the AdmissionControllerConf. Here is the example code.
{code:java}
func (h *configMapUpdateHandler) OnUpdate(_, newObj interface{}) {
    cm := utils.Convert2ConfigMap(newObj)
    
    // sleep 1 to deply AdmissionControllerConf update
    time.Sleep(1 * time.Second)
    
    if idx, ok := h.configMapIndex(cm); ok {
        h.conf.configUpdated(idx, cm)
    }
} {code}
[https://github.com/apache/yunikorn-k8shim/blob/master/pkg/admission/conf/am_conf.go#L237-L238]




Below are the before/after AM logs when we added 1 sec delay:



Success without delay 1 sec:
 * (AM configMapUpdateHandler)             05:36:39.184Z.  AdmissionControllerConf trying to upgrade config
 * (AM configMapUpdateHandler)             05:36:39.185Z . AdmissionControllerConf config upgraded
 * (AM Webhook)                                       05:36:39.218Z.  AM receive AdmissionReview request
 * (AM Webhook)                                       05:36:39.221Z.  AM check Pods with new config,  E2E Test Passed 

Failed with delay 1 sec:
 * (AM configMapUpdateHandler)            08:19:31.025Z.   AdmissionControllerConf trying to upgrade config
 * (AM Webhook)                                      08:19:31.067Z.   AM receive AdmissionReview request
 * (AM Webhook)                                      08:19:31.069Z    {color:#FF0000}*AM check Pods with stale config*{color}, {color:#FF0000}*E2E Test failed* {color}
 * (AM configMapUpdateHandler)            08:19:32.026Z.  AdmissionControllerConf config upgraded

In my kind cluster(v1.24.15), there is only 0.033 sec gap.  It's possible that the admission controller in CI flow runs those steps in different order.



The possible Solution:
1. In the e2e test, sleep for 1 seconds between updating the configmap and submitting new sleep pods. (It's quick fix, I assumed the time gap is always less than 1 sec ) 
2. In the e2e test, check current AdmissionControllerConf's value before submit a new pod. (How do client dump current AdmissionControllerConf? Need to seek for more advice.)

Since this issue only impact e2e test for now, we can go with solution #1 as a quick fix.  But it'll be better if we allow client to check whether AdmissionControllerConf is up-to-date.

Please kindly let me know if I have any misunderstandings. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@yunikorn.apache.org
For additional commands, e-mail: issues-help@yunikorn.apache.org