You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@yunikorn.apache.org by "Craig Condit (Jira)" <ji...@apache.org> on 2023/09/27 21:44:00 UTC

[jira] [Commented] (YUNIKORN-1998) Stale AdmissionControllerConf was used in e2e test

    [ https://issues.apache.org/jira/browse/YUNIKORN-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769826#comment-17769826 ] 

Craig Condit commented on YUNIKORN-1998:
----------------------------------------

There's no magic to a 1 second sleep. Even that may not be enough, but probably masks the issue a bit more often. Updates happen asynchronously, so there's no guarantee that new configuration has been applied. I'd probably go with at least 5 seconds personally, this could fail on a slow enough cluster.

There's no way to dump out the current AM conf (nor do I believe we need to add it just for this). You could grab the scheduler conf, which is the same one, but since they run in separate pods there's no guarantee that both have been updated.

> Stale AdmissionControllerConf was used in e2e test
> --------------------------------------------------
>
>                 Key: YUNIKORN-1998
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1998
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: test - e2e
>            Reporter: Yu-Lin Chen
>            Assignee: Yu-Lin Chen
>            Priority: Minor
>
> In e2e test for "user_group_limit", we updated Yunikorn ConfigMap before submit sleep pods.  [https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/user_group_limit/user_group_limit_test.go#L67-L70]
> However, the AdmissionControllerConf in Admission controller doesn't immediately reflect the changes. The time gap leads following error in CI flow:  [https://github.com/apache/yunikorn-k8shim/actions/runs/6282144385/job/17062987755?pr=677#step:5:2346]
> We need to find a way to ensure AdmissionControllerConf has been updated before e2e test submits a new pods.
> This issue can be reproduced if we introduce 1 second delay when the admission controller updates the AdmissionControllerConf. Here is the example code.
> {code:java}
> func (h *configMapUpdateHandler) OnUpdate(_, newObj interface{}) {
>     cm := utils.Convert2ConfigMap(newObj)
>     
>     // sleep 1 to delay AdmissionControllerConf update
>     time.Sleep(1 * time.Second)
>     
>     if idx, ok := h.configMapIndex(cm); ok {
>         h.conf.configUpdated(idx, cm)
>     }
> } {code}
> [https://github.com/apache/yunikorn-k8shim/blob/master/pkg/admission/conf/am_conf.go#L237-L238]
> Below are the before/after AM logs when we added 1 sec delay:
> Success without delay 1 sec:
>  * (AM configMapUpdateHandler)             05:36:39.184Z.  AdmissionControllerConf trying to upgrade config
>  * (AM configMapUpdateHandler)             {color:#de350b}05:36:39.185Z{color} . AdmissionControllerConf config upgraded
>  * (AM Webhook)                                       05:36:39.218Z.  AM receive AdmissionReview request
>  * (AM Webhook)                                       {color:#de350b}05:36:39.221Z{color}.  AM check Pods with new config,  E2E Test Passed 
> Failed with delay 1 sec:
>  * (AM configMapUpdateHandler)            08:19:31.025Z.   AdmissionControllerConf trying to upgrade config
>  * (AM Webhook)                                      08:19:31.067Z.   AM receive AdmissionReview request
>  * (AM Webhook)                                      08:19:31.069Z    {color:#ff0000}*AM check Pods with stale config*{color}, {color:#ff0000}*E2E Test failed* {color}
>  * (AM configMapUpdateHandler)            08:19:32.026Z.  AdmissionControllerConf config upgraded
> In my kind cluster(v1.24.15), there is only 0.036 sec gap ({color:#de350b}05:36:39.185Z~ {color}{color:#de350b}05:36:39.221Z{color}).  It's possible that the admission controller in CI flow runs those steps in different order.
> The possible Solution:
> 1. In the e2e test, sleep for 1 seconds between updating the configmap and submitting new sleep pods. (It's quick fix, I assumed the time gap is always less than 1 sec ) 
> 2. In the e2e test, check current AdmissionControllerConf's value before submit a new pod. (How do client dump current AdmissionControllerConf? Need to seek for more advice.)
> Since this issue only impact e2e test for now, we can go with solution #1 as a quick fix.  But it'll be better if we allow client to check whether AdmissionControllerConf is up-to-date.
> Please kindly let me know if I have any misunderstandings. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@yunikorn.apache.org
For additional commands, e-mail: issues-help@yunikorn.apache.org