Posted to issues@yunikorn.apache.org by "Craig Condit (Jira)" <ji...@apache.org> on 2024/01/03 21:39:00 UTC
[jira] [Resolved] (YUNIKORN-2292) Flaky E2E Test: Orphan pods still exist after TearDownNamespace()
[ https://issues.apache.org/jira/browse/YUNIKORN-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Craig Condit resolved YUNIKORN-2292.
------------------------------------
Fix Version/s: 1.5.0
Target Version: 1.5.0
Resolution: Fixed
Merged to master. Thanks [~Yu-Lin Chen] for the contribution.
> Flaky E2E Test: Orphan pods still exist after TearDownNamespace()
> -----------------------------------------------------------------
>
> Key: YUNIKORN-2292
> URL: https://issues.apache.org/jira/browse/YUNIKORN-2292
> Project: Apache YuniKorn
> Issue Type: Sub-task
> Components: test - e2e
> Reporter: Yu-Lin Chen
> Assignee: Yu-Lin Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.5.0
>
>
> The current [TearDownNamespace()|https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/framework/helpers/k8s/k8s_utils.go#L393] function deletes pods before deleting the namespace. However, a Job may still be alive and can recreate the deleted pods. The incomplete cleanup can cause other e2e tests to fail.
> For example: This failed E2E test ([PR #742|https://github.com/apache/yunikorn-k8shim/actions/runs/7078108545/job/19271866521?pr=742#step:6:3952]) was caused by an orphan pod, which was created by a Job submitted in [recovery_and_restart_test.go#L339|https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/recovery_and_restart/recovery_and_restart_test.go#L339].
> {*}Time series{*}:
> * 2023-12-04T04:39:38.4083946Z Submit gang job. (recovery_and_restart)
> * 2023-12-04T04:39:46.8767068Z {color:#de350b}Tear down namespace{color}: devjyy8x (Delete pod → Delete Namespace)
> * 2023-12-04T04:39:47.0512900Z Tear down namespace completed ({color:#de350b}One orphan remains{color})
> * 2023-12-04T04:39:50.5110355Z Restart the scheduler pod. (resource_fairness)
> * 2023-12-04T04:39:52.178Z ERROR Failed to add application to partition (placement rejected)
> → {color:#de350b}The orphan pod led to node removal after scheduler recovery.{color}
> * 2023-12-04T04:41:35.9084137Z Test failed. (simple_preemptor)
> * 2023-12-04T04:41:36.3406477Z Cluster dump output shows 1 orphan pod with age 110s ({color:#de350b}The pod was recreated when running TearDownNamespace(){color})
> {*}Solution{*}:
> In addition to calling deletePod, we should also check for and delete Jobs in [TearDownNamespace()|https://github.com/apache/yunikorn-k8shim/blob/master/test/e2e/framework/helpers/k8s/k8s_utils.go#L393].
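The race described above can be illustrated with a small, self-contained Go sketch. It is not the yunikorn-k8shim code and does not use the Kubernetes API: `fakeCluster`, `reconcile`, `tearDownPodsOnly` and `tearDownJobsFirst` are hypothetical names for a toy in-memory model in which a live Job tops up its pods whenever they are deleted. It also assumes the fix removes Jobs before pods; the issue only states that Jobs should be deleted in addition to pods.

```go
package main

import "fmt"

// fakeCluster is a toy stand-in for one namespace (an assumption for
// illustration only, not a Kubernetes client).
type fakeCluster struct {
	jobs map[string]int    // job name -> desired pod count
	pods map[string]string // pod name -> owning job
	seq  int               // counter used to build unique pod names
}

func newFakeCluster() *fakeCluster {
	return &fakeCluster{jobs: map[string]int{}, pods: map[string]string{}}
}

// reconcile mimics a Job controller: every live Job gets its pods topped up.
func (c *fakeCluster) reconcile() {
	for job, want := range c.jobs {
		have := 0
		for _, owner := range c.pods {
			if owner == job {
				have++
			}
		}
		for ; have < want; have++ {
			c.seq++
			c.pods[fmt.Sprintf("%s-%d", job, c.seq)] = job
		}
	}
}

func (c *fakeCluster) createJob(name string, podCount int) {
	c.jobs[name] = podCount
	c.reconcile()
}

// deleteAllPods removes every pod, then lets the controller react,
// which is the window in which an orphan pod can appear.
func (c *fakeCluster) deleteAllPods() {
	c.pods = map[string]string{}
	c.reconcile()
}

func (c *fakeCluster) deleteAllJobs() {
	c.jobs = map[string]int{}
}

// tearDownPodsOnly models the behaviour reported in the issue.
func tearDownPodsOnly(c *fakeCluster) int {
	c.deleteAllPods()
	return len(c.pods)
}

// tearDownJobsFirst models the proposed fix (ordering is an assumption).
func tearDownJobsFirst(c *fakeCluster) int {
	c.deleteAllJobs()
	c.deleteAllPods()
	return len(c.pods)
}

func main() {
	a := newFakeCluster()
	a.createJob("gang-job", 1)
	fmt.Println("pods only, leftover pods:", tearDownPodsOnly(a)) // 1

	b := newFakeCluster()
	b.createJob("gang-job", 1)
	fmt.Println("jobs first, leftover pods:", tearDownJobsFirst(b)) // 0
}
```

In this model the pods-only teardown leaves one recreated pod behind, while removing the Job first leaves none, which matches the orphan pod seen in the cluster dump.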
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@yunikorn.apache.org
For additional commands, e-mail: issues-help@yunikorn.apache.org