Posted to issues@yunikorn.apache.org by "Rainie Li (Jira)" <ji...@apache.org> on 2023/09/14 21:36:00 UTC
[jira] [Updated] (YUNIKORN-1988) Preemption happens when a queue lower than its guaranteed capacity

     [ https://issues.apache.org/jira/browse/YUNIKORN-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rainie Li updated YUNIKORN-1988:
--------------------------------
    Description: 
*Background:* 
We set tier-based priorityClass values and run YuniKorn 1.3 with the admission controller in production (our prod cluster has hundreds of EKS nodes). 
Many production tier2 jobs were preempted unexpectedly. The application logs showed that all of the driver pods had been shut down.

Most of the failed jobs came from the same queue. The preempted queue has 300G of guaranteed memory, and each driver pod requires 24G of memory. We disabled the preemption feature in production to mitigate the issue.

*Investigation:* 

We reproduced the issue in a dev environment: preemption can happen when a queue's usage is below its guaranteed capacity. 
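
To make the expected behavior concrete, here is a minimal sketch of the rule this report assumes: a pod should not be preempted if removing it would leave its queue below the queue's guaranteed resources. This is not YuniKorn's code. The function name is hypothetical, only the 300G and 24G figures come from this report, and the usage values are made up for illustration.

```python
# Hypothetical illustration only; not taken from the YuniKorn code base.
# All values are memory amounts in G, matching the units used in this report.

def may_preempt_from(queue_used: int, queue_guaranteed: int, victim_size: int) -> bool:
    """Return True only if the queue stays at or above its guaranteed
    capacity after the victim pod is removed (assumed rule)."""
    return queue_used - victim_size >= queue_guaranteed


GUARANTEED = 300  # guaranteed memory of the affected queue (from the report)
DRIVER = 24       # memory required by each driver pod (from the report)

# Assumed usage of 12 drivers (288G): the queue is already below its
# guarantee, so no driver pod in it should be selected as a victim.
below = may_preempt_from(12 * DRIVER, GUARANTEED, DRIVER)

# Assumed usage of 14 drivers (336G): removing one driver leaves 312G,
# which is still above the guarantee, so preemption would be acceptable.
above = may_preempt_from(14 * DRIVER, GUARANTEED, DRIVER)

print(below, above)
```

The behavior seen in production corresponds to the first case: driver pods were preempted even though the check above would reject them.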

I am investigating how to fix the issue. 

  was:
*Background:* 
We set tier based priorityClass and using 1.3 with Admission controller in production (our prod cluster has hundreds of EKS nodes). 
Many production tier2 jobs got preempted unexpectedly. From application log, we saw driver pods all got shutdown.

Most failed jobs were from the same queue, we set 300G as guaranteed memory for queue that got preempted, all driver pods required 24G memory. We disabled preemption feature in production to mitigate the issue.

*Investigation:* 

Reproduced the issue on dev env, preemption can happen when a queue lower than its guaranteed capacity 

I am investigating how to fix the issue. 


> Preemption happens when a queue lower than its guaranteed capacity 
> -------------------------------------------------------------------
>
>                 Key: YUNIKORN-1988
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1988
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>            Reporter: Rainie Li
>            Assignee: Rainie Li
>            Priority: Critical
>
> *Background:* 
> We set tier-based priorityClass values and run YuniKorn 1.3 with the admission controller in production (our prod cluster has hundreds of EKS nodes). 
> Many production tier2 jobs were preempted unexpectedly. The application logs showed that all of the driver pods had been shut down.
> Most of the failed jobs came from the same queue. The preempted queue has 300G of guaranteed memory, and each driver pod requires 24G of memory. We disabled the preemption feature in production to mitigate the issue.
> *Investigation:* 
> We reproduced the issue in a dev environment: preemption can happen when a queue's usage is below its guaranteed capacity. 
> I am investigating how to fix the issue. 


