Posted to issues@yunikorn.apache.org by "Eli Schiff (Jira)" <ji...@apache.org> on 2023/03/03 14:50:00 UTC

[jira] [Updated] (YUNIKORN-1616) Terminating scheduler pods still actively scheduling when replacement pod launches

     [ https://issues.apache.org/jira/browse/YUNIKORN-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eli Schiff updated YUNIKORN-1616:
---------------------------------
    Description: 
If a YuniKorn scheduler pod gets shut down for any reason (e.g. manually deleted), the pod goes into a Terminating state and takes maybe 30 seconds to fully shut down. However, as soon as the pod enters that Terminating state, the ReplicaSet behind the k8s Deployment immediately creates a replacement pod. This can cause race conditions where both pods are actively scheduling for a short period of time.

I have noticed errors like `failed to create placeholder pod {"error": "pods \"tg-spark-executor-abcdefg-0\" already exists"}`, caused by both scheduler pods attempting to create the same placeholder pod at once. I believe this has also caused pods to get stuck in Pending when they should have been scheduled.

 

Some more context: https://yunikornworkspace.slack.com/archives/CLNUW68MU/p1677764993552089

 

There is currently upstream discussion about adding a way to tell k8s Deployments not to start a new pod before the old pod has fully shut down: [https://github.com/kubernetes/kubernetes/issues/115844]

 

In the meantime, the solution seems to be to switch to a StatefulSet.

[https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment] 

> *Note:* This will only guarantee Pod termination previous to creation for upgrades. If you upgrade a Deployment, all Pods of the old revision will be terminated immediately. Successful removal is awaited before any Pod of the new revision is created. If you manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the old Pod is still in a Terminating state). If you need an "at most" guarantee for your Pods, you should consider using a [StatefulSet|https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/].
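
For reference, this is a minimal sketch of what the Recreate strategy would look like on the scheduler Deployment (the names and image tag below are illustrative, not taken from the YuniKorn helm chart). As the note above says, this only prevents overlap during upgrades of the Deployment, not when a pod is deleted manually:

{code:yaml}
# Hypothetical Deployment snippet: Recreate tears the old pod down before
# creating the new one, but only for rollouts/upgrades of the Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: yunikorn-scheduler          # illustrative name
spec:
  replicas: 1
  strategy:
    type: Recreate                  # default would be RollingUpdate
  selector:
    matchLabels:
      app: yunikorn-scheduler
  template:
    metadata:
      labels:
        app: yunikorn-scheduler
    spec:
      containers:
        - name: yunikorn-scheduler
          image: apache/yunikorn:scheduler-latest   # placeholder image tag
{code}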

 

From what I can tell, switching to a StatefulSet here would be a pretty smooth transition, but I am not sure whether there are wider issues or implications to this change that I am not aware of.
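
To make the suggestion concrete, a rough sketch of the scheduler workload as a StatefulSet could look like the following (resource names, service account and image tag are illustrative, not copied from the helm chart). With a single replica, the StatefulSet controller will not recreate the pod until the old one is completely gone:

{code:yaml}
# Hypothetical StatefulSet sketch: the single pod is named
# yunikorn-scheduler-0, and the controller waits for it to terminate
# fully before creating a replacement, giving an at-most-one guarantee.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: yunikorn-scheduler              # illustrative name
spec:
  replicas: 1
  serviceName: yunikorn-scheduler       # headless Service backing the pod's stable identity
  selector:
    matchLabels:
      app: yunikorn-scheduler
  template:
    metadata:
      labels:
        app: yunikorn-scheduler
    spec:
      serviceAccountName: yunikorn-admin          # placeholder service account
      containers:
        - name: yunikorn-scheduler
          image: apache/yunikorn:scheduler-latest   # placeholder image tag
{code}

The main thing to double-check would be the helm chart migration, since moving from a Deployment to a StatefulSet means deleting the old workload object and creating a new one rather than updating it in place.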



> Terminating scheduler pods still actively scheduling when replacement pod launches
> ----------------------------------------------------------------------------------
>
>                 Key: YUNIKORN-1616
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1616
>             Project: Apache YuniKorn
>          Issue Type: Bug
>            Reporter: Eli Schiff
>            Priority: Minor
>


