Posted to dev@yunikorn.apache.org by "Eli Schiff (Jira)" <ji...@apache.org> on 2023/03/03 14:41:00 UTC

[jira] [Created] (YUNIKORN-1616) Terminating scheduler pods still actively scheduling when replacement pod launches

Eli Schiff created YUNIKORN-1616:
------------------------------------

             Summary: Terminating scheduler pods still actively scheduling when replacement pod launches
                 Key: YUNIKORN-1616
                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1616
             Project: Apache YuniKorn
          Issue Type: Bug
            Reporter: Eli Schiff
            Assignee: Eli Schiff


If a YuniKorn scheduler pod is shut down for any reason (e.g. manually deleted), the pod enters a Terminating state and takes roughly 30 seconds to fully shut down. However, as soon as the pod enters the Terminating state, the ReplicaSet behind the k8s Deployment immediately creates a replacement pod. This can cause race conditions where both pods are actively scheduling for a short period of time.

I have noticed errors like `failed to create placeholder pod {"error": "pods \"tg-spark-executor-abcdefg-0\" already exists"}` caused by both scheduler pods attempting to create the same placeholder pod at once. I believe this has also caused pods to get stuck pending when they should have been scheduled.

 

There is currently discussion about adding a way to tell k8s deployments to not allow new pods to start before the old pod is fully shut down. [https://github.com/kubernetes/kubernetes/issues/115844]

 

In the meantime the solution seems to be to switch to a StatefulSet.

[https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment] 

> *Note:* This will only guarantee Pod termination previous to creation for upgrades. If you upgrade a Deployment, all Pods of the old revision will be terminated immediately. Successful removal is awaited before any Pod of the new revision is created. If you manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the old Pod is still in a Terminating state). If you need an "at most" guarantee for your Pods, you should consider using a [StatefulSet|https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/].

 

From what I can tell, switching to a StatefulSet here would be a fairly smooth transition, but I am not sure whether there are wider issues or implications to this change that I am not aware of.
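For reference, a minimal sketch of what such a StatefulSet might look like. The names, labels, service account, and image tag below are illustrative assumptions, not the actual YuniKorn chart values:

```yaml
# Hypothetical sketch: running the scheduler as a StatefulSet instead of a
# Deployment. A StatefulSet waits for yunikorn-scheduler-0 to terminate
# fully before creating its replacement, closing the overlap window where
# two scheduler pods run at once.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: yunikorn-scheduler
spec:
  serviceName: yunikorn-scheduler   # headless Service name (assumed)
  replicas: 1
  selector:
    matchLabels:
      app: yunikorn-scheduler
  template:
    metadata:
      labels:
        app: yunikorn-scheduler
    spec:
      serviceAccountName: yunikorn-admin        # assumed service account
      containers:
        - name: yunikorn-scheduler-k8s
          image: apache/yunikorn:scheduler-latest   # illustrative tag
```

Note that a StatefulSet also requires a (possibly headless) Service for the `serviceName` field, which a plain Deployment does not.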



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@yunikorn.apache.org
For additional commands, e-mail: dev-help@yunikorn.apache.org