Posted to issues@yunikorn.apache.org by "Manikandan R (Jira)" <ji...@apache.org> on 2022/03/18 16:50:00 UTC

[jira] [Commented] (YUNIKORN-1117) Use k8s Nodes condition to determine health of the node

    [ https://issues.apache.org/jira/browse/YUNIKORN-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508916#comment-17508916 ] 

Manikandan R commented on YUNIKORN-1117:
----------------------------------------

On digging deeper into this, I learned that it is better to depend on "Taints and Tolerations" instead of "Conditions", as the former has superseded the latter because of its flexibility (Namespaces etc). For each of the conditions documented above, k8s automatically creates a taint against that particular node. Each taint has an effect as well, such as NoSchedule or NoExecute. For example, for the "Ready == false" condition, the "node.kubernetes.io/not-ready" taint would be created against that node with the NoSchedule effect. NoSchedule means that the node should not be picked for scheduling.
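
To illustrate the condition-to-taint relationship, here is a minimal Go sketch of a lookup table. The taint keys are the well-known ones listed in the Kubernetes documentation; the {{conditionTaint}} struct and the {{taintKeyFor}} helper are hypothetical names used only for this example and are not part of k8s or YuniKorn.

```go
package main

import "fmt"

// conditionTaint pairs a node condition (type and status) with the taint key
// that Kubernetes adds for it. Hypothetical type, for illustration only.
type conditionTaint struct {
	Condition string
	Status    string
	TaintKey  string
}

// wellKnownConditionTaints lists the built-in condition taints described in
// the Kubernetes "Taints and Tolerations" documentation.
var wellKnownConditionTaints = []conditionTaint{
	{"Ready", "False", "node.kubernetes.io/not-ready"},
	{"Ready", "Unknown", "node.kubernetes.io/unreachable"},
	{"MemoryPressure", "True", "node.kubernetes.io/memory-pressure"},
	{"DiskPressure", "True", "node.kubernetes.io/disk-pressure"},
	{"PIDPressure", "True", "node.kubernetes.io/pid-pressure"},
	{"NetworkUnavailable", "True", "node.kubernetes.io/network-unavailable"},
}

// taintKeyFor returns the taint key for a condition/status pair, or false
// when the pair has no built-in taint (for example Ready == "True").
func taintKeyFor(condition, status string) (string, bool) {
	for _, ct := range wellKnownConditionTaints {
		if ct.Condition == condition && ct.Status == status {
			return ct.TaintKey, true
		}
	}
	return "", false
}

func main() {
	key, ok := taintKeyFor("Ready", "False")
	fmt.Println(key, ok)
}
```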

We can make the changes in phases, as described below:

Phase 1:

As part of handling the NodeUpdate event, the shim should be able to receive the {{Taints}} info from {{*v1.Node}} and pass it to the core through {{*si.NodeInfo}} to do two things: 1. Mark that node as "unschedulable", based on the taint effect, so that it doesn't get picked for scheduling. 2. Update metrics. To do this, we need to add a few more fields to the {{*si.NodeInfo}} message as well.
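
The effect check in Phase 1 could look roughly like the sketch below. It is only an illustration: the {{Taint}} and {{TaintEffect}} types are local stand-ins that mirror the shape of the k8s {{v1.Taint}} type so the example compiles on its own, and {{blocksScheduling}} is a hypothetical helper, not existing shim or core code. The new {{*si.NodeInfo}} fields are not shown because they have not been defined yet.

```go
package main

import "fmt"

// TaintEffect and Taint are local stand-ins for the k8s v1 types, so that
// this sketch has no external dependencies.
type TaintEffect string

const (
	TaintEffectNoSchedule       TaintEffect = "NoSchedule"
	TaintEffectPreferNoSchedule TaintEffect = "PreferNoSchedule"
	TaintEffectNoExecute        TaintEffect = "NoExecute"
)

type Taint struct {
	Key    string
	Value  string
	Effect TaintEffect
}

// blocksScheduling (hypothetical) reports whether any taint carries an effect
// that keeps non-tolerating pods off the node. PreferNoSchedule is only a
// soft preference, so it does not count.
func blocksScheduling(taints []Taint) bool {
	for _, t := range taints {
		if t.Effect == TaintEffectNoSchedule || t.Effect == TaintEffectNoExecute {
			return true
		}
	}
	return false
}

func main() {
	notReady := []Taint{{Key: "node.kubernetes.io/not-ready", Effect: TaintEffectNoSchedule}}
	fmt.Println("unschedulable:", blocksScheduling(notReady))
}
```

This version looks at the effect only. An implementation might also want to filter on the taint key, so that user or admin defined taints are not treated as a node health problem.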

Phase 2:

We can handle Tolerations in future phases if needed. Tolerations are all about how pods can tolerate these taints. Usually, pods don't need to tolerate the built-in (system) taints discussed above. A taint (user or admin defined) could be used, for example, to prevent pods that don't require a GPU from being scheduled on GPU nodes.
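
For reference, toleration matching as documented by Kubernetes can be sketched as follows. The types are again local stand-ins shaped like the k8s {{v1.Taint}} and {{v1.Toleration}} types, and the taint key "example.com/gpu" is a made-up key for the GPU case above. Nothing here is proposed YuniKorn code.

```go
package main

import "fmt"

// Local stand-ins for the k8s v1 taint and toleration types.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

type Toleration struct {
	Key      string
	Operator string // "Equal" (the default when empty) or "Exists"
	Value    string
	Effect   string // empty matches every effect
}

// tolerates follows the documented matching rules: the effect and key must
// match when set on the toleration, then "Exists" matches any value and
// "Equal" requires the same value.
func tolerates(tol Toleration, taint Taint) bool {
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	switch tol.Operator {
	case "", "Equal":
		return tol.Value == taint.Value
	case "Exists":
		return true
	default:
		return false
	}
}

func main() {
	// Hypothetical admin-defined taint that keeps non-GPU pods off GPU nodes.
	gpuTaint := Taint{Key: "example.com/gpu", Value: "true", Effect: "NoSchedule"}
	gpuPod := Toleration{Key: "example.com/gpu", Operator: "Exists", Effect: "NoSchedule"}
	fmt.Println("GPU pod tolerates taint:", tolerates(gpuPod, gpuTaint))
}
```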

> Use k8s Nodes condition to determine health of the node
> -------------------------------------------------------
>
>                 Key: YUNIKORN-1117
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-1117
>             Project: Apache YuniKorn
>          Issue Type: Improvement
>            Reporter: Manikandan R
>            Assignee: Manikandan R
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Among the multiple conditions discussed in [https://kubernetes.io/docs/concepts/architecture/nodes/#condition], only "Ready" has been used. We should use the other conditions as well to determine a generic {{isNodeHealthy}} factor and eventually pass it to the core as well.
> Please refer to the discussion [https://github.com/apache/incubator-yunikorn-k8shim/pull/380#issuecomment-1066328969] for more details.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)