Posted to issues@yunikorn.apache.org by "Adam Antal (Jira)" <ji...@apache.org> on 2020/04/24 13:11:00 UTC

[jira] [Commented] (YUNIKORN-42) Better to support POD events for YuniKorn to troubleshoot allocation failures

    [ https://issues.apache.org/jira/browse/YUNIKORN-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091556#comment-17091556 ] 

Adam Antal commented on YUNIKORN-42:
------------------------------------

[~wwei] thanks for assigning this to me. I know that you are all busy with the release, so I wrote down a few thoughts that you can respond to when you have time.

I recently went through the design doc and browsed the Scheduler Interface (SI) to get more insight into the purpose of this jira. I think [~wangda]'s original implementation plan is a good one; I would make a few suggestions based on my and [~wilfreds]'s previous comments.

I would like to approach the pod, application, queue and node cases separately.
For *pods* it's logical to pass these events from the scheduler to the shim, and the shim can then emit them to the k8s event system. End users can then {{kubectl describe}} the pending pod to see any errors that the scheduler emitted. I'd like to change how this is done, though.
Because new pods are requested through {{AllocationAsk}} in {{UpdateRequest}}, the proposed {{DiagnosticInformation}} in {{UpdateResponse}} is too broad for this purpose. I'd put it into {{RejectedAllocationAsk}}, but as far as I can see we already have a reason string there that describes the rejection. Could we perhaps leverage that?
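
To make the idea concrete, here is a minimal sketch of how the shim could turn the existing reason string into a pod event. Everything in it is an assumption for illustration: {{RejectedAsk}} is a simplified stand-in for the SI's {{RejectedAllocationAsk}}, while {{PodEvent}}, {{toPodEvent}} and the {{AllocationRejected}} reason are hypothetical names, and the real shim would hand the result to the k8s event recorder instead of printing it.

```go
package main

import "fmt"

// RejectedAsk is a simplified, hypothetical stand-in for the SI's
// RejectedAllocationAsk; only the fields needed for the sketch are modelled.
type RejectedAsk struct {
	AllocationKey string // identifies the ask, assumed here to map to one pod
	ApplicationID string
	Reason        string // the existing free-text rejection reason
}

// PodEvent is a hypothetical, minimal model of what the shim would pass on
// to the k8s event system for a pod.
type PodEvent struct {
	PodKey  string
	Type    string // "Warning" is one of the standard k8s event types
	Reason  string
	Message string
}

// toPodEvent converts a rejected ask into a pod-level warning event,
// reusing the reason string instead of a new diagnostics field.
func toPodEvent(r RejectedAsk) PodEvent {
	msg := r.Reason
	if msg == "" {
		msg = "allocation ask rejected (no reason given by the scheduler)"
	}
	return PodEvent{
		PodKey:  r.AllocationKey,
		Type:    "Warning",
		Reason:  "AllocationRejected", // hypothetical event reason
		Message: fmt.Sprintf("application %s: %s", r.ApplicationID, msg),
	}
}

func main() {
	ev := toPodEvent(RejectedAsk{
		AllocationKey: "pod-uid-1",
		ApplicationID: "app-1",
		Reason:        "queue quota exceeded",
	})
	fmt.Printf("%s %s %s: %s\n", ev.Type, ev.Reason, ev.PodKey, ev.Message)
}
```

With something like this, no new field would be needed in {{UpdateResponse}} for the pod case; the shim only has to map each rejected ask back to its pod.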

Since *nodes* are also ResourceManager-dependent objects, I'd do something similar for emitting node-related events. Searching the SI, I found the {{AcceptedNode}} and {{RejectedNode}} objects. Can we also use these for the event system?

*Queues* are scheduler-level concepts, so these should not be passed along through the SI.

With regard to *applications*: I have the impression that applications are RM-level concepts because they are included in the SI protocol. That said, we also have to provide some diagnostics at that level, but there is no such utility as {{kubectl describe application}} on the k8s side, so the question is: do we really need to do that?
One idea I could think of is that the shim could also create CRDs that represent applications, and those objects could be the target of these events. This would obviously be handled by the shim, and could be synchronized with the state of Spark / other applications (so there is no need to communicate this to the scheduler continuously).
I see some advantage of these CRDs in contexts like work-preserving recovery (I am not aware of how this is currently handled in K8s): it would be pretty straightforward to just read back the CRDs when an RM has to resync its state.
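
As a rough sketch of the recovery idea only: assuming the shim can list the application objects it created, the resync could boil down to filtering them by state. {{AppRecord}}, {{isTerminal}}, {{appsToResync}} and the state names below are all hypothetical and not taken from any existing code.

```go
package main

import (
	"fmt"
	"sort"
)

// AppRecord is a hypothetical, flattened view of an application object
// that the shim would have stored as a custom resource.
type AppRecord struct {
	ID    string
	Queue string
	State string // state names used below are assumptions
}

// isTerminal reports whether an application no longer needs scheduling.
// Which states count as terminal is an assumption for this sketch.
func isTerminal(state string) bool {
	return state == "Completed" || state == "Failed"
}

// appsToResync returns the sorted IDs of the applications that the RM
// would have to register with the scheduler again after a restart.
func appsToResync(records []AppRecord) []string {
	ids := make([]string, 0, len(records))
	for _, r := range records {
		if !isTerminal(r.State) {
			ids = append(ids, r.ID)
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	records := []AppRecord{
		{ID: "app-3", Queue: "root.default", State: "Running"},
		{ID: "app-2", Queue: "root.default", State: "Completed"},
		{ID: "app-1", Queue: "root.batch", State: "Accepted"},
	}
	fmt.Println(appsToResync(records))
}
```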

As for the event cache in the scheduler component, I think [~wangda]'s proposal is good: we also need a way to approach the problem from the scheduler's perspective. I'd definitely like to keep that piece of the architecture.
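
For discussion, a scheduler-side event cache could be as small as a bounded, per-object list of recent events. The sketch below is only my assumption of the shape, not the design from the doc: {{Event}}, {{EventCache}}, {{NewEventCache}} and the per-object limit are hypothetical names and choices.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical scheduler-side event attached to one object
// (a pod, node, application or queue), identified by ObjectID.
type Event struct {
	ObjectID string
	Reason   string
	Message  string
}

// EventCache keeps at most `limit` recent events per object in memory.
type EventCache struct {
	mu     sync.Mutex
	limit  int
	events map[string][]Event
}

// NewEventCache creates a cache; limits below 1 are raised to 1.
func NewEventCache(limit int) *EventCache {
	if limit < 1 {
		limit = 1
	}
	return &EventCache{limit: limit, events: make(map[string][]Event)}
}

// Add stores an event and drops the oldest ones beyond the limit.
func (c *EventCache) Add(e Event) {
	c.mu.Lock()
	defer c.mu.Unlock()
	list := append(c.events[e.ObjectID], e)
	if len(list) > c.limit {
		// Copy the tail so the dropped events can be garbage collected.
		list = append([]Event(nil), list[len(list)-c.limit:]...)
	}
	c.events[e.ObjectID] = list
}

// Get returns a copy of the cached events for an object, oldest first.
func (c *EventCache) Get(objectID string) []Event {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make([]Event, len(c.events[objectID]))
	copy(out, c.events[objectID])
	return out
}

func main() {
	cache := NewEventCache(2)
	cache.Add(Event{ObjectID: "pod-1", Reason: "Rejected", Message: "first"})
	cache.Add(Event{ObjectID: "pod-1", Reason: "Rejected", Message: "second"})
	cache.Add(Event{ObjectID: "pod-1", Reason: "Rejected", Message: "third"})
	for _, e := range cache.Get("pod-1") {
		fmt.Println(e.ObjectID, e.Reason, e.Message)
	}
}
```

Keeping the cache bounded per object means a pod that is rejected repeatedly cannot grow the scheduler's memory without limit, at the cost of losing its oldest events.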

Please share your opinions on this. I will create an updated POC document with the things we discuss in this thread. I welcome your thoughts and constructive criticism.

> Better to support POD events for YuniKorn to troubleshoot allocation failures
> -----------------------------------------------------------------------------
>
>                 Key: YUNIKORN-42
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-42
>             Project: Apache YuniKorn
>          Issue Type: Task
>            Reporter: Wangda Tan
>            Assignee: Adam Antal
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently it is tricky to troubleshoot pod allocation; we need to better expose this information in the POD description.


