Posted to issues@yunikorn.apache.org by "Weiwei Yang (Jira)" <ji...@apache.org> on 2021/03/02 08:21:00 UTC

[jira] [Comment Edited] (YUNIKORN-460) Handle app reservation timeout

    [ https://issues.apache.org/jira/browse/YUNIKORN-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277760#comment-17277760 ] 

Weiwei Yang edited comment on YUNIKORN-460 at 3/2/21, 8:20 AM:
---------------------------------------------------------------

hi [~kmarton], your notes captured our discussion well. Please review the following cases and make sure the draft PR can handle all of them (except the retry part); a rough sketch of the cleanup logic follows the footnote below.

1. the scheduler tried, but none of the app's placeholders got allocated; all placeholder pods are still pending
after the timeout, all pending placeholders should be deleted
fail the app {color:red}[1]{color}

2. only some of the app's placeholders are allocated
this means there is no real pending ask yet
after the timeout, all allocated and pending placeholders should be deleted
fail the app

3. all of the app's placeholders are allocated, but no real pods were submitted
after the timeout, all allocated placeholders should be deleted
the app transitions to the Completed state

4. all of the app's placeholders are allocated, but only some of them get replaced (min gang member count > actual task count)
after the timeout, the app transitions to the Completed state
all remaining placeholders should be deleted

{color:red}[1]{color} Why fail the app? It means the scheduler tried to make the reservation for the app but could not complete it. When the app is failed, we can simply notify the shim about the app's state, and the shim can then release all placeholders accordingly. Note that the app's real pods will remain pending, so the client side needs to clean up the job. We can later build the "retry" logic on the shim side to re-submit the app, e.g. after a few minutes.
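
To make the four cases concrete, here is a minimal Go sketch of a timeout handler that applies the rules above. All names here (appSnapshot, onReservationTimeout, the counter fields) are hypothetical illustrations, not the actual yunikorn-core API.

{code:go}
package main

import "fmt"

// appSnapshot is a hypothetical view of an app's placeholders at the
// moment the reservation timeout fires; the real core tracks this state
// on the application object.
type appSnapshot struct {
	pendingPlaceholders   int // requested but never allocated (cases 1, 2)
	allocatedPlaceholders int // allocated but not yet replaced (cases 2, 3, 4)
}

// onReservationTimeout applies the rules from the four cases above:
// every placeholder that was not replaced by a real pod is released, and
// the app fails unless the full gang had been allocated, in which case it
// transitions to Completed.
func onReservationTimeout(app appSnapshot) string {
	// All cases: release every pending and allocated placeholder.
	toRelease := app.pendingPlaceholders + app.allocatedPlaceholders
	fmt.Printf("releasing %d placeholder pod(s)\n", toRelease)

	if app.pendingPlaceholders > 0 {
		// Cases 1 and 2: the gang never fully reserved; fail the app and
		// notify the shim so it can clean up (see footnote [1]).
		return "Failed"
	}
	// Cases 3 and 4: all placeholders were allocated; regardless of how
	// many were replaced by real pods, the app ends as Completed.
	return "Completed"
}

func main() {
	// Case 1: nothing allocated at all -> Failed.
	fmt.Println(onReservationTimeout(appSnapshot{pendingPlaceholders: 5}))
	// Case 4: all allocated, partially replaced -> Completed.
	fmt.Println(onReservationTimeout(appSnapshot{allocatedPlaceholders: 2}))
}
{code}

The key design point in the cases above is that the final state depends only on whether the gang was ever fully allocated: a partial reservation counts as a scheduling failure, while a full reservation that was never (or only partially) consumed is treated as normal completion.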


> Handle app reservation timeout
> ------------------------------
>
>                 Key: YUNIKORN-460
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-460
>             Project: Apache YuniKorn
>          Issue Type: Sub-task
>          Components: shim - kubernetes
>            Reporter: Weiwei Yang
>            Assignee: Kinga Marton
>            Priority: Major
>              Labels: pull-request-available
>
> When an app is configured with a timeout, that timeout determines the maximum time the app is permitted to stay in the Reserving phase. If the timeout expires, all existing placeholders should be deleted and the application will be scheduled normally. This timeout is needed because otherwise an app's partially allocated placeholders may occupy cluster resources that end up wasted (a rough sketch of the timer mechanism follows this description).
> See more in [this doc|https://docs.google.com/document/d/1P-g4plXIJ9Xybp-jyKySI18P3rkGQPuTutGYhv1LaQ8/edit#heading=h.ebk2htgnnrex]
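
To illustrate the timeout mechanism in the quoted description, here is a rough Go sketch of arming a reservation timer when an app enters the Reserving phase. The reservationTimer type and its methods are made up for illustration and are not the actual shim code.

{code:go}
package main

import (
	"fmt"
	"time"
)

// reservationTimer is an illustrative wrapper: arm a timer when an app
// enters the Reserving phase, stop it once the gang is fully reserved.
type reservationTimer struct {
	timer *time.Timer
}

// onEnterReserving starts the countdown; onTimeout would delete the
// remaining placeholders and move the app to its final state.
func (r *reservationTimer) onEnterReserving(appID string, timeout time.Duration, onTimeout func()) {
	r.timer = time.AfterFunc(timeout, func() {
		fmt.Printf("app %s: reservation timed out\n", appID)
		onTimeout()
	})
}

// onLeaveReserving cancels the countdown when the app leaves the
// Reserving phase in time; no cleanup is needed in that case.
func (r *reservationTimer) onLeaveReserving() {
	if r.timer != nil {
		r.timer.Stop()
	}
}

func main() {
	var rt reservationTimer
	rt.onEnterReserving("app-0001", 100*time.Millisecond, func() {
		fmt.Println("deleting placeholders, scheduling the app normally")
	})
	time.Sleep(200 * time.Millisecond) // let the timer fire in this demo
}
{code}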


