Posted to dev@slider.apache.org by "Gour Saha (JIRA)" <ji...@apache.org> on 2015/05/13 18:45:00 UTC

[jira] [Comment Edited] (SLIDER-479) Provide a slider command to kill all stranded containers continuing to run post stop command

    [ https://issues.apache.org/jira/browse/SLIDER-479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223206#comment-14223206 ] 

Gour Saha edited comment on SLIDER-479 at 5/13/15 4:44 PM:
-----------------------------------------------------------

That makes sense, but it might be tricky to define "long period of time", since for long-running services (specifically for applications demanding a lot of affinity or data locality) the agents are designed to keep running and wait for the failed AM to come back up. An application-specific, configurable time period can definitely be exposed. Would ephemeral nodes in ZK make sense to control the kill of agents?
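
To make the configurable time period concrete, here is a rough sketch of how an agent could decide when to stop waiting for the AM. It is only an illustration, not Slider's agent code: the class name, the `grace_period_s` setting and the `am_is_alive` flag are all hypothetical. The liveness flag could be derived from whether an ephemeral znode owned by the AM still exists, but that lookup is left out here.

```python
import time
from typing import Callable, Optional


class StrandedAgentWatchdog:
    """Hypothetical helper: decides when an agent should give up on a failed AM."""

    def __init__(self, grace_period_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        # grace_period_s stands in for the application-specific time period.
        self.grace_period_s = grace_period_s
        self._clock = clock
        # None while the AM is considered alive, else the time it was first seen missing.
        self._am_lost_at: Optional[float] = None

    def observe(self, am_is_alive: bool) -> bool:
        """Record one liveness check; return True if the agent should shut down."""
        now = self._clock()
        if am_is_alive:
            # The AM came back (or never left): reset the countdown.
            self._am_lost_at = None
            return False
        if self._am_lost_at is None:
            self._am_lost_at = now
        return (now - self._am_lost_at) >= self.grace_period_s
```

An application that needs strong affinity would simply configure a large grace period, so its agents keep waiting for the AM much longer before terminating themselves.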


was (Author: gsaha):
That makes sense, but it might be tricky to define "long period of time" since for long running services (specifically for applications demanding a lot of affinity) the agents are designed to continue to run and wait for the failed AM to come back up. Definitely an application specific configurable time-period can be exposed. Would ephemeral nodes in zk make sense to control the kill of agents?

> Provide a slider command to kill all stranded containers continuing to run post stop command
> --------------------------------------------------------------------------------------------
>
>                 Key: SLIDER-479
>                 URL: https://issues.apache.org/jira/browse/SLIDER-479
>             Project: Slider
>          Issue Type: Bug
>            Reporter: Gour Saha
>             Fix For: Slider 2.0.0
>
>
> A container can continue to run even after a slider stop command has been issued. One such scenario is when the NM of a non-Slider-AM node is lost and the slider stop command is issued before the Slider-AM can clean up the stranded agent (and the application processes). In such a scenario, even if the NM is brought back up, it will not kill these containers.
> In a large cluster with several applications deployed/managed by slider there could easily be numerous such stranded containers.
> Slider client could expose a "stop-all" command or maybe an option "stop --clean" (or anything appropriate for this task) to do the cleanup. It can bring up the Slider-AM in clean mode (say) which will not start any application but will simply register to ZK and wait for agents to heart-beat into it. Each one of these agents will receive the terminate command from the AM and will do necessary cleanup and shutdown.
> This new command can be issued only after an application has been stopped. When invoked while the application is running this command should fail providing relevant information. This command can also provide a summary of how many stranded containers it cleaned up.
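
The clean-mode behaviour described above could look roughly like the sketch below. This is an assumption-laden illustration, not the Slider-AM implementation: `CleanModeAM`, `ApplicationStillRunning`, `on_heartbeat` and the `"TERMINATE"` reply are hypothetical names, and ZK registration and the real agent protocol are omitted.

```python
class ApplicationStillRunning(Exception):
    """Raised when cleanup is requested for an application that has not been stopped."""


class CleanModeAM:
    """Hypothetical AM that starts no application and only terminates stranded agents."""

    def __init__(self, app_name: str, app_is_running: bool) -> None:
        # The command is only valid after the application has been stopped.
        if app_is_running:
            raise ApplicationStillRunning(
                f"{app_name} is still running; stop it before cleaning up")
        self.app_name = app_name
        self._terminated = set()  # container ids that were told to terminate

    def on_heartbeat(self, container_id: str) -> str:
        """Answer every agent heartbeat with a terminate command."""
        self._terminated.add(container_id)
        return "TERMINATE"

    def summary(self) -> str:
        """Report how many distinct stranded containers were cleaned up."""
        return (f"{self.app_name}: cleaned up "
                f"{len(self._terminated)} stranded container(s)")
```

Tracking container ids in a set means an agent that heartbeats more than once before shutting down is still counted only once in the summary.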



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)