Posted to dev@slider.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2014/09/18 11:53:34 UTC

[jira] [Commented] (SLIDER-438) Slider agent continues to run in the container on a node where NM dies

    [ https://issues.apache.org/jira/browse/SLIDER-438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138736#comment-14138736 ] 

Steve Loughran commented on SLIDER-438:
---------------------------------------

# Do you know whether, once the container is lost, the AM can even kill it? I doubt it, as that message probably goes to the NM, which won't be there.
# Otherwise, the AM could tell the provider to release the container when its agent next heartbeats in, and have the agent terminate itself.
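
A rough sketch of the second option, in case it helps the discussion. This is not Slider's actual agent protocol: the class names, the `terminate_agent` flag, and the container IDs are all made up for illustration. The idea is only that the AM answers a heartbeat from a container it no longer tracks with a "terminate yourself" instruction.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatResponse:
    # Hypothetical response shape, not the real Slider heartbeat message.
    terminate_agent: bool = False


@dataclass
class AppMaster:
    # Container IDs the AM currently considers live.
    live_containers: set = field(default_factory=set)

    def on_container_lost(self, container_id):
        # The RM reported the container as lost, so stop tracking it.
        self.live_containers.discard(container_id)

    def on_heartbeat(self, container_id):
        # An agent heartbeating from an untracked container is told to exit.
        unknown = container_id not in self.live_containers
        return HeartbeatResponse(terminate_agent=unknown)


class Agent:
    def __init__(self, container_id):
        self.container_id = container_id
        self.running = True

    def heartbeat(self, am):
        response = am.on_heartbeat(self.container_id)
        if response.terminate_agent:
            # A real agent would stop its managed process before exiting.
            self.running = False
        return response
```

With this, the stale HRegionServer agent would shut itself down on its first heartbeat after the AM marks its container lost, while the replacement agent keeps running. It does not depend on the NM being reachable, which is the point of routing the instruction through the heartbeat.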

> Slider agent continues to run in the container on a node where NM dies
> ----------------------------------------------------------------------
>
>                 Key: SLIDER-438
>                 URL: https://issues.apache.org/jira/browse/SLIDER-438
>             Project: Slider
>          Issue Type: Bug
>          Components: agent, agent-provider
>            Reporter: Gour Saha
>            Assignee: Gour Saha
>
> Steps to reproduce:
> - Set up a 3-node cluster (in non-HA mode)
> - Run slider create for the HBase app-package (with HMaster and HRegionServer components only, just to keep things simple)
> - Let's assume that the HRegionServer came up on a node different from that of the HMaster and the Slider AM (if not, doing destroy-create a couple of times will definitely get you to this setup)
> - Kill the NM on the node where the HRegionServer is running
> - Restart the NM within 10 minutes (the default time after which the RM marks the node as LOST, configurable using yarn.nm.liveness-monitor.expiry-interval-ms)
> - At this point the Slider AM had received the container-lost event from the RM, marked the container as lost, and requested a new one from the RM. A new HRegionServer container came up (on the same host where the old one was running). Both HRegionServer containers then continued to run happily alongside each other, each successfully heartbeating to the AM.
> Expected:
> - Given that the first HRegionServer instance was still heartbeating to the AM, the AM should be able to send a kill signal and bring the agent/container down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)