Posted to yarn-issues@hadoop.apache.org by "Prabhu Joseph (Jira)" <ji...@apache.org> on 2021/08/17 08:51:00 UTC
[jira] [Resolved] (YARN-10873) Graceful Decommission ignores launched containers and gets deactivated before timeout

     [ https://issues.apache.org/jira/browse/YARN-10873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prabhu Joseph resolved YARN-10873.
----------------------------------
    Resolution: Fixed

> Graceful Decommission ignores launched containers and gets deactivated before timeout
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-10873
>                 URL: https://issues.apache.org/jira/browse/YARN-10873
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: RM
>    Affects Versions: 3.3.1
>            Reporter: Prabhu Joseph
>            Assignee: Srinivas S T
>            Priority: Major
>             Fix For: 3.4.0
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> A node under graceful decommission is deactivated before the timeout expires even though containers have been launched on it.
> On a status update from a node that is in DECOMMISSIONING, the RM transitions the node to DECOMMISSIONED before the timeout if the node has no running applications. The list of running applications is built from the container statuses reported by the NodeManager, so a container that has been launched but not yet reported is not counted. We have observed containers being launched at the NodeManager while, at the same time, the ResourceManager forcefully decommissions the node.
> This affects Livy interactive jobs, which support only one application attempt.
> The suggestion is to check FiCaSchedulerNode for launched containers and use that to decide whether to forcefully decommission the node.
> {code}
>   public static class StatusUpdateWhenHealthyTransition implements
>       MultipleArcTransition<RMNodeImpl, RMNodeEvent, NodeState> {
>     @Override
>     public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) {
>       .....
>       if (isNodeDecommissioning) {
>         List<ApplicationId> keepAliveApps = statusEvent.getKeepAliveAppIds();
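>         // Only applications already reported by the NodeManager are checked here;
>         // containers launched but not yet reported do not block deactivation.
>         if (rmNode.runningApplications.isEmpty() &&
>             (keepAliveApps == null || keepAliveApps.isEmpty())) {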
>           RMNodeImpl.deactivateNode(rmNode, NodeState.DECOMMISSIONED);
>           return NodeState.DECOMMISSIONED;
>         }
>       }
> {code}
> *ResourceManager Logs:*
> {code}
> 2021-06-16 08:45:04,140 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Launching masterappattempt_1623830067124_0382_000001
> 2021-06-16 08:45:04,141 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Setting up container Container: [ContainerId: container_1623830067124_0382_01_000001, AllocationRequestId: 0, Version: 0, NodeId: node1:34753, NodeHttpAddress: 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: <memory:29696, vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM appattempt_1623830067124_0382_000001
> 2021-06-16 08:45:04,141 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Create AMRMToken for ApplicationAttempt: appattempt_1623830067124_0382_000001
> 2021-06-16 08:45:04,141 INFO org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager: Creating password for appattempt_1623830067124_0382_000001
> 2021-06-16 08:45:04,154 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_1623830067124_0382_01_000001, AllocationRequestId: 0, Version: 0, NodeId: node1:34753, NodeHttpAddress: 927a9ef942b24b1eaa0e99c39d4e73f90224b902983:8042, Resource: <memory:29696, vCores:4>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.1.2.3:34753 }, ExecutionType: GUARANTEED, ] for AM appattempt_1623830067124_0382_000001
> 2021-06-16 08:45:04,776 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully decommission node node1:34753 with state RUNNING
> 2021-06-16 08:45:04,776 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node node1:34753 in DECOMMISSIONING.
> 2021-06-16 08:45:04,776 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 Node Transitioned from RUNNING to DECOMMISSIONING
> 2021-06-16 08:45:05,131 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node node1:34753 as it is now DECOMMISSIONED
> 2021-06-16 08:45:05,131 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node1:34753 Node Transitioned from DECOMMISSIONING to DECOMMISSIONED
> 2021-06-16 08:45:05,131 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1623830067124_0382_01_000001 Container Transitioned from ACQUIRED to KILLED
> {code}
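
The check suggested in the description can be sketched as a small standalone decision function. This is only an illustration under stated assumptions, not the committed fix: the class DecommissionCheck, the method canDeactivateEarly, and the launchedContainers parameter are hypothetical names, and the container count is assumed to come from the scheduler's view of the node (for example FiCaSchedulerNode, as the description proposes). Application IDs are modelled as plain strings to keep the example self-contained.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of RMNodeImpl. */
final class DecommissionCheck {

  private DecommissionCheck() {
  }

  /**
   * Returns true only when nothing is left on the node: no running
   * applications, no keep-alive applications, and no launched containers.
   * The launchedContainers argument is an assumption: it stands for the
   * number of containers the scheduler believes are on the node.
   */
  static boolean canDeactivateEarly(Set<String> runningApps,
      List<String> keepAliveApps, int launchedContainers) {
    boolean noRunningApps = runningApps.isEmpty();
    boolean noKeepAliveApps =
        keepAliveApps == null || keepAliveApps.isEmpty();
    boolean noLaunchedContainers = launchedContainers == 0;
    return noRunningApps && noKeepAliveApps && noLaunchedContainers;
  }

  public static void main(String[] args) {
    // Situation from the logs: the AM container is launched, but the
    // NodeManager has not yet reported its application as running.
    System.out.println(
        canDeactivateEarly(Collections.emptySet(), null, 1)); // false

    // A node that is truly idle can be deactivated before the timeout.
    System.out.println(
        canDeactivateEarly(Collections.emptySet(),
            Collections.emptyList(), 0)); // true
  }
}
```

With only the first two conditions, as in the quoted transition, the first call would return true, which matches the early DECOMMISSIONED transition seen in the ResourceManager logs.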
