Posted to yarn-issues@hadoop.apache.org by "Wangda Tan (JIRA)" <ji...@apache.org> on 2015/09/12 01:38:46 UTC
[jira] [Commented] (YARN-3212) RMNode State Transition Update with
DECOMMISSIONING state
[ https://issues.apache.org/jira/browse/YARN-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741738#comment-14741738 ]
Wangda Tan commented on YARN-3212:
----------------------------------
Hi [~djp],
Thanks for working on this JIRA. I just took a look at it:
*1) ResourceTrackerService:*
Question:
1. Why shut down a "decommissioning" NM while it is still heartbeating? Should we allow it to continue heartbeating, since the RM needs to know when its containers finish or are killed?
*2) RMNodeImpl:*
Question:
2. Do we have a timeout for graceful decommission, which would move a node to "DECOMMISSIONED" once the timeout expires?
3. If I understand correctly, decommissioning is another running state, except:
- We cannot allocate any new containers to it.
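To make questions 1-3 concrete, here is a minimal sketch of the behavior I would expect. It is not taken from the patch or from RMNodeImpl: the class `GracefulDecommissionSketch`, its methods, and the millisecond timeout are all hypothetical, and treating "drained" as "zero running containers" is my assumption.

```java
// Hypothetical sketch only; none of these names come from RMNodeImpl or the patch.
final class GracefulDecommissionSketch {
    enum State { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

    private State state = State.RUNNING;
    private long decommissionStartMs;
    private final long timeoutMs; // assumed graceful-decommission timeout

    GracefulDecommissionSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    State getState() {
        return state;
    }

    // Only a RUNNING node may be given new containers.
    boolean canAllocate() {
        return state == State.RUNNING;
    }

    // Start graceful decommission. A repeated request while already
    // DECOMMISSIONING is ignored here (an assumption, not the patch's behavior).
    void startDecommission(long nowMs) {
        if (state == State.RUNNING) {
            state = State.DECOMMISSIONING;
            decommissionStartMs = nowMs;
        }
    }

    // Heartbeats are still accepted while DECOMMISSIONING so that container
    // completions can be reported. Returns false only if the node was already
    // DECOMMISSIONED when the heartbeat arrived.
    boolean onHeartbeat(long nowMs, int runningContainers) {
        if (state == State.DECOMMISSIONED) {
            return false;
        }
        if (state == State.DECOMMISSIONING) {
            boolean drained = runningContainers == 0;
            boolean timedOut = nowMs - decommissionStartMs >= timeoutMs;
            if (drained || timedOut) {
                state = State.DECOMMISSIONED;
            }
        }
        return true;
    }
}
```

In this sketch a node leaves DECOMMISSIONING either because it has drained or because the timeout expired, whichever comes first.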
Comments:
- If the answer to question #2 is no, I suggest renaming RMNodeEventType.DECOMMISSION_WITH_TIMEOUT to GRACEFUL_DECOMMISSION, since it doesn't have a "real" timeout.
- Why is this needed?
{code}
.addTransition(NodeState.DECOMMISSIONING, NodeState.DECOMMISSIONING,
RMNodeEventType.DECOMMISSION_WITH_TIMEOUT,
new DecommissioningNodeTransition(NodeState.DECOMMISSIONING))
{code}
Should we simply ignore the DECOMMISSION_WITH_TIMEOUT event when the node is already DECOMMISSIONING?
- Is there a specific reason to transition UNHEALTHY to DECOMMISSIONED when DECOMMISSION_WITH_TIMEOUT is received? Would it be better to transition it to DECOMMISSIONING, since it may still have containers running on it?
- One suggestion for how to handle node updates in the scheduler: I think you can add a field "isDecommissioning" to NodeUpdateSchedulerEvent, so the scheduler can do all updates except allocating containers.
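A rough sketch of that last suggestion follows. The types below are stand-ins written for illustration, not the real YARN classes: `NodeUpdateEventSketch`, `SchedulerSketch`, and the counters are hypothetical, and the single-counter "allocation" is only a placeholder for whatever the scheduler actually does.

```java
// Stand-in types for illustration only; these are not the real YARN classes.
final class NodeUpdateEventSketch {
    final String nodeId;
    final int completedContainers;
    final boolean isDecommissioning; // the proposed new field

    NodeUpdateEventSketch(String nodeId, int completedContainers,
                          boolean isDecommissioning) {
        this.nodeId = nodeId;
        this.completedContainers = completedContainers;
        this.isDecommissioning = isDecommissioning;
    }
}

final class SchedulerSketch {
    int releasedContainers;  // containers whose completion was processed
    int allocatedContainers; // containers newly placed on nodes

    void handle(NodeUpdateEventSketch event) {
        // Bookkeeping for finished or killed containers always runs,
        // even for a node that is being decommissioned.
        releasedContainers += event.completedContainers;

        // The only step skipped for a decommissioning node is allocation.
        if (event.isDecommissioning) {
            return;
        }
        allocatedContainers++; // placeholder for the real allocation logic
    }
}
```

The point of the flag is that the scheduler keeps a single update path and only short-circuits before allocation, rather than needing a separate event type for decommissioning nodes.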
> RMNode State Transition Update with DECOMMISSIONING state
> ---------------------------------------------------------
>
> Key: YARN-3212
> URL: https://issues.apache.org/jira/browse/YARN-3212
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Reporter: Junping Du
> Assignee: Junping Du
> Attachments: RMNodeImpl - new.png, YARN-3212-v1.patch, YARN-3212-v2.patch, YARN-3212-v3.patch, YARN-3212-v4.1.patch, YARN-3212-v4.patch, YARN-3212-v5.1.patch, YARN-3212-v5.patch
>
>
> As proposed in YARN-914, a new “DECOMMISSIONING” state will be added; a node enters it from the “running” state when triggered by a new “decommissioning” event.
> This new state can transition to “decommissioned” on Resource_Update if there are no running apps on this NM, when the NM reconnects after a restart, or when it receives a DECOMMISSIONED event (after timeout from CLI).
> In addition, it can go back to “running” if the user decides to cancel the previous decommission by calling recommission on the same node. The reaction to other events is similar to the RUNNING state.