Posted to issues@ozone.apache.org by "Stephen O'Donnell (Jira)" <ji...@apache.org> on 2019/11/29 16:27:00 UTC
[jira] [Commented] (HDDS-2607) DeadNodeHandler should not remove replica for a dead maintenance node

    [ https://issues.apache.org/jira/browse/HDDS-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985106#comment-16985106 ] 

Stephen O'Donnell commented on HDDS-2607:
-----------------------------------------

The NodeStateManager is responsible for firing a "dead node" event, but it currently only does this if the node is "IN_SERVICE". It does not fire the event if the node is DECOMMISSIONING, DECOMMISSIONED, ENTERING_MAINTENANCE or IN_MAINTENANCE.

As part of this Jira we need to fix this, because the only time a dead node should not have the dead node event fired is when it is IN_MAINTENANCE. In every other state, a "dead node" event should clear the node's container replicas as usual.
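
A minimal sketch of the rule described above, contrasting the current behaviour with the proposed one. The class and method names (DeadNodeEventPolicy, firesToday, shouldFire) and the nested enum are hypothetical and are not the actual NodeStateManager code; only the state names come from this comment.

```java
// Hypothetical, self-contained illustration; not the real SCM classes.
final class DeadNodeEventPolicy {

  // Operational states named in this comment.
  enum OpState {
    IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED,
    ENTERING_MAINTENANCE, IN_MAINTENANCE
  }

  private DeadNodeEventPolicy() { }

  // Current behaviour as described: only IN_SERVICE nodes get the event.
  static boolean firesToday(OpState state) {
    return state == OpState.IN_SERVICE;
  }

  // Proposed behaviour: every state except IN_MAINTENANCE gets the event.
  static boolean shouldFire(OpState state) {
    return state != OpState.IN_MAINTENANCE;
  }

  public static void main(String[] args) {
    for (OpState s : OpState.values()) {
      System.out.println(s + ": current=" + firesToday(s)
          + ", proposed=" + shouldFire(s));
    }
  }
}
```

The two predicates differ for DECOMMISSIONING, DECOMMISSIONED and ENTERING_MAINTENANCE, which are the states this Jira needs to change.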

It is also important that the DatanodeAdminMonitor aborts its workflow for any node which goes dead while maintenance is in progress (unless it has already reached IN_MAINTENANCE), for several reasons:

1. The dead node event will delete all the container replicas for the node, so it's impossible to track them for replication correctly.
2. This could result in a node which has not completed decom / maintenance getting marked as completed.
3. If the node returns to service, the state of the cluster may have changed and new pipelines should be created, etc., meaning the admin workflow needs to restart.

In this Jira, we should therefore consider:

1. Resetting the node's OperationalState to "IN_SERVICE" as part of the dead node handling.
2. Ensuring the dead node event gets triggered for all operational states except IN_MAINTENANCE.
3. Aborting the maintenance workflow if the health of any node becomes "DEAD".
4. How to trigger a dead node event for a node which is dead and was IN_MAINTENANCE, once maintenance has ended either automatically or manually.
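
One possible shape for the four points above, as a rough sketch only. Everything here (DeadNodeFlowSketch, Node, onNodeDead, onMaintenanceEnd, the adminWorkflowActive flag, and modelling the dead node event as clearing an in-memory replica list) is a hypothetical stand-in, not the existing NodeStateManager, DeadNodeHandler or DatanodeAdminMonitor code. In particular, the handling of point 4 is just one option among several.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, self-contained illustration; not the real SCM classes.
final class DeadNodeFlowSketch {

  enum OpState {
    IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED,
    ENTERING_MAINTENANCE, IN_MAINTENANCE
  }

  // Hypothetical in-memory node record.
  static final class Node {
    OpState opState;
    boolean dead;
    // Simplification: any state other than IN_SERVICE counts as an
    // active admin (decommission / maintenance) workflow.
    boolean adminWorkflowActive;
    final List<String> replicas = new ArrayList<>();

    Node(OpState opState, String... replicas) {
      this.opState = opState;
      this.adminWorkflowActive = opState != OpState.IN_SERVICE;
      this.replicas.addAll(Arrays.asList(replicas));
    }
  }

  private DeadNodeFlowSketch() { }

  // Called when the node's health becomes DEAD.
  static void onNodeDead(Node node) {
    node.dead = true;
    if (node.opState == OpState.IN_MAINTENANCE) {
      return;                             // keep replicas, fire no event
    }
    node.adminWorkflowActive = false;     // point 3: abort the workflow
    node.opState = OpState.IN_SERVICE;    // point 1: reset the state
    fireDeadNodeEvent(node);              // point 2: fire for other states
  }

  // Called when maintenance ends, automatically or manually.
  static void onMaintenanceEnd(Node node) {
    if (node.opState != OpState.IN_MAINTENANCE) {
      return;
    }
    node.opState = OpState.IN_SERVICE;
    node.adminWorkflowActive = false;
    if (node.dead) {
      fireDeadNodeEvent(node);            // point 4: deferred dead node event
    }
  }

  // Stand-in for the dead node event: drop the node's replicas.
  private static void fireDeadNodeEvent(Node node) {
    node.replicas.clear();
  }
}
```

In this sketch a node that dies while IN_MAINTENANCE keeps its replicas until onMaintenanceEnd runs, at which point they are removed and the affected containers would become under-replicated.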

> DeadNodeHandler should not remove replica for a dead maintenance node
> ---------------------------------------------------------------------
>
>                 Key: HDDS-2607
>                 URL: https://issues.apache.org/jira/browse/HDDS-2607
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>          Components: SCM
>    Affects Versions: 0.5.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>
> Normally, when a node goes dead, the DeadNodeHandler removes all the containers and replica associated with the node from the ContainerManager.
> If a node is IN_MAINTENANCE and goes dead, then we do not want to remove its replica. They should remain present in the system to prevent the container being marked as under-replicated.
> We also need to consider the case where the node is dead, and then maintenance expires automatically. In that case, the replica associated with the node must be removed and the affected containers will become under-replicated.


