Posted to hdfs-dev@hadoop.apache.org by "Ming Ma (JIRA)" <ji...@apache.org> on 2014/07/04 04:27:34 UTC

[jira] [Created] (HDFS-6626) Node is marked decommissioned if it becomes dead when it is being decommissioned

Ming Ma created HDFS-6626:
-----------------------------

             Summary: Node is marked decommissioned if it becomes dead when it is being decommissioned
                 Key: HDFS-6626
                 URL: https://issues.apache.org/jira/browse/HDFS-6626
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Ming Ma


I'm not sure whether this is by design, but it isn't intuitive. The scenario is as follows: you start decommissioning a node; while the decommission is in progress, the node becomes dead from the NameNode's (NN's) point of view; right after that, the NN marks the node as decommissioned. On the web UI, administrators will then assume the decommission completed successfully. This happens because decommission is considered done as soon as no blocks are left for the DataNode (DN), and once the DN is dead, the NN no longer has any blocks associated with it.

{noformat}
BlockManager.java
  boolean isReplicationInProgress(DatanodeDescriptor srcNode) {
    boolean status = false;
    ...
    final Iterator<? extends Block> it = srcNode.getBlockIterator();
    while(it.hasNext()) {
      ...
      // set status if there is block under replication
    }
    ...
    return status;
  }
{noformat}
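To make the failure mode concrete, here is a minimal, self-contained Java model of the check above. It is a sketch under stated assumptions, not HDFS code: `DecommissionCheckSketch`, `Node`, `BlockInfo` and `decideState` are hypothetical names, and the model assumes (as the report implies) that a dead DN has no blocks left in the NN's view. Because liveness never enters the decision, an empty block list is enough to report the node as decommissioned.

```java
import java.util.List;

// Hypothetical model of the quoted check; these are NOT Hadoop classes.
public class DecommissionCheckSketch {

    enum AdminState { DECOMMISSION_IN_PROGRESS, DECOMMISSIONED }

    // One block stored on the node: live replicas vs. expected replication.
    record BlockInfo(long id, int liveReplicas, int expectedReplicas) {}

    // blocks = replicas the NN still associates with this node.
    // Assumption (per the report): empty once the node has been declared dead.
    record Node(String name, boolean alive, List<BlockInfo> blocks) {}

    // Same shape as isReplicationInProgress(): status is set if any block is
    // still under-replicated. An empty block list always yields false.
    static boolean isReplicationInProgress(Node node) {
        boolean status = false;
        for (BlockInfo b : node.blocks()) {
            if (b.liveReplicas() < b.expectedReplicas()) {
                status = true;
            }
        }
        return status;
    }

    // Liveness is never consulted, so a dead node with no tracked blocks
    // is reported as DECOMMISSIONED.
    static AdminState decideState(Node node) {
        return isReplicationInProgress(node)
                ? AdminState.DECOMMISSION_IN_PROGRESS
                : AdminState.DECOMMISSIONED;
    }

    public static void main(String[] args) {
        Node busy = new Node("dn1", true, List.of(new BlockInfo(1L, 2, 3)));
        Node dead = new Node("dn2", false, List.of());
        System.out.println(decideState(busy)); // DECOMMISSION_IN_PROGRESS
        System.out.println(decideState(dead)); // DECOMMISSIONED
    }
}
```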

The question is whether we should mark the dead node as decommissioned (the current behavior) or as "decommission aborted". From the administrators' point of view, when they decommission nodes they want to know both the decommission status and the health of the nodes still being decommissioned. If they can detect a decommission failure earlier, they can act earlier; for example, if the top-of-rack (TOR) switch fails during decommission, administrators could quickly spot a group of "decommission aborted" nodes from the same rack. People can still get this information today by cross-referencing the decommissioning node list with the recent dead node list on the web UI; it is just less convenient.
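The proposed alternative can be sketched the same way. In this hypothetical model (again not HDFS code; `DecommissionAbortSketch`, `Node`, the `remainingUnderReplicated` field, `DECOMMISSION_ABORTED` and the rack paths are all illustrative), liveness is checked before the block count, and aborted nodes are grouped by rack, which is the kind of view that would make a failing TOR switch stand out. Filtering the decommissioning list on liveness is the same join administrators currently do by hand on the web UI.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch of a "decommission aborted" state; NOT Hadoop APIs.
public class DecommissionAbortSketch {

    enum AdminState { DECOMMISSION_IN_PROGRESS, DECOMMISSIONED, DECOMMISSION_ABORTED }

    // remainingUnderReplicated stands in for the result of the block scan;
    // per the report it is 0 for a node the NN has declared dead.
    record Node(String name, String rack, boolean alive, int remainingUnderReplicated) {}

    // Proposed rule: a node that dies mid-decommission is aborted, not done,
    // even though it has no blocks left.
    static AdminState decideState(Node node) {
        if (!node.alive()) {
            return AdminState.DECOMMISSION_ABORTED;
        }
        return node.remainingUnderReplicated() > 0
                ? AdminState.DECOMMISSION_IN_PROGRESS
                : AdminState.DECOMMISSIONED;
    }

    // Groups aborted nodes by rack (sorted by rack name); several aborts on
    // one rack hint at a shared cause such as a faulty TOR switch.
    static Map<String, List<String>> abortedByRack(List<Node> decommissioning) {
        return decommissioning.stream()
                .filter(n -> decideState(n) == AdminState.DECOMMISSION_ABORTED)
                .collect(Collectors.groupingBy(Node::rack, TreeMap::new,
                        Collectors.mapping(Node::name, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
                new Node("dn1", "/rack1", false, 0),
                new Node("dn2", "/rack1", false, 0),
                new Node("dn3", "/rack2", true, 0),
                new Node("dn4", "/rack2", true, 7));
        System.out.println(abortedByRack(nodes)); // {/rack1=[dn1, dn2]}
    }
}
```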

Suggestions?



--
This message was sent by Atlassian JIRA
(v6.2#6252)