Posted to common-issues@hadoop.apache.org by GitBox <gi...@apache.org> on 2021/11/17 22:09:48 UTC

[GitHub] [hadoop] KevinWikant opened a new pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

KevinWikant opened a new pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675


   ### Description of PR
   
   Fixes a bug in Hadoop HDFS where, if more than "dfs.namenode.decommission.max.concurrent.tracked.nodes" datanodes are lost while in state decommissioning, all forward progress towards decommissioning any datanodes (including healthy datanodes) is blocked.
   
   JIRA: https://issues.apache.org/jira/browse/HDFS-16303
   
   ### How was this patch tested?
   
   #### Unit Testing
   
   Added new unit tests:
   - TestDecommission.testRequeueUnhealthyDecommissioningNodes (& TestDecommissionWithBackoffMonitor.testRequeueUnhealthyDecommissioningNodes)
   - DatanodeAdminMonitorBase.testPendingNodesQueueOrdering
   - DatanodeAdminMonitorBase.testPendingNodesQueueReverseOrdering
   
   All "TestDecommission", "TestDecommissionWithBackoffMonitor", & "DatanodeAdminMonitorBase" tests pass when run locally
   
   Note that without the "DatanodeAdminManager" changes, the new test "testRequeueUnhealthyDecommissioningNodes" fails because it times out waiting for the healthy nodes to be decommissioned:
   
   ```
   > mvn -Dtest=TestDecommission#testRequeueUnhealthyDecommissioningNodes test
   ...
   [ERROR] Errors: 
   [ERROR]   TestDecommission.testRequeueUnhealthyDecommissioningNodes:1776 » Timeout Timed...
   ```
   
   ```
   > mvn -Dtest=TestDecommissionWithBackoffMonitor#testRequeueUnhealthyDecommissioningNodes test
   ...
   [ERROR] Errors: 
   [ERROR]   TestDecommissionWithBackoffMonitor>TestDecommission.testRequeueUnhealthyDecommissioningNodes:1776 » Timeout
   ```
   
   #### Manual Testing
   
   - create Hadoop cluster with:
       - 30 datanodes initially
       - custom Namenode JAR containing this change
       - hdfs-site configuration "dfs.namenode.decommission.max.concurrent.tracked.nodes = 10"
   
   ```
   > cat /etc/hadoop/conf/hdfs-site.xml | grep -A 1 'tracked'
       <name>dfs.namenode.decommission.max.concurrent.tracked.nodes</name>
       <value>10</value>
   ```
   
   - reproduce the bug: https://issues.apache.org/jira/browse/HDFS-16303
       - start decommissioning over 20 datanodes
       - terminate 20 datanodes while they are in state decommissioning
       - observe the Namenode logs to validate that there are 20 unhealthy datanodes stuck "in Decommission In Progress"
   
   ```
   2021-11-15 17:57:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
   2021-11-15 17:57:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 17:58:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
   2021-11-15 17:58:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 17:58:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
   2021-11-15 17:58:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 17:59:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
   2021-11-15 17:59:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.
   ```
   
   - scale up to 25 healthy datanodes & then decommission 22 of those datanodes (all but 3)
       - observe the Namenode logs to validate those 22 healthy datanodes are decommissioned (i.e. HDFS-16303 is solved)
   
   ```
   2021-11-15 17:59:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
   2021-11-15 17:59:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 18:00:14,487 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 42 nodes decommissioning but only 10 nodes will be tracked at a time. 32 nodes are currently queued waiting to be decommissioned.
   2021-11-15 18:00:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 42 nodes decommissioning but only 10 nodes will be tracked at a time. 32 nodes are currently queued waiting to be decommissioned.
   2021-11-15 18:01:14,486 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 32 nodes decommissioning but only 10 nodes will be tracked at a time. 32 nodes are currently queued waiting to be decommissioned.
   2021-11-15 18:01:44,486 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 32 nodes decommissioning but only 10 nodes will be tracked at a time. 22 nodes are currently queued waiting to be decommissioned.
   2021-11-15 18:02:14,486 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 22 nodes decommissioning but only 10 nodes will be tracked at a time. 22 nodes are currently queued waiting to be decommissioned.
   
   2021-11-15 18:02:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 12 nodes are currently queued waiting to be decommissioned.
   2021-11-15 18:02:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 8 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 18:03:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
   2021-11-15 18:03:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.
   ```
   
   ### For code changes:
   
   - [yes] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [n/a] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
   - [n/a] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [no] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
   




[GitHub] [hadoop] aajisaka merged pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
aajisaka merged pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675


   




[GitHub] [hadoop] KevinWikant commented on a change in pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r770854361



##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
##########
@@ -1654,4 +1658,204 @@ public Boolean get() {
 
     cleanupFile(fileSys, file);
   }
+
+  /**
+   * Test DatanodeAdminManager logic to re-queue unhealthy decommissioning nodes
+   * which are blocking the decommissioning of healthy nodes.
+   * Force the tracked nodes set to be filled with nodes lost while decommissioning,
+   * then decommission healthy nodes & validate they are decommissioned eventually.
+   */
+  @Test(timeout = 120000)
+  public void testRequeueUnhealthyDecommissioningNodes() throws Exception {
+    // Create a MiniDFSCluster with 3 live datanode in AdminState=NORMAL and
+    // 2 dead datanodes in AdminState=DECOMMISSION_INPROGRESS and a file
+    // with replication factor of 5.
+    final int numLiveNodes = 3;
+    final int numDeadNodes = 2;
+    final int numNodes = numLiveNodes + numDeadNodes;
+    final List<DatanodeDescriptor> liveNodes = new ArrayList<>();
+    final Map<DatanodeDescriptor, MiniDFSCluster.DataNodeProperties> deadNodeProps =
+        new HashMap<>();
+    final ArrayList<DatanodeInfo> decommissionedNodes = new ArrayList<>();
+    final Path filePath = new Path("/tmp/test");
+    createClusterWithDeadNodesDecommissionInProgress(numLiveNodes, liveNodes, numDeadNodes,
+        deadNodeProps, decommissionedNodes, filePath);
+    final FSNamesystem namesystem = getCluster().getNamesystem();
+    final BlockManager blockManager = namesystem.getBlockManager();
+    final DatanodeManager datanodeManager = blockManager.getDatanodeManager();
+    final DatanodeAdminManager decomManager = datanodeManager.getDatanodeAdminManager();
+
+    // Validate the 2 "dead" nodes are not removed from the tracked nodes set
+    // after several seconds of operation
+    final Duration checkDuration = Duration.ofSeconds(5);
+    Instant checkUntil = Instant.now().plus(checkDuration);
+    while (Instant.now().isBefore(checkUntil)) {
+      BlockManagerTestUtil.recheckDecommissionState(datanodeManager);
+      assertEquals(0, decomManager.getNumPendingNodes());
+      assertEquals(numDeadNodes, decomManager.getNumTrackedNodes());
+      assertTrue(deadNodeProps.keySet().stream()
+          .allMatch(node -> node.getAdminState().equals(AdminStates.DECOMMISSION_INPROGRESS)));
+      Thread.sleep(500);
+    }
+
+    // Delete the file such that it's no longer a factor blocking decommissioning of live nodes
+    // which have block replicas for that file
+    getCluster().getFileSystem().delete(filePath, true);
+
+    // Start decommissioning 2 "live" datanodes
+    int numLiveDecommNodes = 2;
+    final List<DatanodeDescriptor> liveDecommNodes = liveNodes.subList(0, numLiveDecommNodes);
+    for (final DatanodeDescriptor liveNode : liveDecommNodes) {
+      takeNodeOutofService(0, liveNode.getDatanodeUuid(), 0, decommissionedNodes,
+          AdminStates.DECOMMISSION_INPROGRESS);
+      decommissionedNodes.add(liveNode);
+    }
+
+    // Write a new file such that there are under-replicated blocks preventing decommissioning
+    // of dead nodes
+    writeFile(getCluster().getFileSystem(), filePath, numNodes, 10);
+
+    // Validate that the live datanodes are put into the pending decommissioning queue
+    GenericTestUtils.waitFor(() -> decomManager.getNumTrackedNodes() == numDeadNodes
+            && decomManager.getNumPendingNodes() == numLiveDecommNodes
+            && liveDecommNodes.stream().allMatch(
+                node -> node.getAdminState().equals(AdminStates.DECOMMISSION_INPROGRESS)),
+        500, 30000);
+    for (final DatanodeDescriptor node : decomManager.getPendingNodes()) {
+      assertTrue(liveDecommNodes.contains(node));
+    }

Review comment:
       I have modified the relevant parts of the new unit test to use "assertThat". Note that IntelliJ is reporting that, in the Maven JUnit version, "assertThat" is deprecated.
   
   As per the guidance here [https://junit.org/junit4/javadoc/4.13/deprecated-list.html], I have used "org.hamcrest.MatcherAssert.assertThat" instead of the deprecated "org.junit.Assert.assertThat".
   
   I have also added error messages to the assertions.
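   
   For context, here is a minimal illustrative JUnit snippet (not taken from the patch; the test name and values are made up) showing the non-deprecated Hamcrest form with a descriptive failure message:
   
   ```
   import static org.hamcrest.MatcherAssert.assertThat;
   import static org.hamcrest.Matchers.hasItem;
   
   import java.util.Arrays;
   import java.util.List;
   import org.junit.Test;
   
   public class AssertThatExample {
     @Test
     public void pendingNodeShouldBeALiveDecommissioningNode() {
       List<String> liveDecommNodes = Arrays.asList("dn-1", "dn-2");
       String pendingNode = "dn-2";
       // Hamcrest assertThat accepts a reason string, unlike the deprecated JUnit variant
       assertThat("Pending node should be one of the live decommissioning nodes",
           liveDecommNodes, hasItem(pendingNode));
     }
   }
   ```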

##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java
##########
@@ -151,4 +163,34 @@ public int getPendingNodeCount() {
   public Queue<DatanodeDescriptor> getPendingNodes() {
     return pendingNodes;
   }
+
+  /**
+   * If node "is dead while in Decommission In Progress", it cannot be decommissioned
+   * until it becomes healthy again. If there are more pendingNodes than can be tracked
+   * & some unhealthy tracked nodes, then re-queue the unhealthy tracked nodes
+   * to avoid blocking decommissioning of healthy nodes.
+   *
+   * @param unhealthyDns The unhealthy datanodes which may be re-queued
+   * @param numDecommissioningNodes The total number of nodes being decommissioned
+   * @return List of unhealthy nodes to be re-queued
+   */
+  List<DatanodeDescriptor> identifyUnhealthyNodesToRequeue(
+      final List<DatanodeDescriptor> unhealthyDns, int numDecommissioningNodes) {

Review comment:
       modified the return type to Stream
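   
   As a rough sketch of the selection logic (not the merged Hadoop code), using plain Strings in place of DatanodeDescriptor and a hypothetical MAX_TRACKED constant standing in for the configured limit, a Stream-returning helper might look like this:
   
   ```
   import java.util.List;
   import java.util.stream.Collectors;
   import java.util.stream.Stream;
   
   public class RequeueSketch {
     // Stands in for dfs.namenode.decommission.max.concurrent.tracked.nodes
     static final int MAX_TRACKED = 10;
   
     // Re-queue unhealthy nodes only when more nodes are decommissioning than can be
     // tracked, and only as many as needed to free slots for queued (healthy) nodes.
     static Stream<String> identifyUnhealthyNodesToRequeue(
         List<String> unhealthyDns, int numDecommissioningNodes) {
       if (numDecommissioningNodes <= MAX_TRACKED || unhealthyDns.isEmpty()) {
         return Stream.empty();
       }
       int numToRequeue = Math.min(unhealthyDns.size(),
           numDecommissioningNodes - MAX_TRACKED);
       return unhealthyDns.stream().limit(numToRequeue);
     }
   
     public static void main(String[] args) {
       // 12 nodes decommissioning, 10 tracked: 2 of the unhealthy nodes are re-queued
       System.out.println(identifyUnhealthyNodesToRequeue(
           List.of("dn-1", "dn-2", "dn-3"), 12).collect(Collectors.toList()));
     }
   }
   ```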

##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminDefaultMonitor.java
##########
@@ -270,12 +274,34 @@ private void check() {
         // an invalid state.
         LOG.warn("DatanodeAdminMonitor caught exception when processing node "
             + "{}.", dn, e);
-        pendingNodes.add(dn);
-        toRemove.add(dn);
+        toRequeue.add(dn);

Review comment:
       I have modified the code to remove the toRequeue variable

##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java
##########
@@ -35,12 +38,21 @@
 public abstract class DatanodeAdminMonitorBase
     implements DatanodeAdminMonitorInterface, Configurable {
 
+  /**
+   * Sort by lastUpdate time descending order, such that unhealthy
+   * nodes are de-prioritized given they cannot be decommissioned.
+   */
+  public static final Comparator<DatanodeDescriptor> PENDING_NODES_QUEUE_COMPARATOR =
+      (dn1, dn2) -> Long.compare(dn2.getLastUpdate(), dn1.getLastUpdate());
+
   protected BlockManager blockManager;
   protected Namesystem namesystem;
   protected DatanodeAdminManager dnAdmin;
   protected Configuration conf;
 
-  protected final Queue<DatanodeDescriptor> pendingNodes = new ArrayDeque<>();
+  private final PriorityQueue<DatanodeDescriptor> pendingNodes = new PriorityQueue<>(
+      DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_MAX_CONCURRENT_TRACKED_NODES_DEFAULT,

Review comment:
       I have changed the constructor to use the default initial capacity of 11: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PriorityQueue.html#%3Cinit%3E(int)
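   
   For illustration only (not the Hadoop source), a standalone snippet showing a PriorityQueue built with just a comparator, which falls back to the default initial capacity of 11, and how stale lastUpdate values sort to the back of the queue:
   
   ```
   import java.util.Comparator;
   import java.util.PriorityQueue;
   
   public class PendingQueueSketch {
     public static void main(String[] args) {
       // Descending lastUpdate: nodes with recent heartbeats are polled first
       Comparator<Long> byLastUpdateDesc = (a, b) -> Long.compare(b, a);
   
       // Comparator-only constructor uses the default initial capacity (11)
       PriorityQueue<Long> pendingNodes = new PriorityQueue<>(byLastUpdateDesc);
       pendingNodes.add(1_000L);   // stale heartbeat (likely a dead node)
       pendingNodes.add(90_000L);  // recent heartbeat (healthy node)
       pendingNodes.add(50_000L);
   
       while (!pendingNodes.isEmpty()) {
         System.out.println(pendingNodes.poll()); // 90000, 50000, 1000
       }
     }
   }
   ```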






[GitHub] [hadoop] KevinWikant commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-972499317


   It seems the unit tests failed on an unrelated error:
   
   ```
   [ERROR] Failures: 
   [ERROR] org.apache.hadoop.hdfs.TestRollingUpgrade.testRollback(org.apache.hadoop.hdfs.TestRollingUpgrade)
   [ERROR]   Run 1: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 expected null, but was:<javax.management.openmbean.CompositeDataSupport(compositeType=javax.management.openmbean.CompositeType(name=org.apache.hadoop.hdfs.protocol.RollingUpgradeInfo$Bean,items=((itemName=blockPoolId,itemType=javax.management.openmbean.SimpleType(name=java.lang.String)),(itemName=createdRollbackImages,itemType=javax.management.openmbean.SimpleType(name=java.lang.Boolean)),(itemName=finalizeTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=startTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)))),contents={blockPoolId=BP-1039609189-172.17.0.2-1637204447583, createdRollbackImages=true, finalizeTime=0, startTime=1637204448659})>
   [ERROR]   Run 2: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 expected null, but was:<javax.management.openmbean.CompositeDataSupport(compositeType=javax.management.openmbean.CompositeType(name=org.apache.hadoop.hdfs.protocol.RollingUpgradeInfo$Bean,items=((itemName=blockPoolId,itemType=javax.management.openmbean.SimpleType(name=java.lang.String)),(itemName=createdRollbackImages,itemType=javax.management.openmbean.SimpleType(name=java.lang.Boolean)),(itemName=finalizeTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=startTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)))),contents={blockPoolId=BP-1039609189-172.17.0.2-1637204447583, createdRollbackImages=true, finalizeTime=0, startTime=1637204448659})>
   [ERROR]   Run 3: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 expected null, but was:<javax.management.openmbean.CompositeDataSupport(compositeType=javax.management.openmbean.CompositeType(name=org.apache.hadoop.hdfs.protocol.RollingUpgradeInfo$Bean,items=((itemName=blockPoolId,itemType=javax.management.openmbean.SimpleType(name=java.lang.String)),(itemName=createdRollbackImages,itemType=javax.management.openmbean.SimpleType(name=java.lang.Boolean)),(itemName=finalizeTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=startTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)))),contents={blockPoolId=BP-1039609189-172.17.0.2-1637204447583, createdRollbackImages=true, finalizeTime=0, startTime=1637204448659})>
   [INFO] 
   ```




[GitHub] [hadoop] KevinWikant commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-998300466


   > hence if few nodes are really in bad state (hardware/network issues), the plan is to keep re-queueing them until more nodes are getting decommissioned than max tracked nodes right?
   
   It's the opposite: the unhealthy nodes will only be re-queued when there are more nodes being decommissioned than max tracked nodes. Otherwise, if there are fewer nodes being decommissioned than max tracked nodes, then the unhealthy nodes will not be re-queued because they do not risk blocking the decommissioning of queued healthy nodes (i.e. because the queue is empty).
   
   One potential performance impact that does come to mind is that if there are, say, 200 unhealthy decommissioning nodes & max tracked nodes = 100, then this may cause some churn in the queueing/de-queueing process, because on each DatanodeAdminMonitor tick all 100 tracked nodes will be re-queued & then 100 queued nodes will be de-queued/tracked. Note that this churn (and any associated performance impact) will only take effect when:
   - there are more nodes being decommissioned than max tracked nodes
   - AND either:
       - number of healthy decommissioning nodes < max tracked nodes
       - number of unhealthy decommissioning nodes > max tracked nodes
   
   The amount of re-queued/de-queued nodes per tick can be quantified as:
   
   > numRequeue = numDecommissioning <= numTracked ? 0 : numDeadDecommissioning - (numDecommissioning - numTracked)
   
   This churn of queueing/de-queueing will not occur at all under typical decommissioning scenarios (i.e. where there isn't a large number of dead decommissioning nodes).
   
   One idea to mitigate this is to have DatanodeAdminMonitor maintain counters tracking the number of healthy nodes in the pendingNodes queue; then these counts could be used to make an improved re-queue decision. In particular, unhealthy nodes would only be re-queued if there are healthy nodes in the pendingNodes queue. But this approach has some flaws; for example, an unhealthy node in the queue could come alive again, but an unhealthy node in the tracked set wouldn't be re-queued to make space for it because it's still counted as an unhealthy node. To solve this, we would need to periodically scan the pendingNodes queue to update the healthy/unhealthy node counts, and this scan could prove expensive.
   
   > Since unhealthy node getting decommissioned might anyways require some sort of retry, shall we requeue them even if the condition is not met (i.e. total no of decomm in progress < max tracked nodes) as a limited retries?
   
   If there are fewer nodes being decommissioned than max tracked nodes, then there are no nodes in the pendingNodes queue & all nodes are being tracked for decommissioning. Therefore, there is no possibility that any healthy nodes are blocked in the pendingNodes queue (preventing them from being decommissioned) & so in my opinion there is no benefit to re-queueing the unhealthy nodes in this case. Furthermore, this will negatively impact performance through frequent re-queueing & de-queueing.




[GitHub] [hadoop] hadoop-yetus commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
hadoop-yetus commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-996289833


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 50s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 4 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  37m 57s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 49s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   1m 34s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m 12s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 41s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 11s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 41s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 59s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  28m 57s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 43s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 34s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |   1m 34s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 29s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |   1m 29s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | -0 :warning: |  checkstyle  |   1m  6s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/3/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 45 unchanged - 1 fixed = 46 total (was 46)  |
   | +1 :green_heart: |  mvnsite  |   1m 38s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m  5s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 39s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   4m 18s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  31m  1s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 171m 31s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch passed.  |
   | +0 :ok: |  asflicense  |   0m 31s |  |  ASF License check generated no output?  |
   |  |   | 294m 41s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.server.namenode.ha.TestStandbyIsHot |
   |   | hadoop.hdfs.TestFileChecksum |
   |   | hadoop.hdfs.server.namenode.ha.TestFailureToReadEdits |
   |   | hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA |
   |   | hadoop.hdfs.server.namenode.ha.TestDNFencing |
   |   | hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints |
   |   | hadoop.hdfs.server.namenode.ha.TestEditLogTailer |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/3/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3675 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux dde5cf273f65 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 43f1c01823f41ab015a75737b3304c282f70dabc |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/3/testReport/ |
   | Max. process+thread count | 2840 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/3/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




[GitHub] [hadoop] KevinWikant edited a comment on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant edited a comment on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-980107774


   > If a DN goes dead and then re-registers, it will be added back into the pending nodes, so I don't think we need to continue to track it in the decommission monitor when it goes dead. We can just handle the dead event in the decommission monitor and stop tracking it, clearing a slot for another node. Then it will be re-added if it comes back by existing logic above.
   
   @sodonnel After going through the re-implementation of the logic, I think I found a caveat worth mentioning
   
   Note the implementation of "isNodeHealthyForDecommissionOrMaintenance":
   - https://github.com/apache/hadoop/blob/dc751df63b4ab2c9c26a1efe7479c31fd1de80d5/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L4546
   - if HDFS has no lowRedundancy or pendingReconstruction blocks then all datanodes (including dead datanodes) are considered "healthy" for the purpose of being DECOMMISSIONED
   
   Here's the happy case:
   - a dead datanode is removed from DatanodeAdminManager (i.e. removed from both outOfServiceNodeBlocks & pendingNodes) with AdminState=DECOMMISSION_INPROGRESS
   - a dead datanode comes back alive, gets re-added to DatanodeAdminManager, & transitions to AdminState=DECOMMISSIONED
   
   Here's a less happy case:
   - a dead datanode is removed from DatanodeAdminManager with AdminState=DECOMMISSION_INPROGRESS
   - the dead datanode never comes back alive & therefore remains with AdminState=DECOMMISSION_INPROGRESS forever
   
   The problem is that dead datanodes can eventually be DECOMMISSIONED if the namenode eliminates all lowRedundancy blocks (because of the "isNodeHealthyForDecommissionOrMaintenance" logic), but by removing the dead datanode from the DatanodeAdminManager entirely (until it comes back alive) we prevent the dead datanode from transitioning to DECOMMISSIONED even though it could have.

   So the impact is that a dead datanode which never comes back alive again will never transition from DECOMMISSION_INPROGRESS to DECOMMISSIONED, even though it could have.
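   
   To make the state transition concrete, here is a simplified paraphrase of the health check described above (illustrative only, not the actual BlockManager code):
   
   ```
   public class DecommissionHealthSketch {
     // A dead datanode can still be treated as "healthy" for the purpose of
     // completing decommission once no blocks remain to be reconstructed.
     static boolean healthyForDecommission(boolean nodeAlive,
         long lowRedundancyBlocks, long pendingReconstructionBlocks) {
       if (nodeAlive) {
         return true;
       }
       return lowRedundancyBlocks == 0 && pendingReconstructionBlocks == 0;
     }
   
     public static void main(String[] args) {
       // Dead node with outstanding low-redundancy blocks: stays DECOMMISSION_INPROGRESS
       System.out.println(healthyForDecommission(false, 5, 0));  // false
       // Dead node once every block is fully replicated: could become DECOMMISSIONED
       System.out.println(healthyForDecommission(false, 0, 0));  // true
     }
   }
   ```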




[GitHub] [hadoop] aajisaka commented on a change in pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
aajisaka commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r770274630



##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java
##########
@@ -189,6 +190,30 @@ public void run() {
          * node will be removed from tracking by the pending cancel.
          */
         processCancelledNodes();
+
+        // Having more nodes decommissioning than can be tracked will impact decommissioning
+        // performance due to queueing delay
+        int numTrackedNodes = outOfServiceNodeBlocks.size();
+        int numQueuedNodes = getPendingNodes().size();
+        int numDecommissioningNodes = numTrackedNodes + numQueuedNodes;
+        if (numDecommissioningNodes > maxConcurrentTrackedNodes) {
+          LOG.warn(
+              "There are {} nodes decommissioning but only {} nodes will be tracked at a time. "
+                  + "{} nodes are currently queued waiting to be decommissioned.",
+              numDecommissioningNodes, maxConcurrentTrackedNodes, numQueuedNodes);
+
+          // Re-queue unhealthy nodes to make space for decommissioning healthy nodes
+          final List<DatanodeDescriptor> unhealthyDns = outOfServiceNodeBlocks.keySet().stream()
+              .filter(dn -> !blockManager.isNodeHealthyForDecommissionOrMaintenance(dn))
+              .collect(Collectors.toList());
+          final List<DatanodeDescriptor> toRequeue =
+              identifyUnhealthyNodesToRequeue(unhealthyDns, numDecommissioningNodes);
+          for (DatanodeDescriptor dn : toRequeue) {
+            getPendingNodes().add(dn);
+            outOfServiceNodeBlocks.remove(dn);

Review comment:
       Agreed. Thank you for your update.






[GitHub] [hadoop] hadoop-yetus commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
hadoop-yetus commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-997048879


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 39s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 4 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  32m 15s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 29s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   1m 18s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m  2s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 30s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  2s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 30s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 26s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  23m  8s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 23s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |   1m 23s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 12s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |   1m 13s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 52s |  |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 45 unchanged - 1 fixed = 45 total (was 46)  |
   | +1 :green_heart: |  mvnsite  |   1m 17s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 53s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 24s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 26s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m 36s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 225m 34s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 46s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 325m 16s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.TestRollingUpgrade |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/4/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3675 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 5b3c8d68d5ea 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 44cd931cef915a8ecc2e22a6b905355fcdbe8d41 |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/4/testReport/ |
   | Max. process+thread count | 3527 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/4/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




[GitHub] [hadoop] KevinWikant commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-1000322118


   Thanks @aajisaka & @virajjasani for the PR approvals! Are we good to merge now?




[GitHub] [hadoop] aajisaka commented on a change in pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
aajisaka commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r752955215



##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
##########
@@ -1654,4 +1658,139 @@ public Boolean get() {
 
     cleanupFile(fileSys, file);
   }
+
+  /**
+   * Test DatanodeAdminManager logic to re-queue unhealthy decommissioning nodes
+   * which are blocking the decommissioning of healthy nodes.
+   * Force the tracked nodes set to be filled with nodes lost while decommissioning,
+   * then decommission healthy nodes & validate they are decommissioned eventually.
+   */
+  @Test(timeout = 120000)
+  public void testRequeueUnhealthyDecommissioningNodes() throws Exception {
+    // Allow 3 datanodes to be decommissioned at a time
+    getConf().setInt(DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_MAX_CONCURRENT_TRACKED_NODES, 3);
+    // Disable the normal monitor runs
+    getConf()
+        .setInt(MiniDFSCluster.DFS_NAMENODE_DECOMMISSION_INTERVAL_TESTING_KEY, Integer.MAX_VALUE);
+
+    // Start cluster with 6 datanodes
+    startCluster(1, 6);
+    final FSNamesystem namesystem = getCluster().getNamesystem();
+    final BlockManager blockManager = namesystem.getBlockManager();
+    final DatanodeManager datanodeManager = blockManager.getDatanodeManager();
+    final DatanodeAdminManager decomManager = datanodeManager.getDatanodeAdminManager();
+    assertEquals(6, getCluster().getDataNodes().size());
+
+    // 3 datanodes will be "live" datanodes that are expected to be decommissioned eventually
+    final List<DatanodeDescriptor> liveNodes = getCluster().getDataNodes().subList(3, 6).stream()
+        .map(dn -> getDatanodeDesriptor(namesystem, dn.getDatanodeUuid()))
+        .collect(Collectors.toList());
+    assertEquals(3, liveNodes.size());
+
+    // 3 datanodes will be "dead" datanodes that are expected to never be decommissioned
+    final List<DatanodeDescriptor> deadNodes = getCluster().getDataNodes().subList(0, 3).stream()
+        .map(dn -> getDatanodeDesriptor(namesystem, dn.getDatanodeUuid()))
+        .collect(Collectors.toList());
+    assertEquals(3, deadNodes.size());
+
+    // Need to create some data or "isNodeHealthyForDecommissionOrMaintenance"
+    // may unexpectedly return true for a dead node
+    writeFile(getCluster().getFileSystem(), new Path("/tmp/test1"), 1, 100);
+
+    // Cause the 3 "dead" nodes to be lost while in state decommissioning
+    // and fill the tracked nodes set with those 3 "dead" nodes
+    ArrayList<DatanodeInfo> decommissionedNodes = Lists.newArrayList();
+    int expectedNumTracked = 0;
+    for (final DatanodeDescriptor deadNode : deadNodes) {
+      assertEquals(0, decomManager.getNumPendingNodes());
+      assertEquals(expectedNumTracked, decomManager.getNumTrackedNodes());
+
+      // Start decommissioning the node
+      takeNodeOutofService(0, deadNode.getDatanodeUuid(), 0, decommissionedNodes,
+          AdminStates.DECOMMISSION_INPROGRESS);
+      decommissionedNodes.add(deadNode);
+
+      // Stop the datanode so that it is lost while decommissioning
+      getCluster().stopDataNode(deadNode.getXferAddr());
+      deadNode.setLastUpdate(213); // Set last heartbeat to be in the past
+
+      // Wait for the decommissioning datanode to become unhealthy
+      // & to be added to "pendingNodes"
+      final int finalExpectedNumTracked = expectedNumTracked;
+      GenericTestUtils.waitFor(() ->
+          !BlockManagerTestUtil.isNodeHealthyForDecommissionOrMaintenance(blockManager, deadNode)
+              && decomManager.getNumTrackedNodes() == finalExpectedNumTracked
+              && decomManager.getNumPendingNodes() == 1, 500, 30000);
+
+      // Run DatanodeAdminManager.Monitor & validate the node becomes tracked
+      BlockManagerTestUtil.recheckDecommissionState(datanodeManager);
+      expectedNumTracked++;
+      assertEquals(0, decomManager.getNumPendingNodes());
+      assertEquals(expectedNumTracked, decomManager.getNumTrackedNodes());
+    }
+
+    // Validate the 3 "dead" nodes are not removed from the tracked nodes set
+    // after several seconds of operation
+    final Duration checkDuration = Duration.ofSeconds(5);
+    Instant checkUntil = Instant.now().plus(checkDuration);
+    while (Instant.now().isBefore(checkUntil)) {
+      BlockManagerTestUtil.recheckDecommissionState(datanodeManager);
+      assertEquals(0, decomManager.getNumPendingNodes());
+      assertEquals(3, decomManager.getNumTrackedNodes());
+      Thread.sleep(500);
+    }
+
+    // Validate that the "live" nodes are considered healthy
+    GenericTestUtils.waitFor(() -> {
+      try {
+        return liveNodes.stream().allMatch(
+            dn -> BlockManagerTestUtil.isNodeHealthyForDecommissionOrMaintenance(blockManager, dn));
+      } catch (Exception e) {
+        LOG.warn("Exception running isNodeHealthyForDecommissionOrMaintenance", e);
+        return false;
+      }
+    }, 500, 30000);
+
+    // Start decommissioning 2 "live" datanodes
+    for (final DatanodeDescriptor liveNode : liveNodes.subList(0, 2)) {
+      takeNodeOutofService(0, liveNode.getDatanodeUuid(), 0, decommissionedNodes,
+          AdminStates.DECOMMISSION_INPROGRESS);
+      decommissionedNodes.add(liveNode);
+    }
+
+    // Validate that the live datanodes are put into the pending decommissioning queue
+    GenericTestUtils.waitFor(
+        () -> decomManager.getNumTrackedNodes() == 3 && decomManager.getNumPendingNodes() == 2, 500,
+        30000);
+
+    // Validate that 2 of the "dead" nodes are re-queued.
+    // Need to comment out this check because DatanodeAdminBackoffMonitor
+    // manages datanodes with different logic than DatanodeAdminDefaultMonitor.
+    /*
+    BlockManagerTestUtil.recheckDecommissionState(datanodeManager);
+    assertEquals(4, decomManager.getNumPendingNodes());
+    assertEquals(1, decomManager.getNumTrackedNodes());
+    */

Review comment:
       Why comment out here? Are the assertions flaky or not implemented?

##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java
##########
@@ -35,12 +38,21 @@
 public abstract class DatanodeAdminMonitorBase
     implements DatanodeAdminMonitorInterface, Configurable {
 
+  /**
+   * Sort by lastUpdate time descending order, such that unhealthy
+   * nodes are de-prioritized given they cannot be decommissioned.
+   */
+  public static final Comparator<DatanodeDescriptor> PENDING_NODES_QUEUE_COMPARATOR =
+      (dn1, dn2) -> Long.compare(dn2.getLastUpdate(), dn1.getLastUpdate());
+
   protected BlockManager blockManager;
   protected Namesystem namesystem;
   protected DatanodeAdminManager dnAdmin;
   protected Configuration conf;
 
-  protected final Queue<DatanodeDescriptor> pendingNodes = new ArrayDeque<>();
+  private final PriorityQueue<DatanodeDescriptor> pendingNodes = new PriorityQueue<>(

Review comment:
       This change increases the time complexity of adding an element from O(1) to O(log N); however, N is bounded by the number of DNs decommissioning (i.e. at most several hundred even in a large cluster). Therefore I think the performance effect is not critical.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org


[GitHub] [hadoop] KevinWikant edited a comment on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant edited a comment on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-998300466


   > hence if few nodes are really in bad state (hardware/network issues), the plan is to keep re-queueing them until more nodes are getting decommissioned than max tracked nodes right?
   
   It's the opposite: the unhealthy nodes will only be re-queued when there are more nodes being decommissioned than max tracked nodes. Otherwise, if there are fewer nodes being decommissioned than max tracked nodes, then the unhealthy nodes will not be re-queued because they do not risk blocking the decommissioning of queued healthy nodes (i.e. because the queue is empty).
   
   One potential performance impact that comes to mind is that if there are, say, 200 unhealthy decommissioning nodes & max tracked nodes = 100, then this may cause some churn in the queueing/de-queueing process, because on each DatanodeAdminMonitor tick all 100 tracked nodes will be re-queued & then 100 queued nodes will be de-queued/tracked. Note that this churn (and any associated performance impact) will only take effect when:
   - there are more nodes being decommissioned than max tracked nodes
   - AND either:
       - number of healthy decommissioning nodes < max tracked nodes
       - number of unhealthy decommissioning nodes > max tracked nodes
   
   The amount of re-queued/de-queued nodes per tick can be quantified as:
   
   `numRequeue = numDecommissioning <= numTracked ? 0 : numDeadDecommissioning - (numDecommissioning - numTracked)`
   
   This churn of queueing/de-queueing will not occur at all under typical decommissioning scenarios (i.e. where there isn't a large number of dead decommissioning nodes).
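   
   As a standalone worked example of the formula above (illustrative Java, not Hadoop code), clamping at zero when the dead nodes all fit in the pending queue:
   
   ```
   public class RequeueChurnSketch {
     static int numRequeue(int numDecommissioning, int maxTracked, int numDeadDecommissioning) {
       if (numDecommissioning <= maxTracked) {
         return 0;
       }
       // Dead nodes that fit in the pending queue sort to the back and never get
       // tracked, so only the excess dead nodes cause per-tick re-queue churn.
       return Math.max(0, numDeadDecommissioning - (numDecommissioning - maxTracked));
     }
   
     public static void main(String[] args) {
       // 200 dead decommissioning nodes, max tracked = 100: worst case, every
       // tracked node is re-queued on each tick
       System.out.println(numRequeue(200, 100, 200)); // 100
       // Typical case: only a few dead nodes, so they stay queued and cause no churn
       System.out.println(numRequeue(150, 100, 5));   // 0
     }
   }
   ```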
   
   One idea to mitigate this is to have DatanodeAdminMonitor maintain counters tracking the number of healthy nodes in the pendingNodes queue; then this count can be used to make an improved re-queue decision. In particular, unhealthy nodes would only be re-queued if there are healthy nodes in the pendingNodes queue. But this approach has some flaws; for example, an unhealthy node in the queue could come alive again, but an unhealthy node in the tracked set wouldn't be re-queued because the healthy queued node count hasn't been updated. To solve this, we would need to periodically scan the pendingNodes queue to update the healthy/unhealthy node counts, and this scan could prove expensive.
   
   > Since unhealthy node getting decommissioned might anyways require some sort of retry, shall we requeue them even if the condition is not met (i.e. total no of decomm in progress < max tracked nodes) as a limited retries?
   
   If there are fewer nodes being decommissioned than max tracked nodes, then there are no nodes in the pendingNodes queue & all nodes are being tracked for decommissioning. Therefore, there is no possibility that any healthy nodes are blocked in the pendingNodes queue (preventing them from being decommissioned) & so in my opinion there is no benefit to re-queueing the unhealthy nodes in this case. Furthermore, this will negatively impact performance through frequent re-queueing & de-queueing.




[GitHub] [hadoop] sodonnel commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
sodonnel commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-975785577


   Ah yes, the BackoffMonitor calls the method below, which is indeed limited by max tracked nodes:
   
   ```
   private void processPendingNodes() {
       while (!pendingNodes.isEmpty() &&
           (maxConcurrentTrackedNodes == 0 ||
               outOfServiceNodeBlocks.size() < maxConcurrentTrackedNodes)) {
         outOfServiceNodeBlocks.put(pendingNodes.poll(), null);
       }
     }
   ```




[GitHub] [hadoop] KevinWikant commented on a change in pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r771734015



##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
##########
@@ -1654,4 +1658,204 @@ public Boolean get() {
 
     cleanupFile(fileSys, file);
   }
+
+  /**
+   * Test DatanodeAdminManager logic to re-queue unhealthy decommissioning nodes
+   * which are blocking the decommissioning of healthy nodes.
+   * Force the tracked nodes set to be filled with nodes lost while decommissioning,
+   * then decommission healthy nodes & validate they are decommissioned eventually.
+   */
+  @Test(timeout = 120000)
+  public void testRequeueUnhealthyDecommissioningNodes() throws Exception {
+    // Create a MiniDFSCluster with 3 live datanode in AdminState=NORMAL and
+    // 2 dead datanodes in AdminState=DECOMMISSION_INPROGRESS and a file
+    // with replication factor of 5.
+    final int numLiveNodes = 3;
+    final int numDeadNodes = 2;
+    final int numNodes = numLiveNodes + numDeadNodes;
+    final List<DatanodeDescriptor> liveNodes = new ArrayList<>();
+    final Map<DatanodeDescriptor, MiniDFSCluster.DataNodeProperties> deadNodeProps =
+        new HashMap<>();
+    final ArrayList<DatanodeInfo> decommissionedNodes = new ArrayList<>();
+    final Path filePath = new Path("/tmp/test");
+    createClusterWithDeadNodesDecommissionInProgress(numLiveNodes, liveNodes, numDeadNodes,
+        deadNodeProps, decommissionedNodes, filePath);
+    final FSNamesystem namesystem = getCluster().getNamesystem();
+    final BlockManager blockManager = namesystem.getBlockManager();
+    final DatanodeManager datanodeManager = blockManager.getDatanodeManager();
+    final DatanodeAdminManager decomManager = datanodeManager.getDatanodeAdminManager();
+
+    // Validate the 2 "dead" nodes are not removed from the tracked nodes set
+    // after several seconds of operation
+    final Duration checkDuration = Duration.ofSeconds(5);
+    Instant checkUntil = Instant.now().plus(checkDuration);
+    while (Instant.now().isBefore(checkUntil)) {
+      BlockManagerTestUtil.recheckDecommissionState(datanodeManager);
+      assertEquals(0, decomManager.getNumPendingNodes());
+      assertEquals(numDeadNodes, decomManager.getNumTrackedNodes());
+      assertTrue(deadNodeProps.keySet().stream()
+          .allMatch(node -> node.getAdminState().equals(AdminStates.DECOMMISSION_INPROGRESS)));
+      Thread.sleep(500);
+    }
+
+    // Delete the file such that its no longer a factor blocking decommissioning of live nodes
+    // which have block replicas for that file
+    getCluster().getFileSystem().delete(filePath, true);
+
+    // Start decommissioning 2 "live" datanodes
+    int numLiveDecommNodes = 2;
+    final List<DatanodeDescriptor> liveDecommNodes = liveNodes.subList(0, numLiveDecommNodes);
+    for (final DatanodeDescriptor liveNode : liveDecommNodes) {
+      takeNodeOutofService(0, liveNode.getDatanodeUuid(), 0, decommissionedNodes,
+          AdminStates.DECOMMISSION_INPROGRESS);
+      decommissionedNodes.add(liveNode);
+    }
+
+    // Write a new file such that there are under-replicated blocks preventing decommissioning
+    // of dead nodes
+    writeFile(getCluster().getFileSystem(), filePath, numNodes, 10);
+
+    // Validate that the live datanodes are put into the pending decommissioning queue
+    GenericTestUtils.waitFor(() -> decomManager.getNumTrackedNodes() == numDeadNodes
+            && decomManager.getNumPendingNodes() == numLiveDecommNodes
+            && liveDecommNodes.stream().allMatch(
+                node -> node.getAdminState().equals(AdminStates.DECOMMISSION_INPROGRESS)),
+        500, 30000);
+    for (final DatanodeDescriptor node : decomManager.getPendingNodes()) {
+      assertTrue(liveDecommNodes.contains(node));
+    }

Review comment:
       that API seems a lot cleaner, thanks! I have made the change






[GitHub] [hadoop] KevinWikant commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-997078100


   Unit testing failed due to unrelated flaky tests
   
   > [ERROR] Errors: 
   > [ERROR] org.apache.hadoop.hdfs.TestRollingUpgrade.testRollback(org.apache.hadoop.hdfs.TestRollingUpgrade)
   > [ERROR]   Run 1: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
   > [ERROR]   Run 2: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
   > [ERROR]   Run 3: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
   > [INFO] 
   > [WARNING] Flakes: 
   > [WARNING] org.apache.hadoop.hdfs.TestRollingUpgrade.testCheckpoint(org.apache.hadoop.hdfs.TestRollingUpgrade)
   > [ERROR]   Run 1: TestRollingUpgrade.testCheckpoint:599->testCheckpoint:686 Test resulted in an unexpected exit
   > [INFO]   Run 2: PASS




[GitHub] [hadoop] KevinWikant edited a comment on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant edited a comment on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-983034760


   > DECOMMISSION_IN_PROGRESS + DEAD is an error state that means decommission has effectively failed. There is a case where it can complete, but what does that really mean - if the node is dead, it has not been gracefully stopped.
   
   The case which I have described where dead node decommissioning completes can occur when:
   - a decommissioning node goes dead, but all of its blocks still have block replicas on other live nodes
   - the namenode is eventually able to satisfy the minimum replication of all blocks (by replicating the under-replicated blocks from the live nodes)
   - the dead decommissioning node is transitioned to decommissioned
   
   In this case, the node did go dead while decommissioning, but there were no LowRedundancy blocks thanks to redundant block replicas. From the user perspective, the loss of the decommissioning node did not impact the outcome of the decommissioning process. Had the node not gone dead while decommissioning, the eventual outcome would have been the same: the node is decommissioned & there are no LowRedundancy blocks.
   
   If there are LowRedundancy blocks then a dead datanode will remain decommissioning, because if the dead node were to come alive again it may be able to recover the LowRedundancy blocks. But if there are no LowRedundancy blocks, then when the node comes alive again it will immediately transition to decommissioned anyway, so why not mark it decommissioned while it's still dead?
   
   Also, I don't think the priority queue adds much complexity: it just puts healthy nodes (with more recent heartbeat times) ahead of unhealthy nodes (with older heartbeat times), so that healthy nodes are decommissioned first.
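   
   For illustration, here is a minimal standalone example of that ordering, using a toy Node class instead of the real DatanodeDescriptor (a sketch only, assuming lastUpdate holds the last heartbeat time):
   
   ```
   import java.util.PriorityQueue;
   
   public class PendingOrderDemo {
     static class Node {
       final String name;
       final long lastUpdate; // last heartbeat time in ms
       Node(String name, long lastUpdate) { this.name = name; this.lastUpdate = lastUpdate; }
     }
   
     public static void main(String[] args) {
       // Order by lastUpdate descending: the most recently heartbeating node is polled first.
       PriorityQueue<Node> pending = new PriorityQueue<>(
           (n1, n2) -> Long.compare(n2.lastUpdate, n1.lastUpdate));
       pending.add(new Node("deadNode", 1_000L));    // stale heartbeat
       pending.add(new Node("healthyNode", 9_000L)); // recent heartbeat
       System.out.println(pending.poll().name); // prints "healthyNode"
       System.out.println(pending.poll().name); // prints "deadNode"
     }
   }
   ```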
   
   ----
   
   I also want to call out another caveat with the approach of removing the node from the DatanodeAdminManager which I  uncovered while unit testing
   
   If we leave the node in DECOMMISSION_IN_PROGRESS & remove the node from DatanodeAdminManager, then the following callstack should re-add the datanode to the DatanodeAdminManager when it comes alive again:
   - [DatanodeManager.registerDatanode](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1223)
   - [DatanodeManager.startAdminOperationIfNecessary](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1109)
   - [DatanodeAdminManager.startDecommission](https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L187)
   - [DatanodeAdminMonitorBase.startTrackingNode](https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java#L136)
   
   The problem is this condition "!node.isDecommissionInProgress()": https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L177
   
   Because the dead datanode is left in DECOMMISSION_INPROGRESS, "startTrackingNode" is not invoked because of the "!node.isDecommissionInProgress()" condition
   
   Simply removing the condition "!node.isDecommissionInProgress()" will not function well because "startTrackingNode" is not idempotent:
   - [startDecommission is invoked periodically when refreshDatanodes is called](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1339)
   - [pendingNodes is an ArrayDequeue which does not deduplicate the DatanodeDescriptor](https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java#L43)
   - therefore, removing the "!node.isDecommissionInProgress()" check will cause a large number of duplicate DatanodeDescriptor to be added to DatanodeAdminManager
   
   I can think of 3 obvious ways to handle this:
   A) make calls to "startTrackingNode" idempotent (meaning that if the DatanodeDescriptor is already tracked, it does not get added to the DatanodeAdminManager)
   B) modify startDecommission such that its aware of if the invocation is for a datanode which was just restarted after being dead such that it can still invoke "startTrackingNode" even though "node.isDecommissionInProgress()"
   C) add a flag to DatanodeDescriptor which indicates if the datanode is being tracked within DatanodeAdminManager, then check this flag in startDecommission
   
   For "A)", the challenge is that we need to ensure the DatanodeDescriptor is not in "pendingReplication" or "outOfServiceBlocks" which could be a fairly costly call to execute repeatedly. Also, I am not even sure such a check is thread-safe given there is no locking used as part of "startDecommission" or "startTrackingNode"
   
   For "B)", the awareness of if a registerDatanode call is related to a restarted datanode is available [here](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1177). So this information would need to be passed down the callstack to a method "startDecommission(DatanodeDescriptor node, boolean isNodeRestart)". Because of the modified method signature, all the other invocations of "startDecommission" will need to specify isNodeRestart=false
   
   Given this additional hurdle in the approach of removing a dead datanode from the DatanodeAdminManager, are we sure it will be less complex than the proposed change?
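   
   For concreteness, here is a rough sketch of what proposal "C)" could look like; the flag name and surrounding classes are hypothetical stand-ins, not the actual Hadoop code:
   
   ```
   import java.util.ArrayDeque;
   import java.util.Queue;
   
   public class StartDecommissionSketch {
     static class Node {
       boolean decommissionInProgress;
       boolean trackedByAdminManager; // hypothetical flag proposed in "C)"
     }
   
     private final Queue<Node> pendingNodes = new ArrayDeque<>();
   
     void startDecommission(Node node) {
       // Guard on the tracking flag instead of !node.decommissionInProgress,
       // so a restarted dead node can be re-added without creating duplicates.
       if (!node.trackedByAdminManager) {
         node.decommissionInProgress = true;
         node.trackedByAdminManager = true;
         pendingNodes.add(node); // startTrackingNode equivalent
       }
     }
   
     public static void main(String[] args) {
       StartDecommissionSketch manager = new StartDecommissionSketch();
       Node node = new Node();
       manager.startDecommission(node); // initial registration: queued once
       manager.startDecommission(node); // periodic refreshDatanodes call: no duplicate
       System.out.println(manager.pendingNodes.size()); // prints "1"
     }
   }
   ```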
   
   ----
   
   In short:
   - I don't think there is any downside in moving a dead datanode to decommissioned when there are no LowRedundancy blocks because this would happen immediately anyway were the node to come back alive (and get re-added to DatanodeAdminManager)
   - the approach of removing a dead datanode from the DatanodeAdminManager will not work properly without some significant refactoring of the "startDecommission" method & related code
   
   @sodonnel @virajjasani @aajisaka let me know your thoughts, I am still more in favor of tracking dead datanodes in DatanodeAdminManager (when there are LowRedundancy blocks), but if the community thinks it's better to remove the dead datanodes from DatanodeAdminManager I can implement proposal "C)"




[GitHub] [hadoop] aajisaka commented on a change in pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
aajisaka commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r770359233



##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
##########
@@ -1654,4 +1658,204 @@ public Boolean get() {
 
     cleanupFile(fileSys, file);
   }
+
+  /**
+   * Test DatanodeAdminManager logic to re-queue unhealthy decommissioning nodes
+   * which are blocking the decommissioning of healthy nodes.
+   * Force the tracked nodes set to be filled with nodes lost while decommissioning,
+   * then decommission healthy nodes & validate they are decommissioned eventually.
+   */
+  @Test(timeout = 120000)
+  public void testRequeueUnhealthyDecommissioningNodes() throws Exception {
+    // Create a MiniDFSCluster with 3 live datanode in AdminState=NORMAL and
+    // 2 dead datanodes in AdminState=DECOMMISSION_INPROGRESS and a file
+    // with replication factor of 5.
+    final int numLiveNodes = 3;
+    final int numDeadNodes = 2;
+    final int numNodes = numLiveNodes + numDeadNodes;
+    final List<DatanodeDescriptor> liveNodes = new ArrayList<>();
+    final Map<DatanodeDescriptor, MiniDFSCluster.DataNodeProperties> deadNodeProps =
+        new HashMap<>();
+    final ArrayList<DatanodeInfo> decommissionedNodes = new ArrayList<>();
+    final Path filePath = new Path("/tmp/test");
+    createClusterWithDeadNodesDecommissionInProgress(numLiveNodes, liveNodes, numDeadNodes,
+        deadNodeProps, decommissionedNodes, filePath);
+    final FSNamesystem namesystem = getCluster().getNamesystem();
+    final BlockManager blockManager = namesystem.getBlockManager();
+    final DatanodeManager datanodeManager = blockManager.getDatanodeManager();
+    final DatanodeAdminManager decomManager = datanodeManager.getDatanodeAdminManager();
+
+    // Validate the 2 "dead" nodes are not removed from the tracked nodes set
+    // after several seconds of operation
+    final Duration checkDuration = Duration.ofSeconds(5);
+    Instant checkUntil = Instant.now().plus(checkDuration);
+    while (Instant.now().isBefore(checkUntil)) {
+      BlockManagerTestUtil.recheckDecommissionState(datanodeManager);
+      assertEquals(0, decomManager.getNumPendingNodes());
+      assertEquals(numDeadNodes, decomManager.getNumTrackedNodes());
+      assertTrue(deadNodeProps.keySet().stream()
+          .allMatch(node -> node.getAdminState().equals(AdminStates.DECOMMISSION_INPROGRESS)));
+      Thread.sleep(500);
+    }
+
+    // Delete the file such that its no longer a factor blocking decommissioning of live nodes
+    // which have block replicas for that file
+    getCluster().getFileSystem().delete(filePath, true);
+
+    // Start decommissioning 2 "live" datanodes
+    int numLiveDecommNodes = 2;
+    final List<DatanodeDescriptor> liveDecommNodes = liveNodes.subList(0, numLiveDecommNodes);
+    for (final DatanodeDescriptor liveNode : liveDecommNodes) {
+      takeNodeOutofService(0, liveNode.getDatanodeUuid(), 0, decommissionedNodes,
+          AdminStates.DECOMMISSION_INPROGRESS);
+      decommissionedNodes.add(liveNode);
+    }
+
+    // Write a new file such that there are under-replicated blocks preventing decommissioning
+    // of dead nodes
+    writeFile(getCluster().getFileSystem(), filePath, numNodes, 10);
+
+    // Validate that the live datanodes are put into the pending decommissioning queue
+    GenericTestUtils.waitFor(() -> decomManager.getNumTrackedNodes() == numDeadNodes
+            && decomManager.getNumPendingNodes() == numLiveDecommNodes
+            && liveDecommNodes.stream().allMatch(
+                node -> node.getAdminState().equals(AdminStates.DECOMMISSION_INPROGRESS)),
+        500, 30000);
+    for (final DatanodeDescriptor node : decomManager.getPendingNodes()) {
+      assertTrue(liveDecommNodes.contains(node));
+    }

Review comment:
       AssertJ provides more fluent API and more detailed error message, and can be used if possible. In this case, this code can be simplified to
   ```
         assertThat(liveDecommNodes).containsAll(decomManager.getPendingNodes());
   ```






[GitHub] [hadoop] hadoop-yetus commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
hadoop-yetus commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-991435605


   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 41s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 4 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  31m  6s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 29s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   1m 20s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m  0s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 25s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  1s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 28s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m  9s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  22m 17s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 18s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |   1m 18s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 13s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |   1m 13s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 52s |  |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 45 unchanged - 1 fixed = 45 total (was 46)  |
   | +1 :green_heart: |  mvnsite  |   1m 18s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 51s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 22s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m  9s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m  4s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  | 226m 20s |  |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 47s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 323m 13s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/2/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3675 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 6f28e8ecb417 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 54ba1cb54ae6bd66c126412933ec859007edd273 |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/2/testReport/ |
   | Max. process+thread count | 3374 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/2/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




[GitHub] [hadoop] sodonnel commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
sodonnel commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-973916441


   This is quite a big change. I have a couple of thoughts.
   
   If a node goes dead while decommissioning, would it not be better to just remove it from the decommission monitor rather than keep tracking it there at all? If the node comes alive again, it should be entered back into the monitor.
   
   We could either detect it is dead in the monitor and remove it from tracking then, or have the place that logs the mentioned "node is dead while decommission in progress" remove it from the monitor.
   
   The DatanodeAdminBackoffMonitor is probably rarely used, if it is used at all, but I think it does not have a tracking limit at the moment. Perhaps it should have one; it was designed to run with less overhead than the default monitor, but perhaps if you decommissioned 100s of nodes at a time it would struggle, I am not sure.




[GitHub] [hadoop] KevinWikant commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-981982641


   @sodonnel let me know your thoughts, but I think the problem with removing a dead node from the DatanodeAdminManager until it comes back alive again is that it will never be decommissioned if it never comes alive again
   
   Do you see any major downside in keeping the dead nodes in the pendingNodes priority queue behind all the alive nodes? Because of the priority queue ordering, the dead nodes will not block decommissioning of alive nodes.
   
   




[GitHub] [hadoop] KevinWikant commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-985650485


   I would also add, that if you look at the implementation of the proposed alternative of removing a dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager: https://github.com/apache/hadoop/pull/3746/files
   
   It is not any less complex than this change, due to aforementioned caveats that need to be dealt with




[GitHub] [hadoop] KevinWikant commented on a change in pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r762076295



##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
##########
@@ -1654,4 +1658,139 @@ public Boolean get() {
 
     cleanupFile(fileSys, file);
   }
+
+  /**
+   * Test DatanodeAdminManager logic to re-queue unhealthy decommissioning nodes
+   * which are blocking the decommissioning of healthy nodes.
+   * Force the tracked nodes set to be filled with nodes lost while decommissioning,
+   * then decommission healthy nodes & validate they are decommissioned eventually.
+   */
+  @Test(timeout = 120000)
+  public void testRequeueUnhealthyDecommissioningNodes() throws Exception {
+    // Allow 3 datanodes to be decommissioned at a time
+    getConf().setInt(DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_MAX_CONCURRENT_TRACKED_NODES, 3);
+    // Disable the normal monitor runs
+    getConf()
+        .setInt(MiniDFSCluster.DFS_NAMENODE_DECOMMISSION_INTERVAL_TESTING_KEY, Integer.MAX_VALUE);
+
+    // Start cluster with 6 datanodes
+    startCluster(1, 6);

Review comment:
       can probably reduce the number of nodes in this test






[GitHub] [hadoop] virajjasani commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
virajjasani commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-982594432


   > the priority queue idea adds some more complexity to an already hard to follow process / code area and I wonder if it is better to just remove the node from the monitor and let it be dealt with manually, which may be required a lot of the time anyway?
   
   Having faced a similar situation, where a `DECOMMISSION_IN_PROGRESS + DEAD` state required manual intervention, I agree with the approach of removing the node, and also that this is already a complex process to follow, so introducing a new priority queue would complicate it even further.




[GitHub] [hadoop] KevinWikant edited a comment on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant edited a comment on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-972499317


   Seems the unit tests failed on an unrelated error:
   
   `
   [ERROR] Failures: 
   [ERROR] org.apache.hadoop.hdfs.TestRollingUpgrade.testRollback(org.apache.hadoop.hdfs.TestRollingUpgrade)
   [ERROR]   Run 1: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 expected null, but was:<javax.management.openmbean.CompositeDataSupport(compositeType=javax.management.openmbean.CompositeType(name=org.apache.hadoop.hdfs.protocol.RollingUpgradeInfo$Bean,items=((itemName=blockPoolId,itemType=javax.management.openmbean.SimpleType(name=java.lang.String)),(itemName=createdRollbackImages,itemType=javax.management.openmbean.SimpleType(name=java.lang.Boolean)),(itemName=finalizeTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=startTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)))),contents={blockPoolId=BP-1039609189-172.17.0.2-1637204447583, createdRollbackImages=true, finalizeTime=0, startTime=1637204448659})>
   [ERROR]   Run 2: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 expected null, but was:<javax.management.openmbean.CompositeDataSupport(compositeType=javax.management.openmbean.CompositeType(name=org.apache.hadoop.hdfs.protocol.RollingUpgradeInfo$Bean,items=((itemName=blockPoolId,itemType=javax.management.openmbean.SimpleType(name=java.lang.String)),(itemName=createdRollbackImages,itemType=javax.management.openmbean.SimpleType(name=java.lang.Boolean)),(itemName=finalizeTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=startTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)))),contents={blockPoolId=BP-1039609189-172.17.0.2-1637204447583, createdRollbackImages=true, finalizeTime=0, startTime=1637204448659})>
   [ERROR]   Run 3: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 expected null, but was:<javax.management.openmbean.CompositeDataSupport(compositeType=javax.management.openmbean.CompositeType(name=org.apache.hadoop.hdfs.protocol.RollingUpgradeInfo$Bean,items=((itemName=blockPoolId,itemType=javax.management.openmbean.SimpleType(name=java.lang.String)),(itemName=createdRollbackImages,itemType=javax.management.openmbean.SimpleType(name=java.lang.Boolean)),(itemName=finalizeTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)),(itemName=startTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long)))),contents={blockPoolId=BP-1039609189-172.17.0.2-1637204447583, createdRollbackImages=true, finalizeTime=0, startTime=1637204448659})>
   [INFO] 
   `
   
   Added "TestRollingUpgrade.testRollback" as a potentially flaky test here: https://issues.apache.org/jira/browse/HDFS-15646




[GitHub] [hadoop] KevinWikant commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-975776386


   > The DatanodeAdminBackoffMonitor is probably rarely used, if it is used at all, but it does not have a tracking limit I think at the moment. Perhaps it should have, it it was designed to run with less overhead than the default monitor, but perhaps if you decommissioned 100's of nodes at a time it would struggle, I am not sure.
   
   Based on unit testing & code inspection, I think the "dfs.namenode.decommission.max.concurrent.tracked.nodes" still applies to DatanodeAdminBackoffMonitor
   
   From the code, DatanodeAdminBackoffMonitor:
   - has an [additional data structure pendingRep](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java#L98)
       - note that [pendingRep is size constrained by "dfs.namenode.decommission.backoff.monitor.pending.limit"](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml)
   - will [process blocks from within pendingRep](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java#L293)
   - then will [move blocks from outOfServiceBlocks to pendingRep](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java#L296) so that they can be processed next cycle
   
   So:
   - [pendingRep gets its blocks from outOfServiceBlocks](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java#L492)
   - [outOfServiceBlocks gets its datanodes from the pendingQueue](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java#L226)
       - note how outOfServiceBlocks is still size constrained by "dfs.namenode.decommission.max.concurrent.tracked.nodes"
   
   The new unit test "TestDecommissionWithBackoffMonitor.testRequeueUnhealthyDecommissioningNodes" will fail without the changes made to "TestDecommissionWithBackoffMonitor". It fails because the unhealthy nodes have filled up the tracked set (i.e. outOfServiceBlocks) & the healthy nodes are stuck in the pendingNodes queue




[GitHub] [hadoop] KevinWikant edited a comment on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant edited a comment on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-975759431


   > If a node goes dead while decommissioning, would it not be better to just remove it from the decommission monitor rather than keep tracking it there at all? If the node comes alive again, it should be entered back into the monitor.
   > We could either detect it is dead in the monitor and remove it from tracking then, or have the place that logs the mentioned "node is dead while decommission in progress" remove it from the monitor.
   
   I agree with this statement & this is essentially the behavior this code change provides, with 1 caveat: "If the node comes alive again _OR if there are no (potentially healthy) nodes in the pendingNodes queue_, it will be entered back into the monitor"
   
   Consider the 2 alternatives:
   A) Do not track an unhealthy node until it becomes healthy
   B) Do not track an unhealthy node until it becomes healthy OR until there are fewer healthy nodes being decommissioned than "dfs.namenode.decommission.max.concurrent.tracked.nodes"
   
   Note that neither alternative prevents healthy nodes from being decommissioned. With Option B), if there are more nodes being decommissioned than can be tracked, any unhealthy nodes being tracked will be removed from the tracked set & re-queued (with a priority lower than all healthy nodes in the priority queue); then the healthy nodes will be de-queued & moved to the tracked set, and once the healthy nodes are decommissioned the unhealthy nodes can be tracked again.
   
   It may seem Option A) is more performant, as it avoids tracking unhealthy nodes on each DatanodeAdminMonitor cycle, but this may not be the case, as we would still need to check the health status of all the unhealthy nodes in the pendingNodes queue to determine if they should be tracked again. Furthermore, tracking unhealthy nodes is the current behavior & as far as I know it has not caused any performance problems.
   
   With Option B), unhealthy nodes are only tracked (and their health status is only checked) when there are fewer healthy nodes than "dfs.namenode.decommission.max.concurrent.tracked.nodes". This may result in superior performance for Option B), as it will only need to check the health of up to "dfs.namenode.decommission.max.concurrent.tracked.nodes" unhealthy nodes, rather than having to check the health of all nodes in the pendingNodes queue.
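   
   To make the Option B) rule concrete, here is a rough standalone sketch of the re-queue decision; it is illustrative only, under the assumptions described above, and is not the exact patch code:
   
   ```
   import java.util.Collections;
   import java.util.List;
   
   public class RequeueRuleSketch {
     // Re-queue unhealthy tracked nodes only when more nodes are decommissioning
     // than can be tracked, i.e. when healthy nodes may be waiting in the queue.
     static List<String> nodesToRequeue(List<String> unhealthyTracked,
         int numDecommissioning, int maxTracked) {
       if (maxTracked == 0 || numDecommissioning <= maxTracked) {
         return Collections.emptyList(); // queue is empty, keep tracking them
       }
       return unhealthyTracked; // free their slots for queued (healthy) nodes
     }
   
     public static void main(String[] args) {
       List<String> dead = List.of("dead1", "dead2");
       // 20 nodes decommissioning, limit of 10: unhealthy tracked nodes are re-queued.
       System.out.println(nodesToRequeue(dead, 20, 10)); // [dead1, dead2]
       // 8 nodes decommissioning, limit of 10: nothing queued, so nothing re-queued.
       System.out.println(nodesToRequeue(dead, 8, 10));  // []
     }
   }
   ```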
   
   Note that the log "node is dead while decommission in progress", occurs from within the DatanodeAdminMonitor:
   - https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminDefaultMonitor.java#L240
   - https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L4549




[GitHub] [hadoop] sodonnel commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
sodonnel commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-982018340


   For me, DECOMMISSION_IN_PROGRESS + DEAD is an error state that means decommission has effectively failed. There is a case where it can complete, but what does that really mean - if the node is dead, it has not been gracefully stopped. If it wasn't for the way decommission is triggered using the hosts files, I would suggest switching it back to IN_SERVICE + DEAD, and let it be treated like any other dead host.
   
   If you have some monitoring tool tracking the decommission, and it sees "DECOMMISSIONED", then it assumes the decommission went fine. 
   
   If it sees DECOMMISSION_IN_PROGRESS + DEAD, then it's a flag that the admin needs to go look into it, as it should not have happened - perhaps they need to bring the node back, or conclude that the cluster is still OK without it (no missing blocks) and add it to the exclude list and forget about it.
   
   My feeling is that the priority queue idea adds some more complexity to an already hard to follow process / code area and I wonder if it is better to just remove the node from the monitor and let it be dealt with manually, which may be required a lot of the time anyway?




[GitHub] [hadoop] virajjasani commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
virajjasani commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-998480240


   > If there are fewer nodes being decommissioned than max tracked nodes, then there are no nodes in the pendingNodes queue & all nodes are being tracked for decommissioning. Therefore, there is no possibility that any healthy nodes are blocked in the pendingNodes queue
   
   Yes makes sense. Thanks




[GitHub] [hadoop] aajisaka commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
aajisaka commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-987488754


   Thank you @KevinWikant @sodonnel @virajjasani for your discussion, and thanks @KevinWikant again for the deep dive.
   
   I'm in favor of the approach in this PR.
   
   > This violates the expectation that the unit test is enforcing, which is that a dead DECOMMISSION_INPROGRESS node should transition to DECOMMISSIONED when there are no LowRedundancy blocks
   
   The expectation was added by HDFS-7374. That JIRA allowed decommissioning of dead DataNodes, and the feature has been enabled for a long time, since Hadoop 2.7.0. I don't want to break the expectation because breaking it would surprise administrators.




[GitHub] [hadoop] KevinWikant commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-983034760


   > DECOMMISSION_IN_PROGRESS + DEAD is an error state that means decommission has effectively failed. There is a case where it can complete, but what does that really mean - if the node is dead, it has not been gracefully stopped.
   
   The case which I have described where dead node decommissioning completes can occur when:
   - a decommissioning node goes dead, but all of its blocks still have block replicas on other live nodes
   - the namenode is eventually able to satisfy the minimum replication of all blocks (by replicating the under-replicated blocks from the live nodes)
   - the dead decommissioning node is transitioned to decommissioned
   
   In this case, the node did go dead while decommissioning, but there was no data loss thanks to redundant block replicas. From the user perspective, the loss of the decommissioning node did not impact the outcome of the decommissioning process. Had the node not gone dead while decommissioning, the eventual outcome would have been the same: the node is decommissioned, there is no data loss, & all blocks have sufficient replicas.
   
   If there is data loss then a dead datanode will remain decommissioning, because if the dead node were to come alive again it may be able to recover the lost data. But if there is no data loss, then when the node comes alive again it will immediately transition to decommissioned anyway, so why not mark it decommissioned while it's still dead (and there is no data loss)?
   
   Also, I don't think the priority queue adds much complexity: it just puts healthy nodes (with more recent heartbeat times) ahead of unhealthy nodes (with older heartbeat times), so that healthy nodes are decommissioned first.
   
   ----
   
   I also want to call out another caveat with the approach of removing the node from the DatanodeAdminManager which I  uncovered while unit testing
   
   If we leave the node in DECOMMISSION_IN_PROGRESS & remove the node from DatanodeAdminManager, then the following callstack should re-add the datanode to the DatanodeAdminManager when it comes alive again:
   - [DatanodeManager.registerDatanode](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1223)
   - [DatanodeManager.startAdminOperationIfNecessary](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1109)
   - [DatanodeAdminManager.startDecommission](https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L187)
   - [DatanodeAdminMonitorBase.startTrackingNode](https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java#L136)
   
   The problem is this condition "!node.isDecommissionInProgress()": https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L177
   
   Because the dead datanode is left in DECOMMISSION_INPROGRESS, "startTrackingNode" is not invoked because of the "!node.isDecommissionInProgress()" condition
   
   Simply removing the condition "!node.isDecommissionInProgress()" will not function well because "startTrackingNode" is not idempotent:
   - [startDecommission is invoked periodically when refreshDatanodes is called](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1339)
   - [pendingNodes is an ArrayDequeue which does not deduplicate the DatanodeDescriptor](https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java#L43)
   - therefore, removing the "!node.isDecommissionInProgress()" check will cause a large number of duplicate DatanodeDescriptor to be added to DatanodeAdminManager
   
   I can think of 2 obvious ways to handle this:
   A) make calls to "startTrackingNode" idempotent (meaning that if the DatanodeDescriptor is already tracked, it does not get added to the DatanodeAdminManager)
   B) modify startDecommission such that its aware of if the invocation is for a datanode which was just restarted after being dead such that it can still invoke "startTrackingNode" even though "node.isDecommissionInProgress()"
   
   For "A)", the challenge is that we need to ensure the DatanodeDescriptor is not in "pendingReplication" or "outOfServiceBlocks" which could be a fairly costly call to execute repeatedly. Also, I am not even sure such a check is thread-safe given there is no locking used as part of "startDecommission" or "startTrackingNode"
   
   For "B)", the awareness of if a registerDatanode call is related to a [restarted datanode is available here](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1177). So this information would need to be passed down the callstack to a method "startDecommission(DatanodeDescriptor node, boolean isNodeRestart)". Because of the modified method signature, all the other invocations of "startDecommission" will need to specify isNodeRestart=false
   
   Given this additional hurdle in the approach of removing a dead datanode from the DatanodeAdminManager, are we sure it will be less complex/impactful than the proposed change?
   
   ----
   
   In short:
   - I don't think there is any downside in moving a dead datanode to decommissioned when there are no LowRedundancy blocks because this would happen immediately anyway were the node to come back alive (and get re-added to DatanodeAdminManager)
   - the approach of removing a dead datanode from the DatanodeAdminManager will not work properly without some significant refactoring of the "startDecommission" method & related code
   
   @sodonnel @virajjasani @aajisaka let me know your thoughts, I am still more in favor of tracking dead datanodes in DatanodeAdminManager (when there are LowRedundancy blocks), but if the community thinks it's better to remove the dead datanodes from DatanodeAdminManager I can implement proposal "B)"




[GitHub] [hadoop] aajisaka commented on a change in pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
aajisaka commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r770299279



##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminDefaultMonitor.java
##########
@@ -270,12 +274,34 @@ private void check() {
         // an invalid state.
         LOG.warn("DatanodeAdminMonitor caught exception when processing node "
             + "{}.", dn, e);
-        pendingNodes.add(dn);
-        toRemove.add(dn);
+        toRequeue.add(dn);

Review comment:
       I suppose reverting this change seems simpler because it can remove the `toRequeue` variable.
   ```suggestion
        getPendingNodes().add(dn);
        toRemove.add(dn);
   ```

##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java
##########
@@ -151,4 +163,34 @@ public int getPendingNodeCount() {
   public Queue<DatanodeDescriptor> getPendingNodes() {
     return pendingNodes;
   }
+
+  /**
+   * If node "is dead while in Decommission In Progress", it cannot be decommissioned
+   * until it becomes healthy again. If there are more pendingNodes than can be tracked
+   * & some unhealthy tracked nodes, then re-queue the unhealthy tracked nodes
+   * to avoid blocking decommissioning of healthy nodes.
+   *
+   * @param unhealthyDns The unhealthy datanodes which may be re-queued
+   * @param numDecommissioningNodes The total number of nodes being decommissioned
+   * @return List of unhealthy nodes to be re-queued
+   */
+  List<DatanodeDescriptor> identifyUnhealthyNodesToRequeue(
+      final List<DatanodeDescriptor> unhealthyDns, int numDecommissioningNodes) {

Review comment:
       Using Stream instead of List for this method may simplify the code and reduce the conversion between List and Stream. What do you think?

##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java
##########
@@ -35,12 +38,21 @@
 public abstract class DatanodeAdminMonitorBase
     implements DatanodeAdminMonitorInterface, Configurable {
 
+  /**
+   * Sort by lastUpdate time descending order, such that unhealthy
+   * nodes are de-prioritized given they cannot be decommissioned.
+   */
+  public static final Comparator<DatanodeDescriptor> PENDING_NODES_QUEUE_COMPARATOR =
+      (dn1, dn2) -> Long.compare(dn2.getLastUpdate(), dn1.getLastUpdate());
+
   protected BlockManager blockManager;
   protected Namesystem namesystem;
   protected DatanodeAdminManager dnAdmin;
   protected Configuration conf;
 
-  protected final Queue<DatanodeDescriptor> pendingNodes = new ArrayDeque<>();
+  private final PriorityQueue<DatanodeDescriptor> pendingNodes = new PriorityQueue<>(
+      DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_MAX_CONCURRENT_TRACKED_NODES_DEFAULT,

Review comment:
       I think we don't need to explicitly set the initial capacity. If the max concurrent tracked nodes is set to a lower value than the default, the capacity will be too large.






[GitHub] [hadoop] KevinWikant edited a comment on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant edited a comment on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-998300466


   > hence if few nodes are really in bad state (hardware/network issues), the plan is to keep re-queueing them until more nodes are getting decommissioned than max tracked nodes right?
   
   It's the opposite: the unhealthy nodes will only be re-queued when there are more nodes being decommissioned than max tracked nodes. Otherwise, if there are fewer nodes being decommissioned than max tracked nodes, then the unhealthy nodes will not be re-queued because they do not risk blocking the decommissioning of queued healthy nodes (i.e. because the queue is empty).
   
   One potential performance impact that does come to mind is that if there are, say, 200 unhealthy decommissioning nodes & max tracked nodes = 100, then this may cause some churn in the queueing/de-queueing process, because on each DatanodeAdminMonitor tick all 100 tracked nodes will be re-queued & then 100 queued nodes will be de-queued/tracked. Note that this churn (and any associated performance impact) will only take effect when:
   - there are more nodes being decommissioned than max tracked nodes
   - AND either:
       - number of healthy decommissioning nodes < max tracked nodes
       - number of unhealthy decommissioning nodes > max tracked nodes
   
   The amount of re-queued/de-queued nodes per tick can be quantified as:
   
   `numRequeue = numDecommissioning <= numTracked ? 0 : numDeadDecommissioning - (numDecommissioning - numTracked)`
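
   For example, with the numbers above (200 dead decommissioning nodes, no healthy nodes queued, & max tracked = 100): numRequeue = 200 - (200 - 100) = 100, i.e. every tracked slot is re-queued & refilled on each tick.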
   
   This churn of queueing/de-queueing will not occur at all under typical decommissioning scenarios (i.e. where there isn't a large number of dead decommissioning nodes).
   
   One idea to mitigate this is to have DatanodeAdminMonitor maintain counters used to track the number of healthy nodes in the pendingNodes queue; then these counts could be used to make an improved re-queue decision. In particular, unhealthy nodes are only re-queued if there are healthy nodes in the pendingNodes queue. But this approach has some flaws: for example, an unhealthy node in the queue could come alive again, yet an unhealthy node in the tracked set wouldn't be re-queued to make space for it because it's still counted as an unhealthy node. To solve this, we would need to scan the pendingNodes queue periodically to update the healthy/unhealthy node counts, and this scan could prove expensive.
   
   > Since unhealthy node getting decommissioned might anyways require some sort of retry, shall we requeue them even if the condition is not met (i.e. total no of decomm in progress < max tracked nodes) as a limited retries?
   
   If there are fewer nodes being decommissioned than max tracked nodes, then there are no nodes in the pendingNodes queue & all nodes are being tracked for decommissioning. Therefore, there is no possibility that any healthy nodes are blocked in the pendingNodes queue (preventing them from being decommissioned) & so in my opinion there is no benefit to re-queueing the unhealthy nodes in this case. Furthermore, this will negatively impact performance through frequent re-queueing & de-queueing.




[GitHub] [hadoop] KevinWikant edited a comment on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant edited a comment on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-998300466


   > hence if few nodes are really in bad state (hardware/network issues), the plan is to keep re-queueing them until more nodes are getting decommissioned than max tracked nodes right?
   
   It's the opposite: the unhealthy nodes will only be re-queued when there are more nodes being decommissioned than max tracked nodes. Otherwise, if there are fewer nodes being decommissioned than max tracked nodes, then the unhealthy nodes will not be re-queued because they do not risk blocking the decommissioning of queued healthy nodes (i.e. because the queue is empty).
   
   One potential performance impact that does come to mind is that if there are, say, 200 unhealthy decommissioning nodes & max tracked nodes = 100, then this may cause some churn in the queueing/de-queueing process, because on each DatanodeAdminMonitor tick all 100 tracked nodes will be re-queued & then 100 queued nodes will be de-queued/tracked. Note that this churn (and any associated performance impact) will only take effect when:
   - there are more nodes being decommissioned than max tracked nodes
   - AND either:
       - number of healthy decommissioning nodes < max tracked nodes
       - number of unhealthy decommissioning nodes > max tracked nodes
   
   The amount of re-queued/de-queued nodes per tick can be quantified as:
   
   `numRequeue = numDecommissioning <= numTracked ? 0 : numDeadDecommissioning - (numDecommissioning - numTracked)`
   
   This churn of queueing/de-queueing will not occur at all under typical decommissioning scenarios (i.e. where there isn't a large number of dead decommissioning nodes).
   
   One idea to mitigate this is to have DatanodeAdminMonitor maintain counters used to track the number of healthy nodes in the pendingNodes queue; then this count can be used to make an improved re-queue decision. In particular, unhealthy nodes are only re-queued if there are healthy nodes in the pendingNodes queue. But this approach has some flaws: for example, an unhealthy node in the queue could come alive again, but then an unhealthy node in the tracked set wouldn't be re-queued because the healthy queued node count hasn't been updated. To solve this, we would need to scan the pendingNodes queue periodically to update the healthy/unhealthy node counts, and this scan could prove expensive.
   
   > Since unhealthy node getting decommissioned might anyways require some sort of retry, shall we requeue them even if the condition is not met (i.e. total no of decomm in progress < max tracked nodes) as a limited retries?
   
   If there are fewer nodes being decommissioned than max tracked nodes, then there are no nodes in the pendingNodes queue & all nodes are being tracked for decommissioning. Therefore, there is no possibility that any healthy nodes are blocked in the pendingNodes queue (preventing them from being decommissioned) & so in my opinion there is no benefit to re-queueing the unhealthy nodes in this case. Furthermore, this will negatively impact performance through frequent re-queueing & de-queueing.




[GitHub] [hadoop] virajjasani edited a comment on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
virajjasani edited a comment on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-998059075


   Sorry, I could not get to this PR last week. I will review later this week, but I don't mean to block this work. If I find something odd or something that could be improved, we can always get it clarified on the PR/Jira or create an addendum PR later.
   Thanks for your work @KevinWikant, this might be really helpful going forward.
   
   With a quick glance, just one question for now: Overall it seems the goal is to improve and continue the decommissioning of healthy nodes over unhealthy ones (by removing and then re-queueing the entries), hence if few nodes are really in bad state (hardware/network issues), the plan is to keep re-queueing them until more nodes are getting decommissioned than max tracked nodes right? Since unhealthy node getting decommissioned might anyways require some sort of retry, shall we requeue them even if the condition is not met (i.e. total no of decomm in progress < max tracked nodes) as a limited retries? I am just thinking at high level, yet to catch up with the PR.
   
   Also, good to know HDFS-7374 is not broken.




[GitHub] [hadoop] KevinWikant commented on a change in pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r774293315



##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java
##########
@@ -189,6 +190,29 @@ public void run() {
          * node will be removed from tracking by the pending cancel.
          */
         processCancelledNodes();
+
+        // Having more nodes decommissioning than can be tracked will impact decommissioning
+        // performance due to queueing delay
+        int numTrackedNodes = outOfServiceNodeBlocks.size();
+        int numQueuedNodes = getPendingNodes().size();
+        int numDecommissioningNodes = numTrackedNodes + numQueuedNodes;
+        if (numDecommissioningNodes > maxConcurrentTrackedNodes) {
+          LOG.warn(
+              "There are {} nodes decommissioning but only {} nodes will be tracked at a time. "

Review comment:
       updated as per the recommendation






[GitHub] [hadoop] hadoop-yetus commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
hadoop-yetus commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-972487405


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 44s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 4 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  32m 10s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 25s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   1m 23s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m  1s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 30s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  3s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 32s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 11s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  22m 12s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 17s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 18s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |   1m 18s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 14s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |   1m 14s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 52s |  |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 45 unchanged - 1 fixed = 45 total (was 46)  |
   | +1 :green_heart: |  mvnsite  |   1m 18s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 50s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 24s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 13s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  21m 59s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 223m  2s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 48s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 321m 13s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.TestRollingUpgrade |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3675 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux ca518d3e4d66 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / b77f0434b50d801584665d0e601e7f1c96c34743 |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/1/testReport/ |
   | Max. process+thread count | 3738 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/1/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




[GitHub] [hadoop] KevinWikant commented on a change in pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r754456231



##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
##########
@@ -1654,4 +1658,139 @@ public Boolean get() {
 
     cleanupFile(fileSys, file);
   }
+
+  /**
+   * Test DatanodeAdminManager logic to re-queue unhealthy decommissioning nodes
+   * which are blocking the decommissioning of healthy nodes.
+   * Force the tracked nodes set to be filled with nodes lost while decommissioning,
+   * then decommission healthy nodes & validate they are decommissioned eventually.
+   */
+  @Test(timeout = 120000)
+  public void testRequeueUnhealthyDecommissioningNodes() throws Exception {
+    // Allow 3 datanodes to be decommissioned at a time
+    getConf().setInt(DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_MAX_CONCURRENT_TRACKED_NODES, 3);
+    // Disable the normal monitor runs
+    getConf()
+        .setInt(MiniDFSCluster.DFS_NAMENODE_DECOMMISSION_INTERVAL_TESTING_KEY, Integer.MAX_VALUE);
+
+    // Start cluster with 6 datanodes
+    startCluster(1, 6);
+    final FSNamesystem namesystem = getCluster().getNamesystem();
+    final BlockManager blockManager = namesystem.getBlockManager();
+    final DatanodeManager datanodeManager = blockManager.getDatanodeManager();
+    final DatanodeAdminManager decomManager = datanodeManager.getDatanodeAdminManager();
+    assertEquals(6, getCluster().getDataNodes().size());
+
+    // 3 datanodes will be "live" datanodes that are expected to be decommissioned eventually
+    final List<DatanodeDescriptor> liveNodes = getCluster().getDataNodes().subList(3, 6).stream()
+        .map(dn -> getDatanodeDesriptor(namesystem, dn.getDatanodeUuid()))
+        .collect(Collectors.toList());
+    assertEquals(3, liveNodes.size());
+
+    // 3 datanodes will be "dead" datanodes that are expected to never be decommissioned
+    final List<DatanodeDescriptor> deadNodes = getCluster().getDataNodes().subList(0, 3).stream()
+        .map(dn -> getDatanodeDesriptor(namesystem, dn.getDatanodeUuid()))
+        .collect(Collectors.toList());
+    assertEquals(3, deadNodes.size());
+
+    // Need to create some data or "isNodeHealthyForDecommissionOrMaintenance"
+    // may unexpectedly return true for a dead node
+    writeFile(getCluster().getFileSystem(), new Path("/tmp/test1"), 1, 100);
+
+    // Cause the 3 "dead" nodes to be lost while in state decommissioning
+    // and fill the tracked nodes set with those 3 "dead" nodes
+    ArrayList<DatanodeInfo> decommissionedNodes = Lists.newArrayList();
+    int expectedNumTracked = 0;
+    for (final DatanodeDescriptor deadNode : deadNodes) {
+      assertEquals(0, decomManager.getNumPendingNodes());
+      assertEquals(expectedNumTracked, decomManager.getNumTrackedNodes());
+
+      // Start decommissioning the node
+      takeNodeOutofService(0, deadNode.getDatanodeUuid(), 0, decommissionedNodes,
+          AdminStates.DECOMMISSION_INPROGRESS);
+      decommissionedNodes.add(deadNode);
+
+      // Stop the datanode so that it is lost while decommissioning
+      getCluster().stopDataNode(deadNode.getXferAddr());
+      deadNode.setLastUpdate(213); // Set last heartbeat to be in the past
+
+      // Wait for the decommissioning datanode to become unhealthy
+      // & to be added to "pendingNodes"
+      final int finalExpectedNumTracked = expectedNumTracked;
+      GenericTestUtils.waitFor(() ->
+          !BlockManagerTestUtil.isNodeHealthyForDecommissionOrMaintenance(blockManager, deadNode)
+              && decomManager.getNumTrackedNodes() == finalExpectedNumTracked
+              && decomManager.getNumPendingNodes() == 1, 500, 30000);
+
+      // Run DatanodeAdminManager.Monitor & validate the node becomes tracked
+      BlockManagerTestUtil.recheckDecommissionState(datanodeManager);
+      expectedNumTracked++;
+      assertEquals(0, decomManager.getNumPendingNodes());
+      assertEquals(expectedNumTracked, decomManager.getNumTrackedNodes());
+    }
+
+    // Validate the 3 "dead" nodes are not removed from the tracked nodes set
+    // after several seconds of operation
+    final Duration checkDuration = Duration.ofSeconds(5);
+    Instant checkUntil = Instant.now().plus(checkDuration);
+    while (Instant.now().isBefore(checkUntil)) {
+      BlockManagerTestUtil.recheckDecommissionState(datanodeManager);
+      assertEquals(0, decomManager.getNumPendingNodes());
+      assertEquals(3, decomManager.getNumTrackedNodes());
+      Thread.sleep(500);
+    }
+
+    // Validate that the "live" nodes are considered healthy
+    GenericTestUtils.waitFor(() -> {
+      try {
+        return liveNodes.stream().allMatch(
+            dn -> BlockManagerTestUtil.isNodeHealthyForDecommissionOrMaintenance(blockManager, dn));
+      } catch (Exception e) {
+        LOG.warn("Exception running isNodeHealthyForDecommissionOrMaintenance", e);
+        return false;
+      }
+    }, 500, 30000);
+
+    // Start decommissioning 2 "live" datanodes
+    for (final DatanodeDescriptor liveNode : liveNodes.subList(0, 2)) {
+      takeNodeOutofService(0, liveNode.getDatanodeUuid(), 0, decommissionedNodes,
+          AdminStates.DECOMMISSION_INPROGRESS);
+      decommissionedNodes.add(liveNode);
+    }
+
+    // Validate that the live datanodes are put into the pending decommissioning queue
+    GenericTestUtils.waitFor(
+        () -> decomManager.getNumTrackedNodes() == 3 && decomManager.getNumPendingNodes() == 2, 500,
+        30000);
+
+    // Validate that 2 of the "dead" nodes are re-queued.
+    // Need to comment out this check because DatanodeAdminBackoffMonitor
+    // manages datanodes with different logic than DatanodeAdminDefaultMonitor.
+    /*
+    BlockManagerTestUtil.recheckDecommissionState(datanodeManager);
+    assertEquals(4, decomManager.getNumPendingNodes());
+    assertEquals(1, decomManager.getNumTrackedNodes());
+    */

Review comment:
       This assertion passes consistently for "TestDecommission", but fails for "TestDecommissionWithBackoffMonitor".
   
   I will investigate why it's failing for "TestDecommissionWithBackoffMonitor" & get back to you






[GitHub] [hadoop] KevinWikant edited a comment on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant edited a comment on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-975776386


   > The DatanodeAdminBackoffMonitor is probably rarely used, if it is used at all, but it does not have a tracking limit I think at the moment. Perhaps it should have, as it was designed to run with less overhead than the default monitor, but perhaps if you decommissioned 100's of nodes at a time it would struggle, I am not sure.
   
   Based on unit testing & code inspection, I think the "dfs.namenode.decommission.max.concurrent.tracked.nodes" still applies to DatanodeAdminBackoffMonitor
   
   From the code, DatanodeAdminBackoffMonitor:
   - has an [additional data structure pendingRep](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java#L98)
       - note that [pendingRep is size constrained by "dfs.namenode.decommission.backoff.monitor.pending.limit"](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml)
   - will first [process blocks from within pendingRep](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java#L293)
   - then will [move blocks from outOfServiceBlocks to pendingRep](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java#L296) so that they can be processed next cycle
   
   So:
   - [pendingRep gets its blocks from outOfServiceBlocks](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java#L492)
   - [outOfServiceBlocks gets its datanodes from the pendingQueue](https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java#L226)
       - note how outOfServiceBlocks is still size constrained by "dfs.namenode.decommission.max.concurrent.tracked.nodes"
   
   The new unit test "TestDecommissionWithBackoffMonitor.testRequeueUnhealthyDecommissioningNodes" will fail without the changes made to "DatanodeAdminBackoffMonitor". It fails because the unhealthy nodes have filled up the tracked set (i.e. outOfServiceBlocks) & the healthy nodes are stuck in the pendingNodes queue.




[GitHub] [hadoop] KevinWikant commented on a change in pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r774293566



##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestDatanodeAdminMonitorBase.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hdfs.server.blockmanagement;
+
+import org.junit.Assert;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+import java.util.List;
+import java.util.PriorityQueue;
+import java.util.stream.Collectors;
+
+import org.apache.hadoop.hdfs.protocol.DatanodeID;
+
+public class TestDatanodeAdminMonitorBase {
+  public static final Logger LOG = LoggerFactory.getLogger(TestDatanodeAdminMonitorBase.class);
+
+  // Sort by lastUpdate time descending order, such that unhealthy
+  // nodes are de-prioritized given they cannot be decommissioned.
+  private static final int NUM_DATANODE = 10;
+  private static final int[] UNORDERED_LAST_UPDATE_TIMES =
+      new int[] {0, 5, 2, 11, 0, 3, 1001, 5, 1, 103};
+  private static final int[] ORDERED_LAST_UPDATE_TIMES =
+      new int[] {1001, 103, 11, 5, 5, 3, 2, 1, 0, 0};
+  private static final int[] REVERSE_ORDER_LAST_UPDATE_TIMES =
+      new int[] {0, 0, 1, 2, 3, 5, 5, 11, 103, 1001};
+
+  private static final DatanodeDescriptor[] NODES;
+
+  static {
+    NODES = new DatanodeDescriptor[NUM_DATANODE];
+    for (int i = 0; i < NUM_DATANODE; i++) {
+      NODES[i] = new DatanodeDescriptor(DatanodeID.EMPTY_DATANODE_ID);
+      NODES[i].setLastUpdate(UNORDERED_LAST_UPDATE_TIMES[i]);
+      NODES[i].setLastUpdateMonotonic(UNORDERED_LAST_UPDATE_TIMES[i]);
+    }
+  }
+
+  /**
+   * Verify that DatanodeAdminManager pendingNodes priority queue
+   * correctly orders the nodes by lastUpdate time descending.
+   */
+  @Test
+  public void testPendingNodesQueueOrdering() {
+    final PriorityQueue<DatanodeDescriptor> pendingNodes =
+        new PriorityQueue<>(10,
+            DatanodeAdminMonitorBase.PENDING_NODES_QUEUE_COMPARATOR);
+
+    for (int i = 0; i < NUM_DATANODE; i++) {
+      pendingNodes.add(NODES[i]);
+    }

Review comment:
       updated






[GitHub] [hadoop] hadoop-yetus commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
hadoop-yetus commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-1000232106


   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   1m 20s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 4 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  42m  1s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 40s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   1m 25s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m  4s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 35s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 12s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 35s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 43s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  26m 52s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 26s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 37s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |   1m 37s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 22s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |   1m 22s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 57s |  |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 45 unchanged - 1 fixed = 45 total (was 46)  |
   | +1 :green_heart: |  mvnsite  |   1m 33s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 58s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 29s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 46s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  27m 18s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  | 350m  2s |  |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 13s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 471m 19s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/5/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3675 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 11c131aeedfc 4.15.0-162-generic #170-Ubuntu SMP Mon Oct 18 11:38:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / b57cbefa664b3c7b2ffc32298b6561b79c86baa0 |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/5/testReport/ |
   | Max. process+thread count | 1803 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/5/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




[GitHub] [hadoop] KevinWikant commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-980107774


   > If a DN goes dead and then re-registers, it will be added back into the pending nodes, so I don't think we need to continue to track it in the decommission monitor when it goes dead. We can just handle the dead event in the decommission monitor and stop tracking it, clearing a slot for another node. Then it will be re-added if it comes back by existing logic above.
   
   After going through this re-implementation of the logic, I think I found a caveat worth mentioning
   
   Note the implementation of "isNodeHealthyForDecommissionOrMaintenance":
   - https://github.com/apache/hadoop/blob/dc751df63b4ab2c9c26a1efe7479c31fd1de80d5/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L4546
   - if HDFS has no lowRedundancy or pendingReconstruction blocks then all datanodes (including dead datanodes) are considered "healthy" for the purpose of being DECOMMISSIONED
   
   Here's the happy case:
   - a dead datanode is removed from DatanodeAdminManager (i.e. removed from both outOfServiceNodeBlocks & pendingNodes) with AdminState=DECOMMISSION_INPROGRESS
   - a dead datanode comes back alive, gets re-added to DatanodeAdminManager, & transitions to AdminState=DECOMMISSIONED
   
   Here's a less happy case:
   - a dead datanode is removed from DatanodeAdminManager with AdminState=DECOMMISSION_INPROGRESS
   - the dead datanode never comes back alive & therefore remains with AdminState=DECOMMISSION_INPROGRESS forever
   
   The problem is that dead datanodes can be eventually DECOMMISSIONED if the namenode eliminates all lowRedundancy blocks (because of "isNodeHealthyForDecommissionOrMaintenance" logic), but by removing the datanode from the DatanodeAdminManager entirely (until it comes back alive) we prevent the dead datanode from transitioning to DECOMMISSIONED even though it could have
   
   So the impact is that a dead datanode which never comes back alive again will never transition from DECOMMISSION_INPROGRESS to DECOMMISSIONED even though it could have




[GitHub] [hadoop] KevinWikant commented on a change in pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r762076676



##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
##########
@@ -1654,4 +1658,139 @@ public Boolean get() {
 
     cleanupFile(fileSys, file);
   }
+
+  /**
+   * Test DatanodeAdminManager logic to re-queue unhealthy decommissioning nodes
+   * which are blocking the decommissioning of healthy nodes.
+   * Force the tracked nodes set to be filled with nodes lost while decommissioning,
+   * then decommission healthy nodes & validate they are decommissioned eventually.
+   */
+  @Test(timeout = 120000)
+  public void testRequeueUnhealthyDecommissioningNodes() throws Exception {
+    // Allow 3 datanodes to be decommissioned at a time
+    getConf().setInt(DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_MAX_CONCURRENT_TRACKED_NODES, 3);
+    // Disable the normal monitor runs
+    getConf()
+        .setInt(MiniDFSCluster.DFS_NAMENODE_DECOMMISSION_INTERVAL_TESTING_KEY, Integer.MAX_VALUE);
+
+    // Start cluster with 6 datanodes
+    startCluster(1, 6);
+    final FSNamesystem namesystem = getCluster().getNamesystem();
+    final BlockManager blockManager = namesystem.getBlockManager();
+    final DatanodeManager datanodeManager = blockManager.getDatanodeManager();
+    final DatanodeAdminManager decomManager = datanodeManager.getDatanodeAdminManager();
+    assertEquals(6, getCluster().getDataNodes().size());
+
+    // 3 datanodes will be "live" datanodes that are expected to be decommissioned eventually
+    final List<DatanodeDescriptor> liveNodes = getCluster().getDataNodes().subList(3, 6).stream()
+        .map(dn -> getDatanodeDesriptor(namesystem, dn.getDatanodeUuid()))
+        .collect(Collectors.toList());
+    assertEquals(3, liveNodes.size());
+
+    // 3 datanodes will be "dead" datanodes that are expected to never be decommissioned
+    final List<DatanodeDescriptor> deadNodes = getCluster().getDataNodes().subList(0, 3).stream()
+        .map(dn -> getDatanodeDesriptor(namesystem, dn.getDatanodeUuid()))
+        .collect(Collectors.toList());
+    assertEquals(3, deadNodes.size());
+
+    // Need to create some data or "isNodeHealthyForDecommissionOrMaintenance"
+    // may unexpectedly return true for a dead node
+    writeFile(getCluster().getFileSystem(), new Path("/tmp/test1"), 1, 100);

Review comment:
       should use a larger replication factor here to ensure there are LowRedundancy blocks
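
    For illustration, the adjustment might look like the following (a sketch only; the `writeFile` arguments mirror the call in the diff above, and the replication factor of 6 is an assumption matching the 6-datanode cluster):
    ```java
    // With a replication factor equal to the number of datanodes, stopping any
    // datanode necessarily leaves under-replicated (LowRedundancy) blocks behind,
    // so isNodeHealthyForDecommissionOrMaintenance cannot return true for a dead node.
    writeFile(getCluster().getFileSystem(), new Path("/tmp/test1"), 6, 100);
    ```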






[GitHub] [hadoop] KevinWikant edited a comment on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant edited a comment on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-983034760


   > DECOMMISSION_IN_PROGRESS + DEAD is an error state that means decommission has effectively failed. There is a case where it can complete, but what does that really mean - if the node is dead, it has not been gracefully stopped.
   
   The case which I have described where dead node decommissioning completes can occur when:
   - a decommissioning node goes dead, but all of its blocks still have block replicas on other live nodes
   - the namenode is eventually able to satisfy the minimum replication of all blocks (by replicating the under-replicated blocks from the live nodes)
   - the dead decommissioning node is transitioned to decommissioned
   
   In this case, the node did go dead while decommissioning, but there were no LowRedundancy blocks thanks to redundant block replicas. From the user perspective, the loss of the decommissioning node did not impact the outcome of the decommissioning process. Had the node not gone dead while decommissioning, the eventual outcome would have been the same: the node is decommissioned & there are no LowRedundancy blocks.
   
   If there are LowRedundancy blocks then a dead datanode will remain decommissioning, because if the dead node were to come alive again then it may be able to recover the LowRedundancy blocks. But if there are no LowRedundancy blocks then when the node comes alive again it will immediately transition to decommissioned anyway, so why not make it decommissioned while it's still dead?
   
   Also, I don't think the priority queue is adding much complexity; it just puts healthy nodes (with more recent heartbeat times) ahead of unhealthy nodes (with older heartbeat times), such that healthy nodes are decommissioned first
   
   ----
   
   I also want to call out another caveat with the approach of removing the node from the DatanodeAdminManager, which I uncovered while unit testing
   
   If we leave the node in DECOMMISSION_IN_PROGRESS & remove the node from DatanodeAdminManager, then the following callstack should re-add the datanode to the DatanodeAdminManager when it comes alive again:
   - [DatanodeManager.registerDatanode](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1223)
   - [DatanodeManager.startAdminOperationIfNecessary](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1109)
   - [DatanodeAdminManager.startDecommission](https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L187)
   - [DatanodeAdminMonitorBase.startTrackingNode](https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java#L136)
   
   The problem is this condition "!node.isDecommissionInProgress()": https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L177
   
   Because the dead datanode is left in DECOMMISSION_INPROGRESS, "startTrackingNode" is not invoked because of the "!node.isDecommissionInProgress()" condition
   
   Simply removing the condition "!node.isDecommissionInProgress()" will not function well because "startTrackingNode" is not idempotent:
   - [startDecommission is invoked periodically when refreshDatanodes is called](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1339)
   - [pendingNodes is an ArrayDequeue which does not deduplicate the DatanodeDescriptor](https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java#L43)
   - therefore, removing the "!node.isDecommissionInProgress()" check will cause a large number of duplicate DatanodeDescriptor to be added to DatanodeAdminManager
   
   I can think of 2 obvious ways to handle this:
   A) make calls to "startTrackingNode" idempotent (meaning that if the DatanodeDescriptor is already tracked, it does not get added to the DatanodeAdminManager)
   B) modify startDecommission such that it is aware of whether the invocation is for a datanode which was just restarted after being dead, such that it can still invoke "startTrackingNode" even though "node.isDecommissionInProgress()" is true
   
   For "A)", the challenge is that we need to ensure the DatanodeDescriptor is not in "pendingReplication" or "outOfServiceBlocks" which could be a fairly costly call to execute repeatedly. Also, I am not even sure such a check is thread-safe given there is no locking used as part of "startDecommission" or "startTrackingNode"
   
   For "B)", the awareness of if a registerDatanode call is related to a restarted datanode is available [here](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1177). So this information would need to be passed down the callstack to a method "startDecommission(DatanodeDescriptor node, boolean isNodeRestart)". Because of the modified method signature, all the other invocations of "startDecommission" will need to specify isNodeRestart=false
   
   Given this additional hurdle in the approach of removing a dead datanode from the DatanodeAdminManager, are we sure it will be less complex than the proposed change?
   
   ----
   
   In short:
   - I don't think there is any downside in moving a dead datanode to decommissioned when there are no LowRedundancy blocks because this would happen immediately anyway were the node to come back alive (and get re-added to DatanodeAdminManager)
   - the approach of removing a dead datanode from the DatanodeAdminManager will not work properly without some significant refactoring of the "startDecommission" method & related code
   
   @sodonnel @virajjasani @aajisaka let me know your thoughts. I am still more in favor of tracking dead datanodes in DatanodeAdminManager (when there are LowRedundancy blocks), but if the community thinks it's better to remove the dead datanodes from DatanodeAdminManager, I can implement proposal "B)"




[GitHub] [hadoop] KevinWikant edited a comment on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant edited a comment on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-980107774


   > If a DN goes dead and then re-registers, it will be added back into the pending nodes, so I don't think we need to continue to track it in the decommission monitor when it goes dead. We can just handle the dead event in the decommission monitor and stop tracking it, clearing a slot for another node. Then it will be re-added if it comes back by existing logic above.
   
   After going through the re-implementation of the logic, I think I found a caveat worth mentioning
   
   Note the implementation of "isNodeHealthyForDecommissionOrMaintenance":
   - https://github.com/apache/hadoop/blob/dc751df63b4ab2c9c26a1efe7479c31fd1de80d5/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L4546
   - if HDFS has no lowRedundancy or pendingReconstruction blocks then all datanodes (including dead datanodes) are considered "healthy" for the purpose of being DECOMMISSIONED
   
   Here's the happy case:
   - a dead datanode is removed from DatanodeAdminManager (i.e. removed from both outOfServiceNodeBlocks & pendingNodes) with AdminState=DECOMMISSION_INPROGRESS
   - a dead datanode comes back alive, gets re-added to DatanodeAdminManager, & transitions to AdminState=DECOMMISSIONED
   
   Here's a less happy case:
   - a dead datanode is removed from DatanodeAdminManager with AdminState=DECOMMISSION_INPROGRESS
   - the dead datanode never comes back alive & therefore remains with AdminState=DECOMMISSION_INPROGRESS forever
   
   The problem is that dead datanodes can be eventually DECOMMISSIONED if the namenode eliminates all lowRedundancy blocks (because of "isNodeHealthyForDecommissionOrMaintenance" logic), but by removing the dead datanode from the DatanodeAdminManager entirely (until it comes back alive) we prevent the dead datanode from transitioning to DECOMMISSIONED even though it could have
   
   So the impact is that a dead datanode which never comes back alive again will never transition from DECOMMISSION_INPROGRESS to DECOMMISSIONED even though it could have




[GitHub] [hadoop] aajisaka commented on a change in pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
aajisaka commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r771228531



##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
##########
@@ -1654,4 +1658,204 @@ public Boolean get() {
 
     cleanupFile(fileSys, file);
   }
+
+  /**
+   * Test DatanodeAdminManager logic to re-queue unhealthy decommissioning nodes
+   * which are blocking the decommissioning of healthy nodes.
+   * Force the tracked nodes set to be filled with nodes lost while decommissioning,
+   * then decommission healthy nodes & validate they are decommissioned eventually.
+   */
+  @Test(timeout = 120000)
+  public void testRequeueUnhealthyDecommissioningNodes() throws Exception {
+    // Create a MiniDFSCluster with 3 live datanode in AdminState=NORMAL and
+    // 2 dead datanodes in AdminState=DECOMMISSION_INPROGRESS and a file
+    // with replication factor of 5.
+    final int numLiveNodes = 3;
+    final int numDeadNodes = 2;
+    final int numNodes = numLiveNodes + numDeadNodes;
+    final List<DatanodeDescriptor> liveNodes = new ArrayList<>();
+    final Map<DatanodeDescriptor, MiniDFSCluster.DataNodeProperties> deadNodeProps =
+        new HashMap<>();
+    final ArrayList<DatanodeInfo> decommissionedNodes = new ArrayList<>();
+    final Path filePath = new Path("/tmp/test");
+    createClusterWithDeadNodesDecommissionInProgress(numLiveNodes, liveNodes, numDeadNodes,
+        deadNodeProps, decommissionedNodes, filePath);
+    final FSNamesystem namesystem = getCluster().getNamesystem();
+    final BlockManager blockManager = namesystem.getBlockManager();
+    final DatanodeManager datanodeManager = blockManager.getDatanodeManager();
+    final DatanodeAdminManager decomManager = datanodeManager.getDatanodeAdminManager();
+
+    // Validate the 2 "dead" nodes are not removed from the tracked nodes set
+    // after several seconds of operation
+    final Duration checkDuration = Duration.ofSeconds(5);
+    Instant checkUntil = Instant.now().plus(checkDuration);
+    while (Instant.now().isBefore(checkUntil)) {
+      BlockManagerTestUtil.recheckDecommissionState(datanodeManager);
+      assertEquals(0, decomManager.getNumPendingNodes());
+      assertEquals(numDeadNodes, decomManager.getNumTrackedNodes());
+      assertTrue(deadNodeProps.keySet().stream()
+          .allMatch(node -> node.getAdminState().equals(AdminStates.DECOMMISSION_INPROGRESS)));
+      Thread.sleep(500);
+    }
+
+    // Delete the file such that its no longer a factor blocking decommissioning of live nodes
+    // which have block replicas for that file
+    getCluster().getFileSystem().delete(filePath, true);
+
+    // Start decommissioning 2 "live" datanodes
+    int numLiveDecommNodes = 2;
+    final List<DatanodeDescriptor> liveDecommNodes = liveNodes.subList(0, numLiveDecommNodes);
+    for (final DatanodeDescriptor liveNode : liveDecommNodes) {
+      takeNodeOutofService(0, liveNode.getDatanodeUuid(), 0, decommissionedNodes,
+          AdminStates.DECOMMISSION_INPROGRESS);
+      decommissionedNodes.add(liveNode);
+    }
+
+    // Write a new file such that there are under-replicated blocks preventing decommissioning
+    // of dead nodes
+    writeFile(getCluster().getFileSystem(), filePath, numNodes, 10);
+
+    // Validate that the live datanodes are put into the pending decommissioning queue
+    GenericTestUtils.waitFor(() -> decomManager.getNumTrackedNodes() == numDeadNodes
+            && decomManager.getNumPendingNodes() == numLiveDecommNodes
+            && liveDecommNodes.stream().allMatch(
+                node -> node.getAdminState().equals(AdminStates.DECOMMISSION_INPROGRESS)),
+        500, 30000);
+    for (final DatanodeDescriptor node : decomManager.getPendingNodes()) {
+      assertTrue(liveDecommNodes.contains(node));
+    }

Review comment:
       Sorry I meant `org.assertj.core.api.Assertions.assertThat`.
   https://joel-costigliola.github.io/assertj/
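
   For illustration, the contains-loop above could collapse into a single AssertJ assertion along these lines (a sketch only, assuming AssertJ is already on the test classpath; the exact assertion in the final patch may differ):

   ```
   import static org.assertj.core.api.Assertions.assertThat;

   // Every node sitting in the pending decommission queue should be one of
   // the live nodes we asked to decommission.
   assertThat(decomManager.getPendingNodes())
       .as("pending queue should only contain the live decommissioning nodes")
       .isSubsetOf(liveDecommNodes);
   ```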






[GitHub] [hadoop] KevinWikant edited a comment on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant edited a comment on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-998300466


   @virajjasani, please see my response to your comments below
   
   > hence if few nodes are really in bad state (hardware/network issues), the plan is to keep re-queueing them until more nodes are getting decommissioned than max tracked nodes right?
   
   It's the opposite: the unhealthy nodes will only be re-queued when there are more nodes being decommissioned than max tracked nodes. Otherwise, if there are fewer nodes being decommissioned than max tracked nodes, the unhealthy nodes will not be re-queued because they do not risk blocking the decommissioning of queued healthy nodes (i.e. because the queue is empty).
   
   One potential performance impact that comes to mind is that if there are, say, 200 unhealthy decommissioning nodes & max tracked nodes = 100, then this may cause some churn in the queueing/de-queueing process, because on each DatanodeAdminMonitor tick all 100 tracked nodes will be re-queued & then 100 queued nodes will be de-queued/tracked. Note that this churn (and any associated performance impact) will only take effect when:
   - there are more nodes being decommissioned than max tracked nodes
   - AND either:
       - number of healthy decommissioning nodes < max tracked nodes
       - number of unhealthy decommissioning nodes > max tracked nodes
   
   The amount of re-queued/de-queued nodes per tick can be quantified as:
   
   `numRequeue = numDecommissioning <= numTracked ? 0 : numDeadDecommissioning - (numDecommissioning - numTracked)`
   
   This churn of queueing/de-queueing will not occur at all under typical decommissioning scenarios (i.e. where there isn't a large number of dead decommissioning nodes).
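   
   As a concrete illustration of that estimate (plain local variables only, not the actual DatanodeAdminMonitor fields), plugging in the 200-dead-nodes example from above:
   
   ```
   // Sketch only: per-tick re-queue churn estimate for the scenario above.
   int numTracked = 100;             // dfs.namenode.decommission.max.concurrent.tracked.nodes
   int numDeadDecommissioning = 200; // assume all decommissioning nodes are dead here
   int numDecommissioning = 200;
   int numRequeue = numDecommissioning <= numTracked
       ? 0
       : numDeadDecommissioning - (numDecommissioning - numTracked);
   // numRequeue == 100: every tracked slot is re-queued and re-filled each tick
   ```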
   
   One idea to mitigate this is to have DatanodeAdminMonitor maintain counters used to track the number of healthy nodes in the pendingNodes queue; then this count can be used to make an improved re-queue decision. In particular, unhealthy nodes are only re-queued if there are healthy nodes in the pendingNodes queue. But this approach has some flaws: for example, an unhealthy node in the queue could come alive again, yet an unhealthy node in the tracked set wouldn't be re-queued because the healthy queued node count hasn't been updated. To solve this, we would need to periodically scan the pendingNodes queue to update the healthy/unhealthy node counts, and this scan could prove expensive.
   
   > Since unhealthy node getting decommissioned might anyways require some sort of retry, shall we requeue them even if the condition is not met (i.e. total no of decomm in progress < max tracked nodes) as a limited retries?
   
   If there are fewer nodes being decommissioned than max tracked nodes, then there are no nodes in the pendingNodes queue & all nodes are being tracked for decommissioning. Therefore, there is no possibility that any healthy nodes are blocked in the pendingNodes queue (preventing them from being decommissioned), & so in my opinion there is no benefit to re-queueing the unhealthy nodes in this case. Furthermore, doing so would negatively impact performance through frequent re-queueing & de-queueing.




[GitHub] [hadoop] KevinWikant commented on a change in pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r762077696



##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDecommission.java
##########
@@ -1654,4 +1658,139 @@ public Boolean get() {
 
     cleanupFile(fileSys, file);
   }
+
+  /**
+   * Test DatanodeAdminManager logic to re-queue unhealthy decommissioning nodes
+   * which are blocking the decommissioning of healthy nodes.
+   * Force the tracked nodes set to be filled with nodes lost while decommissioning,
+   * then decommission healthy nodes & validate they are decommissioned eventually.
+   */
+  @Test(timeout = 120000)
+  public void testRequeueUnhealthyDecommissioningNodes() throws Exception {
+    // Allow 3 datanodes to be decommissioned at a time
+    getConf().setInt(DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_MAX_CONCURRENT_TRACKED_NODES, 3);
+    // Disable the normal monitor runs
+    getConf()
+        .setInt(MiniDFSCluster.DFS_NAMENODE_DECOMMISSION_INTERVAL_TESTING_KEY, Integer.MAX_VALUE);
+
+    // Start cluster with 6 datanodes
+    startCluster(1, 6);
+    final FSNamesystem namesystem = getCluster().getNamesystem();
+    final BlockManager blockManager = namesystem.getBlockManager();
+    final DatanodeManager datanodeManager = blockManager.getDatanodeManager();
+    final DatanodeAdminManager decomManager = datanodeManager.getDatanodeAdminManager();
+    assertEquals(6, getCluster().getDataNodes().size());
+
+    // 3 datanodes will be "live" datanodes that are expected to be decommissioned eventually
+    final List<DatanodeDescriptor> liveNodes = getCluster().getDataNodes().subList(3, 6).stream()
+        .map(dn -> getDatanodeDesriptor(namesystem, dn.getDatanodeUuid()))
+        .collect(Collectors.toList());
+    assertEquals(3, liveNodes.size());
+
+    // 3 datanodes will be "dead" datanodes that are expected to never be decommissioned
+    final List<DatanodeDescriptor> deadNodes = getCluster().getDataNodes().subList(0, 3).stream()
+        .map(dn -> getDatanodeDesriptor(namesystem, dn.getDatanodeUuid()))
+        .collect(Collectors.toList());
+    assertEquals(3, deadNodes.size());
+
+    // Need to create some data or "isNodeHealthyForDecommissionOrMaintenance"
+    // may unexpectedly return true for a dead node
+    writeFile(getCluster().getFileSystem(), new Path("/tmp/test1"), 1, 100);
+
+    // Cause the 3 "dead" nodes to be lost while in state decommissioning
+    // and fill the tracked nodes set with those 3 "dead" nodes
+    ArrayList<DatanodeInfo> decommissionedNodes = Lists.newArrayList();
+    int expectedNumTracked = 0;
+    for (final DatanodeDescriptor deadNode : deadNodes) {

Review comment:
       should put the "waitFor" after the for loop such that the nodes can be stopped in parallel; this will improve the runtime of the test
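
   In other words, the suggestion is to stop all of the dead nodes first and only then wait for the namenode to notice, roughly like this (a sketch with illustrative helper calls, not the exact test code):

   ```
   // Stop every "dead" datanode first, without waiting in between, so the
   // shutdowns happen in parallel rather than one heartbeat-timeout at a time.
   for (final DatanodeDescriptor deadNode : deadNodes) {
     getCluster().stopDataNode(deadNode.getXferAddr());
     DFSTestUtil.setDatanodeDead(deadNode);   // mark it dead immediately
   }
   // Wait once, after the loop, for the namenode to report all of them dead.
   GenericTestUtils.waitFor(
       () -> getCluster().getNamesystem().getNumDeadDataNodes() == deadNodes.size(),
       500, 30000);
   ```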






[GitHub] [hadoop] KevinWikant commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-985647896


   @sodonnel The existing test "TestDecommissioningStatus.testDecommissionStatusAfterDNRestart" will be problematic for the proposed alternative of removing a dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager: https://github.com/apache/hadoop/pull/3746/
   
   As previously stated, removing the dead DECOMMISSION_INPROGRESS node from the DatanodeAdminManager means that when there are no LowRedundancy blocks the dead node will remain in DECOMMISSION_INPROGRESS rather than transitioning to DECOMMISSIONED
   
   This violates the expectation that the unit test is enforcing, which is that a dead DECOMMISSION_INPROGRESS node should transition to DECOMMISSIONED when there are no LowRedundancy blocks
   
   ```
   "Delete the under-replicated file, which should let the DECOMMISSION_IN_PROGRESS node become DECOMMISSIONED"
   ```
   
   https://github.com/apache/hadoop/blob/6342d5e523941622a140fd877f06e9b59f48c48b/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestDecommissioningStatus.java#L451
   
   Therefore, I think this is a good argument to remain in favor of the originally proposed change




[GitHub] [hadoop] aajisaka commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
aajisaka commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-989316188


   ping @sodonnel let me know your thoughts.




[GitHub] [hadoop] KevinWikant commented on a change in pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r774293380



##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java
##########
@@ -151,4 +162,34 @@ public int getPendingNodeCount() {
   public Queue<DatanodeDescriptor> getPendingNodes() {
     return pendingNodes;
   }
+
+  /**
+   * If node "is dead while in Decommission In Progress", it cannot be decommissioned
+   * until it becomes healthy again. If there are more pendingNodes than can be tracked
+   * & some unhealthy tracked nodes, then re-queue the unhealthy tracked nodes
+   * to avoid blocking decommissioning of healthy nodes.
+   *
+   * @param unhealthyDns The unhealthy datanodes which may be re-queued
+   * @param numDecommissioningNodes The total number of nodes being decommissioned
+   * @return List of unhealthy nodes to be re-queued

Review comment:
       corrected List to Stream

##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java
##########
@@ -35,12 +38,20 @@
 public abstract class DatanodeAdminMonitorBase
     implements DatanodeAdminMonitorInterface, Configurable {
 
+  /**
+   * Sort by lastUpdate time descending order, such that unhealthy
+   * nodes are de-prioritized given they cannot be decommissioned.
+   */
+  public static final Comparator<DatanodeDescriptor> PENDING_NODES_QUEUE_COMPARATOR =

Review comment:
       good call, modified to package private

##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java
##########
@@ -151,4 +162,34 @@ public int getPendingNodeCount() {
   public Queue<DatanodeDescriptor> getPendingNodes() {
     return pendingNodes;
   }
+
+  /**
+   * If node "is dead while in Decommission In Progress", it cannot be decommissioned
+   * until it becomes healthy again. If there are more pendingNodes than can be tracked
+   * & some unhealthy tracked nodes, then re-queue the unhealthy tracked nodes
+   * to avoid blocking decommissioning of healthy nodes.
+   *
+   * @param unhealthyDns The unhealthy datanodes which may be re-queued
+   * @param numDecommissioningNodes The total number of nodes being decommissioned
+   * @return List of unhealthy nodes to be re-queued
+   */
+  Stream<DatanodeDescriptor> identifyUnhealthyNodesToRequeue(

Review comment:
       modified to "getUnhealthyNodesToRequeue"

##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminDefaultMonitor.java
##########
@@ -270,12 +273,31 @@ private void check() {
         // an invalid state.
         LOG.warn("DatanodeAdminMonitor caught exception when processing node "
             + "{}.", dn, e);
-        pendingNodes.add(dn);
+        getPendingNodes().add(dn);
         toRemove.add(dn);
       } finally {
         iterkey = dn;
       }
     }
+
+    // Having more nodes decommissioning than can be tracked will impact decommissioning
+    // performance due to queueing delay
+    int numTrackedNodes = outOfServiceNodeBlocks.size() - toRemove.size();
+    int numQueuedNodes = getPendingNodes().size();
+    int numDecommissioningNodes = numTrackedNodes + numQueuedNodes;
+    if (numDecommissioningNodes > maxConcurrentTrackedNodes) {
+      LOG.warn(
+          "There are {} nodes decommissioning but only {} nodes will be tracked at a time. "
+              + "{} nodes are currently queued waiting to be decommissioned.",
+          numDecommissioningNodes, maxConcurrentTrackedNodes, numQueuedNodes);

Review comment:
       updated






[GitHub] [hadoop] hadoop-yetus commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
hadoop-yetus commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-991435605


   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 41s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 4 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  31m  6s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 29s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   1m 20s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m  0s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 25s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  1s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 28s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m  9s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  22m 17s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 18s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |   1m 18s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 13s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |   1m 13s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 52s |  |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 45 unchanged - 1 fixed = 45 total (was 46)  |
   | +1 :green_heart: |  mvnsite  |   1m 18s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 51s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 22s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m  9s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m  4s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  | 226m 20s |  |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 47s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 323m 13s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/2/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3675 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux 6f28e8ecb417 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 54ba1cb54ae6bd66c126412933ec859007edd273 |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/2/testReport/ |
   | Max. process+thread count | 3374 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/2/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




[GitHub] [hadoop] virajjasani commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
virajjasani commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-998059075


   Sorry, I could not get to this PR last week. I will review later this week, but I don't mean to block this work. If I find something odd or a potential improvement over this, we can always get it clarified later on the PR/Jira or create an addendum PR.
   Thanks for your work @KevinWikant, this might be really helpful going forward.
   
   With a quick glance, just one question for now: Overall it seems the goal is to improve and continue the decommissioning of healthy nodes over unhealthy ones (by removing and then re-queueing the entries), hence if few nodes are really in bad state (hardware/network issues), the plan is to keep re-queueing them until more nodes are getting decommissioned than max tracked nodes right? Since unhealthy node getting decommissioned might anyways require some sort of retry, shall we requeue them even if the condition is not met (i.e. total no of decomm in progress < max tracked nodes)? I am just thinking at high level, yet to catch up with the PR.




[GitHub] [hadoop] virajjasani commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
virajjasani commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-998061174


   > Unit testing failed due to unrelated flaky tests
   > 
   > > [ERROR] Errors:
   > > [ERROR] org.apache.hadoop.hdfs.TestRollingUpgrade.testRollback(org.apache.hadoop.hdfs.TestRollingUpgrade)
   > > [ERROR]   Run 1: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
   > > [ERROR]   Run 2: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
   > > [ERROR]   Run 3: TestRollingUpgrade.testRollback:329->waitForNullMxBean:361 » Timeout Timed out...
   > > [INFO]
   > > [WARNING] Flakes:
   > > [WARNING] org.apache.hadoop.hdfs.TestRollingUpgrade.testCheckpoint(org.apache.hadoop.hdfs.TestRollingUpgrade)
   > > [ERROR]   Run 1: TestRollingUpgrade.testCheckpoint:599->testCheckpoint:686 Test resulted in an unexpected exit
   > > [INFO]   Run 2: PASS
   
   Yeah, this test failure is not relevant. Even after the recent attempt, it is still flaky; we might need better insight into this test.




[GitHub] [hadoop] aajisaka commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
aajisaka commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-1000323824


   Merged. Thank you @KevinWikant @virajjasani @sodonnel.




[GitHub] [hadoop] virajjasani commented on a change in pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
virajjasani commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r773700080



##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java
##########
@@ -189,6 +190,29 @@ public void run() {
          * node will be removed from tracking by the pending cancel.
          */
         processCancelledNodes();
+
+        // Having more nodes decommissioning than can be tracked will impact decommissioning
+        // performance due to queueing delay
+        int numTrackedNodes = outOfServiceNodeBlocks.size();
+        int numQueuedNodes = getPendingNodes().size();
+        int numDecommissioningNodes = numTrackedNodes + numQueuedNodes;
+        if (numDecommissioningNodes > maxConcurrentTrackedNodes) {
+          LOG.warn(
+              "There are {} nodes decommissioning but only {} nodes will be tracked at a time. "

Review comment:
       nit: more readable `{} nodes are decommissioning but only.....`

##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java
##########
@@ -151,4 +162,34 @@ public int getPendingNodeCount() {
   public Queue<DatanodeDescriptor> getPendingNodes() {
     return pendingNodes;
   }
+
+  /**
+   * If node "is dead while in Decommission In Progress", it cannot be decommissioned
+   * until it becomes healthy again. If there are more pendingNodes than can be tracked
+   * & some unhealthy tracked nodes, then re-queue the unhealthy tracked nodes
+   * to avoid blocking decommissioning of healthy nodes.
+   *
+   * @param unhealthyDns The unhealthy datanodes which may be re-queued
+   * @param numDecommissioningNodes The total number of nodes being decommissioned
+   * @return List of unhealthy nodes to be re-queued
+   */
+  Stream<DatanodeDescriptor> identifyUnhealthyNodesToRequeue(

Review comment:
       nit: Since we are not only identifying but also returning the unhealthy nodes to re-queue, shall we update this method name to `identifyAndGetUnhealthyNodesToRequeue` or `getUnhealthyNodesToRequeue` ?

##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java
##########
@@ -35,12 +38,20 @@
 public abstract class DatanodeAdminMonitorBase
     implements DatanodeAdminMonitorInterface, Configurable {
 
+  /**
+   * Sort by lastUpdate time descending order, such that unhealthy
+   * nodes are de-prioritized given they cannot be decommissioned.
+   */
+  public static final Comparator<DatanodeDescriptor> PENDING_NODES_QUEUE_COMPARATOR =

Review comment:
       Let's remove public and keep default access modifier?
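
   For context, a comparator with the de-prioritization behaviour described in that javadoc could look roughly like this (a sketch using the public DatanodeInfo lastUpdate accessor; the committed field and visibility may differ):

   ```
   // Sketch only: nodes that heartbeated most recently (largest lastUpdate)
   // are dequeued first; dead nodes, whose lastUpdate is oldest, sink to the
   // back of the pendingNodes priority queue.
   static final Comparator<DatanodeDescriptor> PENDING_NODES_QUEUE_COMPARATOR =
       Comparator.comparingLong(DatanodeDescriptor::getLastUpdate).reversed();
   ```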

##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestDatanodeAdminMonitorBase.java
##########
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.hdfs.server.blockmanagement;
+
+import org.junit.Assert;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.util.Arrays;
+import java.util.List;
+import java.util.PriorityQueue;
+import java.util.stream.Collectors;
+
+import org.apache.hadoop.hdfs.protocol.DatanodeID;
+
+public class TestDatanodeAdminMonitorBase {
+  public static final Logger LOG = LoggerFactory.getLogger(TestDatanodeAdminMonitorBase.class);
+
+  // Sort by lastUpdate time descending order, such that unhealthy
+  // nodes are de-prioritized given they cannot be decommissioned.
+  private static final int NUM_DATANODE = 10;
+  private static final int[] UNORDERED_LAST_UPDATE_TIMES =
+      new int[] {0, 5, 2, 11, 0, 3, 1001, 5, 1, 103};
+  private static final int[] ORDERED_LAST_UPDATE_TIMES =
+      new int[] {1001, 103, 11, 5, 5, 3, 2, 1, 0, 0};
+  private static final int[] REVERSE_ORDER_LAST_UPDATE_TIMES =
+      new int[] {0, 0, 1, 2, 3, 5, 5, 11, 103, 1001};
+
+  private static final DatanodeDescriptor[] NODES;
+
+  static {
+    NODES = new DatanodeDescriptor[NUM_DATANODE];
+    for (int i = 0; i < NUM_DATANODE; i++) {
+      NODES[i] = new DatanodeDescriptor(DatanodeID.EMPTY_DATANODE_ID);
+      NODES[i].setLastUpdate(UNORDERED_LAST_UPDATE_TIMES[i]);
+      NODES[i].setLastUpdateMonotonic(UNORDERED_LAST_UPDATE_TIMES[i]);
+    }
+  }
+
+  /**
+   * Verify that DatanodeAdminManager pendingNodes priority queue
+   * correctly orders the nodes by lastUpdate time descending.
+   */
+  @Test
+  public void testPendingNodesQueueOrdering() {
+    final PriorityQueue<DatanodeDescriptor> pendingNodes =
+        new PriorityQueue<>(10,
+            DatanodeAdminMonitorBase.PENDING_NODES_QUEUE_COMPARATOR);
+
+    for (int i = 0; i < NUM_DATANODE; i++) {
+      pendingNodes.add(NODES[i]);
+    }

Review comment:
       `pendingNodes.addAll(Arrays.asList(NODES).subList(0, NUM_DATANODE))` would be a better alternative, using the Collection copy API instead of a manual copy

##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminDefaultMonitor.java
##########
@@ -270,12 +273,31 @@ private void check() {
         // an invalid state.
         LOG.warn("DatanodeAdminMonitor caught exception when processing node "
             + "{}.", dn, e);
-        pendingNodes.add(dn);
+        getPendingNodes().add(dn);
         toRemove.add(dn);
       } finally {
         iterkey = dn;
       }
     }
+
+    // Having more nodes decommissioning than can be tracked will impact decommissioning
+    // performance due to queueing delay
+    int numTrackedNodes = outOfServiceNodeBlocks.size() - toRemove.size();
+    int numQueuedNodes = getPendingNodes().size();
+    int numDecommissioningNodes = numTrackedNodes + numQueuedNodes;
+    if (numDecommissioningNodes > maxConcurrentTrackedNodes) {
+      LOG.warn(
+          "There are {} nodes decommissioning but only {} nodes will be tracked at a time. "
+              + "{} nodes are currently queued waiting to be decommissioned.",
+          numDecommissioningNodes, maxConcurrentTrackedNodes, numQueuedNodes);

Review comment:
       Same as above, `{} nodes are decommissioning but only`...

##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java
##########
@@ -151,4 +162,34 @@ public int getPendingNodeCount() {
   public Queue<DatanodeDescriptor> getPendingNodes() {
     return pendingNodes;
   }
+
+  /**
+   * If node "is dead while in Decommission In Progress", it cannot be decommissioned
+   * until it becomes healthy again. If there are more pendingNodes than can be tracked
+   * & some unhealthy tracked nodes, then re-queue the unhealthy tracked nodes
+   * to avoid blocking decommissioning of healthy nodes.
+   *
+   * @param unhealthyDns The unhealthy datanodes which may be re-queued
+   * @param numDecommissioningNodes The total number of nodes being decommissioned
+   * @return List of unhealthy nodes to be re-queued

Review comment:
       nit: `@return Stream of unhealthy....`






[GitHub] [hadoop] KevinWikant commented on a change in pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on a change in pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#discussion_r762078547



##########
File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminBackoffMonitor.java
##########
@@ -189,6 +190,30 @@ public void run() {
          * node will be removed from tracking by the pending cancel.
          */
         processCancelledNodes();
+
+        // Having more nodes decommissioning than can be tracked will impact decommissioning
+        // performance due to queueing delay
+        int numTrackedNodes = outOfServiceNodeBlocks.size();
+        int numQueuedNodes = getPendingNodes().size();
+        int numDecommissioningNodes = numTrackedNodes + numQueuedNodes;
+        if (numDecommissioningNodes > maxConcurrentTrackedNodes) {
+          LOG.warn(
+              "There are {} nodes decommissioning but only {} nodes will be tracked at a time. "
+                  + "{} nodes are currently queued waiting to be decommissioned.",
+              numDecommissioningNodes, maxConcurrentTrackedNodes, numQueuedNodes);
+
+          // Re-queue unhealthy nodes to make space for decommissioning healthy nodes
+          final List<DatanodeDescriptor> unhealthyDns = outOfServiceNodeBlocks.keySet().stream()
+              .filter(dn -> !blockManager.isNodeHealthyForDecommissionOrMaintenance(dn))
+              .collect(Collectors.toList());
+          final List<DatanodeDescriptor> toRequeue =
+              identifyUnhealthyNodesToRequeue(unhealthyDns, numDecommissioningNodes);
+          for (DatanodeDescriptor dn : toRequeue) {
+            getPendingNodes().add(dn);
+            outOfServiceNodeBlocks.remove(dn);

Review comment:
       I think I may also need to remove from "pendingRep" here






[GitHub] [hadoop] KevinWikant commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-985024997


   I have implemented the alternative where dead DECOMMISSION_INPROGRESS nodes are removed from the DatanodeAdminManager immediately: https://github.com/apache/hadoop/pull/3746
   
   I created a separate PR so that the 2 alternatives can be compared; the diff for each alternative is quite different
   
   @sodonnel @virajjasani @aajisaka let me know your thoughts




[GitHub] [hadoop] KevinWikant commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-975778252


   > If a DN goes dead and then re-registers, it will be added back into the pending nodes, so I don't think we need to continue to track it in the decommission monitor when it goes dead. We can just handle the dead event in the decommission monitor and stop tracking it, clearing a slot for another node. Then it will be re-added if it comes back by existing logic above.
   
   Thanks for sharing this! I will make this change, test it, & get back to you




[GitHub] [hadoop] KevinWikant commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-975759431


   > If a node goes dead while decommissioning, would it not be better to just remove it from the decommission monitor rather than keep tracking it there at all? If the node comes alive again, it should be entered back into the monitor.
   > We could either detect it is dead in the monitor and remove it from tracking then, or have the place that logs the mentioned "node is dead while decommission in progress" remove it from the monitor.
   
   I agree with this statement & this is essentially the behavior this code change provides, with 1 caveat: "If the node comes alive again _OR if there are no (potentially healthy) nodes in the pendingNodes queue_, it will be entered back into the monitor"
   
   Consider the 2 alternatives:
   A) Do not track an unhealthy node until it becomes healthy
   B) Do not track an unhealthy node until it becomes healthy OR until there are no (potentially healthy) nodes in the pendingNodes queue
   
   Note that neither alternative prevents healthy nodes from being decommissioned. With Option B), if there are more nodes being decommissioned than can be tracked, any unhealthy nodes being tracked will be removed from the tracked set & re-queued (with a priority lower than all healthy nodes in the priority queue); then the healthy nodes will be de-queued & moved to the tracked set
   
   It may seem Option A) is more performant as it avoids tracking unhealthy nodes on each DatanodeAdminMonitor cycle, but this may not be the case, as we will still need to check the health status of these unhealthy queued nodes to determine if they should be tracked. Furthermore, tracking unhealthy nodes is the current behavior & as far as I know it has not caused any performance problems.
   
   With Option B), unhealthy nodes are only tracked (and their health status is only checked) when there are fewer healthy nodes than "dfs.namenode.decommission.max.concurrent.tracked.nodes". This may result in superior performance for Option B) as it will only need to check the health of up to "dfs.namenode.decommission.max.concurrent.tracked.nodes" unhealthy nodes, rather than having to check the health of all nodes in the pendingNodes queue.
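
   To make the Option B) flow concrete, each monitor tick would behave roughly as follows (a sketch with assumed names such as trackedNodes, pendingNodes, and isHealthy; not the actual patch code):

   ```
   // Sketch only: re-queue unhealthy tracked nodes only when the total number
   // of decommissioning nodes exceeds the tracked-node limit, so that queued
   // healthy nodes can take their tracked slots.
   int numDecommissioning = trackedNodes.size() + pendingNodes.size();
   if (numDecommissioning > maxConcurrentTrackedNodes) {
     for (DatanodeDescriptor dn : new ArrayList<>(trackedNodes)) {
       if (!isHealthy(dn)) {
         trackedNodes.remove(dn);
         pendingNodes.add(dn);  // priority queue de-prioritizes dead nodes
       }
     }
   }
   // Healthy nodes at the head of pendingNodes are then de-queued into the
   // tracked set as usual, up to maxConcurrentTrackedNodes.
   ```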
   
   Note that the log "node is dead while decommission in progress", occurs from within the DatanodeAdminMonitor:
   - https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminDefaultMonitor.java#L240
   - https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L4549




[GitHub] [hadoop] sodonnel commented on pull request #3675: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
sodonnel commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-975769423


   `DatanodeManager.registerDatanode()` has logic to add a node into the decommission workflow if the node has a DECOMMISSIONED status in the hosts / combined hosts file, as it calls `startAdminOperationIfNecessary`, which looks like:
   
   ```
   void startAdminOperationIfNecessary(DatanodeDescriptor nodeReg) {
       long maintenanceExpireTimeInMS =
           hostConfigManager.getMaintenanceExpirationTimeInMS(nodeReg);
       // If the registered node is in exclude list, then decommission it
       if (getHostConfigManager().isExcluded(nodeReg)) {
         datanodeAdminManager.startDecommission(nodeReg);
       } else if (nodeReg.maintenanceNotExpired(maintenanceExpireTimeInMS)) {
         datanodeAdminManager.startMaintenance(nodeReg, maintenanceExpireTimeInMS);
       }
     }
   ```
   If a DN goes dead and then re-registers, it will be added back into the pending nodes, so I don't think we need to continue to track it in the decommission monitor when it goes dead. We can just handle the dead event in the decommission monitor and stop tracking it, clearing a slot for another node. Then it will be re-added if it comes back by existing logic above.

