Posted to common-issues@hadoop.apache.org by GitBox <gi...@apache.org> on 2021/11/16 23:18:24 UTC

[GitHub] [hadoop] KevinWikant opened a new pull request #3667: HDFS-16303. Improve handling of datanode lost while decommissioning

KevinWikant opened a new pull request #3667:
URL: https://github.com/apache/hadoop/pull/3667


   ### Description of PR
   
   Fixes a bug in Hadoop HDFS where, if more than "dfs.namenode.decommission.max.concurrent.tracked.nodes" datanodes are lost while in the decommissioning state, all forward progress towards decommissioning any datanodes (including healthy datanodes) is blocked.
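   The behaviour of the fix can be illustrated with a toy sketch (hypothetical class, field, and method names; this is not the actual DatanodeAdminManager code): when the tracked set is saturated and nodes are still queued, dead nodes in "Decommission In Progress" are pushed to the back of the pending queue so that healthy queued nodes can take the tracked slots.
   
   ```java
   import java.util.ArrayDeque;
   import java.util.ArrayList;
   import java.util.Deque;
   import java.util.List;
   import java.util.function.Predicate;
   
   // Toy sketch of the re-queue behaviour (hypothetical names, not the real
   // DatanodeAdminManager): dead nodes no longer pin tracked slots forever.
   class DecommissionQueueSketch {
       static final int MAX_TRACKED = 10; // dfs.namenode.decommission.max.concurrent.tracked.nodes
   
       final Deque<String> pending = new ArrayDeque<>(); // queued, waiting to be tracked
       final List<String> tracked = new ArrayList<>();   // actively being decommissioned
   
       // One simplified pass of the admin monitor.
       void tick(Predicate<String> isDead) {
           // Fill free tracked slots from the head of the pending queue.
           while (tracked.size() < MAX_TRACKED && !pending.isEmpty()) {
               tracked.add(pending.poll());
           }
           // The fix: when the tracked limit is reached and nodes are still
           // queued, move dead tracked nodes to the BACK of the pending queue
           // so that healthy queued nodes can be tracked on the next pass.
           if (tracked.size() >= MAX_TRACKED && !pending.isEmpty()) {
               List<String> dead = new ArrayList<>();
               for (String dn : tracked) {
                   if (isDead.test(dn)) {
                       dead.add(dn);
                   }
               }
               tracked.removeAll(dead);
               pending.addAll(dead);
           }
       }
   
       public static void main(String[] args) {
           DecommissionQueueSketch s = new DecommissionQueueSketch();
           for (int i = 0; i < 10; i++) s.pending.add("dead-" + i);
           for (int i = 0; i < 10; i++) s.pending.add("healthy-" + i);
           Predicate<String> isDead = dn -> dn.startsWith("dead-");
           s.tick(isDead); // dead nodes get tracked, then re-queued behind healthy ones
           s.tick(isDead); // healthy nodes now occupy all tracked slots
           System.out.println("tracked=" + s.tracked + " pending=" + s.pending);
       }
   }
   ```
   
   Without the re-queue step, the first ten (dead) nodes would occupy the tracked slots indefinitely and the healthy nodes behind them would never be processed.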
   
   ### How was this patch tested?
   
   #### Unit Testing
   
   Added new unit tests:
   - TestDecommission.testRequeueUnhealthyDecommissioningNodes
   - DatanodeAdminMonitorBase.testPendingNodesQueueOrdering
   - DatanodeAdminMonitorBase.testPendingNodesQueueReverseOrdering
   
   All "TestDecommission" & "DatanodeAdminMonitorBase" tests pass when run locally
   
   Note that, without the "DatanodeAdminManager" changes, the new test "testRequeueUnhealthyDecommissioningNodes" fails because it times out waiting for the healthy nodes to be decommissioned:
   
   ```
   > mvn -Dtest=TestDecommission#testRequeueUnhealthyDecommissioningNodes test
   ...
   [ERROR] Errors: 
   [ERROR]   TestDecommission.testRequeueUnhealthyDecommissioningNodes:1772 » Timeout Timed...
   ```
   
   #### Manual Testing
   
   - create Hadoop cluster with:
       - 30 datanodes initially
       - hdfs-site configuration "dfs.namenode.decommission.max.concurrent.tracked.nodes = 10"
       - custom Namenode JAR containing this change
   
   ```
   > cat /etc/hadoop/conf/hdfs-site.xml | grep -A 1 'tracked'
       <name>dfs.namenode.decommission.max.concurrent.tracked.nodes</name>
       <value>10</value>
   ```
   
   - reproduce the bug: https://issues.apache.org/jira/browse/HDFS-16303
       - start decommissioning over 20 datanodes
       - terminate 20 datanodes while decommissioning
       - observe the Namenode logs to validate that there are 20 unhealthy datanodes stuck in "Decommission In Progress"
   
   ```
   2021-11-15 17:57:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
   2021-11-15 17:57:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 17:58:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
   2021-11-15 17:58:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 17:58:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
   2021-11-15 17:58:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 17:59:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
   2021-11-15 17:59:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.
   ```
   
   - scale up to 25 healthy datanodes & then decommission 22 of those datanodes (all but 3)
       - observe the Namenode logs to validate that those 22 healthy datanodes are decommissioned (i.e. HDFS-16303 is solved)
   
   ```
   2021-11-15 17:59:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
   2021-11-15 17:59:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 18:00:14,487 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 42 nodes decommissioning but only 10 nodes will be tracked at a time. 32 nodes are currently queued waiting to be decommissioned.
   2021-11-15 18:00:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 42 nodes decommissioning but only 10 nodes will be tracked at a time. 32 nodes are currently queued waiting to be decommissioned.
   2021-11-15 18:01:14,486 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 32 nodes decommissioning but only 10 nodes will be tracked at a time. 32 nodes are currently queued waiting to be decommissioned.
   2021-11-15 18:01:44,486 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 32 nodes decommissioning but only 10 nodes will be tracked at a time. 22 nodes are currently queued waiting to be decommissioned.
   2021-11-15 18:02:14,486 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 22 nodes decommissioning but only 10 nodes will be tracked at a time. 22 nodes are currently queued waiting to be decommissioned.
   
   2021-11-15 18:02:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 12 nodes are currently queued waiting to be decommissioned.
   2021-11-15 18:02:44,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 8 nodes which are dead while in Decommission In Progress.
   
   2021-11-15 18:03:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): There are 20 nodes decommissioning but only 10 nodes will be tracked at a time. 10 nodes are currently queued waiting to be decommissioned.
   2021-11-15 18:03:14,485 WARN org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager (DatanodeAdminMonitor-0): dfs.namenode.decommission.max.concurrent.tracked.nodes limit has been reached, re-queueing 10 nodes which are dead while in Decommission In Progress.
   ```
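   For the validation steps above, a small helper (hypothetical, written for this write-up) can extract the decommissioning/tracked/queued counts from the DatanodeAdminManager WARN lines instead of eyeballing them:
   
   ```java
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;
   
   // Hypothetical parser for the WARN lines shown above; matches the message
   // text emitted by DatanodeAdminMonitor.
   class DecomLogParser {
       static final Pattern P = Pattern.compile(
           "There are (\\d+) nodes decommissioning but only (\\d+) nodes will be "
           + "tracked at a time\\. (\\d+) nodes are currently queued");
   
       // Returns {decommissioning, tracked, queued}, or null if the line
       // is not a matching WARN message.
       static int[] parse(String line) {
           Matcher m = P.matcher(line);
           if (!m.find()) {
               return null;
           }
           return new int[] {
               Integer.parseInt(m.group(1)),
               Integer.parseInt(m.group(2)),
               Integer.parseInt(m.group(3))
           };
       }
   
       public static void main(String[] args) {
           int[] counts = parse("There are 20 nodes decommissioning but only 10 "
               + "nodes will be tracked at a time. 10 nodes are currently queued "
               + "waiting to be decommissioned.");
           System.out.println("decommissioning=" + counts[0]
               + " tracked=" + counts[1] + " queued=" + counts[2]);
       }
   }
   ```
   
   Applied to the log excerpts above, the queued count dropping from 32 to 22 to 12 over successive passes is what shows the healthy nodes making progress despite the dead nodes in the queue.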
   
   ### For code changes:
   
   - [yes] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [no] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
   - [n/a] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [no] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org


[GitHub] [hadoop] KevinWikant closed pull request #3667: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
KevinWikant closed pull request #3667:
URL: https://github.com/apache/hadoop/pull/3667


   




[GitHub] [hadoop] hadoop-yetus commented on pull request #3667: HDFS-16303. Improve handling of datanode lost while decommissioning

Posted by GitBox <gi...@apache.org>.
hadoop-yetus commented on pull request #3667:
URL: https://github.com/apache/hadoop/pull/3667#issuecomment-971323286


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   1m 24s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 4 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  37m 29s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 46s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  compile  |   1m 35s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  checkstyle  |   1m 10s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 43s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 10s |  |  trunk passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 39s |  |  trunk passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 54s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  27m  6s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 34s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 40s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javac  |   1m 40s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 24s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  javac  |   1m 24s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | -0 :warning: |  checkstyle  |   0m 57s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3667/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 5 new + 45 unchanged - 1 fixed = 50 total (was 46)  |
   | +1 :green_heart: |  mvnsite  |   1m 32s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m  1s |  |  the patch passed with JDK Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 33s |  |  the patch passed with JDK Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10  |
   | +1 :green_heart: |  spotbugs  |   3m 54s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  26m 56s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 395m 51s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3667/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 51s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 513m 13s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.TestDecommissionWithBackoffMonitor |
   |   | hadoop.hdfs.TestHDFSFileSystemContract |
   |   | hadoop.hdfs.web.TestWebHdfsFileSystemContract |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3667/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/3667 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell |
   | uname | Linux df4ebb1bd981 4.15.0-143-generic #147-Ubuntu SMP Wed Apr 14 16:10:11 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 418468c9f91f4cf60057afe2bec9d125c92de1bc |
   | Default Java | Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.11+9-Ubuntu-0ubuntu2.20.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_292-8u292-b10-0ubuntu1~20.04-b10 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3667/1/testReport/ |
   | Max. process+thread count | 1990 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3667/1/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0-SNAPSHOT https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   

