Posted to common-issues@hadoop.apache.org by GitBox <gi...@apache.org> on 2022/06/06 14:22:14 UTC

[GitHub] [hadoop] KevinWikant opened a new pull request, #4410: HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas

KevinWikant opened a new pull request, #4410:
URL: https://github.com/apache/hadoop/pull/4410

   HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas
   
   ### Description of PR
   
   Bug fix for a recurring HDFS bug which can result in datanodes being unable to complete decommissioning indefinitely. In short, the bug is a chicken-and-egg problem where:
   - in order for a datanode to be decommissioned its blocks must be sufficiently replicated
   - the datanode cannot sufficiently replicate some block(s) because of corrupt block replicas on the target datanodes
   - the corrupt block replicas will not be invalidated because the block(s) are not sufficiently replicated
   
   In this scenario the block(s) actually are sufficiently replicated, but the logic the Namenode uses to determine whether a block is sufficiently replicated is flawed.
   
   To understand the bug further we must first establish some background information.
   
   #### Background Information
   
   Givens:
   - FSDataOutputStream is being used to write the HDFS file; under the hood this uses the DataStreamer class
   - for the sake of example we will say the HDFS file has a replication factor of 2, though this is not a requirement to reproduce the issue
   - the file is being appended to intermittently over an extended period of time (in general, this issue needs minutes/hours to reproduce)
   - HDFS is configured with typical default configurations
   
   Under certain scenarios the DataStreamer client can detect a bad link when trying to append to the block pipeline; in this case the DataStreamer client can shift the block pipeline by replacing the bad link with a new datanode. When this happens, the replica on the datanode that was shifted away from becomes corrupt because it no longer has the latest generation stamp for the block. As a more concrete example:
   - DataStreamer client creates the block pipeline on datanodes A & B, each of which has a block replica with generation stamp 1
   - DataStreamer client tries to append to the block by sending a block transfer (with generation stamp 2) to datanode A
   - Datanode A succeeds in writing the block transfer & then attempts to forward the transfer to datanode B
   - Datanode B fails the transfer for some reason and responds with a PipelineAck failure code
   - Datanode A sends a PipelineAck to the DataStreamer indicating that datanode A succeeded in the append & datanode B failed. The DataStreamer marks datanode B as a bad link, which will be replaced before the next append operation
   - at this point datanode A has a live replica with generation stamp 2 & datanode B has a corrupt replica with generation stamp 1
   - the next time the DataStreamer tries to append to the block it will call the Namenode "getAdditionalDatanode" API, which returns some other datanode C
   - DataStreamer sends a data transfer (with generation stamp 3) to the new block pipeline containing datanodes A & C; the append succeeds on both datanodes
   - end state is that:
     - datanodes A & C have live replicas with the latest generation stamp 3
     - datanode B has a corrupt replica because it is lagging behind with generation stamp 1
   
   The key behavior being highlighted here is that when the DataStreamer client shifts the block pipeline due to append failures on a subset of the datanodes in the pipeline, a corrupt block replica gets left over on the datanode that was shifted away from.
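
   To make the generation-stamp mechanics concrete, here is a minimal, hypothetical model (invented names, not Hadoop's actual classes) of why the leftover replica is treated as corrupt: any replica whose generation stamp lags the block's latest stamp is stale.

   ```java
   // Hypothetical model of generation-stamp staleness (invented names, not
   // Hadoop's actual classes): a replica whose generation stamp lags the
   // block's latest stamp is stale and treated as corrupt.
   import java.util.LinkedHashMap;
   import java.util.Map;

   public class GenStampSketch {
       // Latest generation stamp the Namenode has recorded for the block.
       static final long LATEST_GEN_STAMP = 3L;

       static boolean isCorrupt(long replicaGenStamp) {
           return replicaGenStamp < LATEST_GEN_STAMP;
       }

       public static void main(String[] args) {
           Map<String, Long> replicas = new LinkedHashMap<>();
           replicas.put("A", 3L); // kept up through every pipeline shift
           replicas.put("C", 3L); // added via getAdditionalDatanode
           replicas.put("B", 1L); // left behind when the pipeline shifted away
           replicas.forEach((dn, gs) ->
               System.out.println("datanode " + dn + " corrupt=" + isCorrupt(gs)));
       }
   }
   ```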
   
   This corrupt block replica makes the datanode ineligible as a replication target for the block, which manifests as the following exception:
   
   ```
   2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 dst: /DN3:9866; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
   ```
   
   What typically occurs is that these corrupt block replicas will be invalidated by the Namenode, causing the corrupt replica to be deleted on the datanode and thus allowing the datanode to once again be a replication target for the block. Note that the Namenode will not identify the corrupt block replica until the datanode sends its next block report; with the default block report interval this can take up to 6 hours.
   
   As long as there is 1 live replica of the block, all the corrupt replicas should be invalidated based on this condition: https://github.com/apache/hadoop/blob/7bd7725532fd139d2e0e1662df7700f7ab95067a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L1928
   
   When there are 0 live replicas the corrupt replicas are not invalidated; I assume the reasoning behind this is that it's better to have some corrupt copies of the block than to have no copies at all.
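
   For illustration, the gate can be paraphrased roughly as follows (a simplified, hypothetical sketch of the linked BlockManager condition, not the actual code; the names and hard-coded default are assumptions):

   ```java
   // Simplified, hypothetical paraphrase of the invalidation gate (invented
   // names, not the actual BlockManager code).
   public class InvalidationGateSketch {
       // dfs.namenode.replication.min defaults to 1.
       static final int MIN_REPLICATION = 1;

       // Current behavior: corrupt replicas are only invalidated when enough
       // LIVE replicas exist; decommissioning replicas do not count.
       static boolean shouldInvalidateCorrupt(int liveReplicas) {
           return liveReplicas >= MIN_REPLICATION;
       }

       public static void main(String[] args) {
           // 1 live replica: corrupt copies elsewhere get invalidated.
           System.out.println(shouldInvalidateCorrupt(1));
           // 0 live replicas (e.g. only a decommissioning replica): the
           // corrupt copies are kept, even though a usable replica exists.
           System.out.println(shouldInvalidateCorrupt(0));
       }
   }
   ```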
   
   #### Description of Problem
   
   The problem comes into play when the aforementioned behavior is coupled with decommissioning and/or entering maintenance.
   
   Consider the following scenario:
   - block has replication factor of 2
   - there are 3 datanodes A, B, & C
   - datanode A has decommissioning replica
   - datanodes B & C have corrupt replicas
   
   This scenario is essentially a decommissioning & replication deadlock because:
   - corrupt replicas on B & C will not be invalidated because there are 0 live replicas (as per Namenode logic)
   - datanode A cannot finish decommissioning until the block is replicated to datanodes B & C
   - the block cannot be replicated to datanodes B & C until their corrupt replicas are invalidated
   
   This does not need to be a deadlock: the corrupt replicas could be invalidated & the live replica could be transferred from A to B & C.
   
   The same problem can exist on a larger scale; the requirements are that:
   - liveReplicas < minReplicationFactor (minReplicationFactor=1 by default)
   - decommissioningAndEnteringMaintenanceReplicas > 0
   - liveReplicas + decommissioningAndEnteringMaintenanceReplicas + corruptReplicas = numberOfDatanodes
   
   In this case the corrupt replicas will not be invalidated by the Namenode which means that the decommissioning and entering maintenance replicas will never be sufficiently replicated and therefore will never finish decommissioning or entering maintenance.
   
   The symptom of this issue in the logs is that, right after a node with a corrupt replica sends its block report, the replica simply gets counted as corrupt rather than being invalidated:
   
   ```
   TODO
   ```
   
   #### Proposed Solution
   
   Rather than checking minReplicationSatisfied based on live replicas alone, check it based on usable replicas (i.e. live + decommissioning + enteringMaintenance). This allows the corrupt replicas to be invalidated & the live replica(s) on the decommissioning node(s) to be sufficiently replicated.
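
   The proposed change can be sketched as follows (hypothetical names; a simplified model of the gate, not the actual BlockManager change), using the deadlock scenario above as the usage example:

   ```java
   // Hypothetical sketch contrasting the existing and proposed gates
   // (invented names; a model, not the actual BlockManager change).
   public class UsableReplicasSketch {
       static final int MIN_REPLICATION = 1; // dfs.namenode.replication.min

       static boolean oldGate(int live) {
           return live >= MIN_REPLICATION;
       }

       // Proposed: count decommissioning and entering-maintenance replicas
       // as usable when deciding whether to invalidate corrupt replicas.
       static boolean newGate(int live, int decommissioning, int enteringMaintenance) {
           int usable = live + decommissioning + enteringMaintenance;
           return usable >= MIN_REPLICATION;
       }

       public static void main(String[] args) {
           // Deadlock scenario above: A decommissioning, B & C corrupt.
           int live = 0, decommissioning = 1, enteringMaintenance = 0;
           System.out.println("old gate invalidates: " + oldGate(live));
           System.out.println("new gate invalidates: "
               + newGate(live, decommissioning, enteringMaintenance));
       }
   }
   ```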
   
   The only perceived risk here would be that the corrupt blocks are invalidated at around the same time the decommissioning and entering maintenance nodes are decommissioned. This could in theory bring the overall number of replicas below minReplicationFactor (to 0 in the worst case). This is, however, not an actual risk, because the decommissioning and entering maintenance nodes will not finish decommissioning until they have a sufficient number of live replicas; so there is no possibility that the decommissioning and entering maintenance nodes will be decommissioned prematurely.
   
   ### How was this patch tested?
   
   Added a unit test "testDeleteCorruptReplicaForUnderReplicatedBlock"
   
   - TODO
   
   ### For code changes:
   
   - [X] Does the title or this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [n/a] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
   - [n/a] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [n/a] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org


[GitHub] [hadoop] ZanderXu commented on pull request #4410: HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas

Posted by GitBox <gi...@apache.org>.
ZanderXu commented on PR #4410:
URL: https://github.com/apache/hadoop/pull/4410#issuecomment-1157281252

   @KevinWikant Nice catch +1. I learned a lot from it. thanks~



[GitHub] [hadoop] KevinWikant commented on pull request #4410: HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on PR #4410:
URL: https://github.com/apache/hadoop/pull/4410#issuecomment-1154627715

   Hi @ashutoshcipher thank you for reviewing the PR! I have tried to address your comments below:
   
   ## I am thinking could there be a case where the block can show up in both liveReplicas and decommissioning, which could lead to an unnecessary call to invalidateCorruptReplicas()
   
   I am not sure it's possible for a block replica to be reported as both a liveReplica & a decommissioningReplica at the same time. My understanding is that a block replica on a decommissioning datanode is counted as a decommissioningReplica & a non-corrupt replica on a live datanode is counted as a liveReplica. So a block replica on a decommissioning node will only be counted towards the decommissioningReplica count & not the liveReplica count. I have never seen the behavior you are mentioning in my experience; let me know if you have a reference JIRA.
   
   If the case you described is possible then in theory numUsableReplica would be greater than it should be. In the typical case where "dfs.namenode.replication.min = 1" this makes no difference: even if there is only 1 non-corrupt block replica on 1 decommissioning node, the corrupt blocks are invalidated regardless of whether numUsableReplica=1 or numUsableReplica=2 (due to double counting of the replica as both liveReplica & decommissioningReplica). In the case where "dfs.namenode.replication.min > 1" there could arguably be a difference, because the corrupt replicas would not be invalidated if numUsableReplica=1 but would be invalidated if numUsableReplica=2 (due to the same double counting).
   
   I think if this scenario is possible the correct fix would be to ensure that each block replica is only counted once towards liveReplica or decommissioningReplica. Let me know if there is a JIRA for this issue & I can potentially look into the bug fix separately.
   
   ## Edge case coming to my mind is when we are considering the block on decommissioning node as usable but the very next moment, the node is decommissioned.
   
   Fair point, I had considered this & mentioned this edge case in the PR description:
   
   ```
   The only perceived risk here would be that the corrupt blocks are invalidated at around the same time the decommissioning and entering maintenance nodes are decommissioned. This could in theory bring the overall number of replicas below the "dfs.namenode.replication.min" (i.e. to 0 replicas in the worst case). This is however not an actual risk because the decommissioning and entering maintenance nodes will not finish decommissioning until there is a sufficient number of liveReplicas; so there is no possibility that the decommissioning and entering maintenance nodes will be decommissioned prematurely.
   ```
   
   Any replicas on decommissioning nodes will not be decommissioned until there are enough liveReplicas to satisfy the replication factor of the HDFS file. So it's only possible for decommissioningReplicas to be decommissioned at the same time corruptReplicas are invalidated if there are sufficient liveReplicas to satisfy the replication factor; because of those live replicas it's safe to eliminate both the decommissioningReplicas & the corruptReplicas. If there is not a sufficient number of liveReplicas, then the decommissioningReplicas will not be decommissioned but the corruptReplicas will be invalidated; the block will not be lost because the decommissioningReplicas will still exist.
   
   One case you could argue is that if:
   - corruptReplicas > 1
   - decommissioningReplicas = 1
   
   By invalidating the corruptReplicas we are exposing the block to a risk of loss should the decommissioningReplica be unexpectedly terminated or fail. This is true, but the same risk already exists where:
   - corruptReplicas > 1
   - liveReplicas = 1
   
   By invalidating the corruptReplicas we are exposing the block to a risk of loss should the liveReplica be unexpectedly terminated or fail.
   
   So I don't think this change introduces any new risk of data loss; in fact, it helps improve block redundancy in scenarios where a block cannot be sufficiently replicated because the corruptReplicas cannot be invalidated.
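
   The safety argument can be modeled as two independent checks (hypothetical names; a sketch of the invariants described above, not Hadoop code): invalidation may fire based on usable replicas, but decommissioning only completes once live replicas alone satisfy the replication factor, so the last usable copy is never deleted.

   ```java
   // Hypothetical model of the two invariants argued above (invented names,
   // not Hadoop code): invalidation may fire on usable replicas, but a
   // decommissioning node is only released once live replicas alone satisfy
   // the replication factor, so the last usable copy is never deleted.
   public class SafetySketch {
       static boolean canFinishDecommission(int live, int replicationFactor) {
           return live >= replicationFactor;
       }

       static boolean invalidatesCorrupt(int live, int decommissioning, int minReplication) {
           return live + decommissioning >= minReplication;
       }

       public static void main(String[] args) {
           int replicationFactor = 2, minReplication = 1;
           // Worst case: 0 live, 1 decommissioning, several corrupt replicas.
           int live = 0, decommissioning = 1;
           // The corrupt copies are invalidated...
           System.out.println(invalidatesCorrupt(live, decommissioning, minReplication));
           // ...but the decommissioning node is NOT released, so its replica survives.
           System.out.println(canFinishDecommission(live, replicationFactor));
       }
   }
   ```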




[GitHub] [hadoop] ashutoshcipher commented on pull request #4410: HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas

Posted by GitBox <gi...@apache.org>.
ashutoshcipher commented on PR #4410:
URL: https://github.com/apache/hadoop/pull/4410#issuecomment-1159766297

   > Filed [HDFS-16635](https://issues.apache.org/jira/browse/HDFS-16635) to fix javadoc error.
   
   @aajisaka  Raised PR for the same.




[GitHub] [hadoop] aajisaka commented on pull request #4410: HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas

Posted by GitBox <gi...@apache.org>.
aajisaka commented on PR #4410:
URL: https://github.com/apache/hadoop/pull/4410#issuecomment-1158535982

   Filed HDFS-16635 to fix javadoc error.




[GitHub] [hadoop] KevinWikant commented on pull request #4410: HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on PR #4410:
URL: https://github.com/apache/hadoop/pull/4410#issuecomment-1157654335

   @ashutoshcipher @aajisaka @ZanderXu really appreciate the reviews on this PR, thank you!
   
   @aajisaka I have removed the unused imports, please let me know if you have any other comments/concerns




[GitHub] [hadoop] KevinWikant commented on pull request #4410: HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas

Posted by GitBox <gi...@apache.org>.
KevinWikant commented on PR #4410:
URL: https://github.com/apache/hadoop/pull/4410#issuecomment-1147944328

   The failing unit test is:
   
   ```
   Failed junit tests  |  hadoop.hdfs.tools.TestDFSAdmin
   ```
   
   This seems to be unrelated to my change and appears to be a flaky unit test:
   
   ```
   [ERROR] testAllDatanodesReconfig(org.apache.hadoop.hdfs.tools.TestDFSAdmin)  Time elapsed: 5.489 s  <<< FAILURE!
   org.junit.ComparisonFailure: expected:<[SUCCESS: Changed property dfs.datanode.peer.stats.enabled]> but was:<[  From: "false"]>
       at org.junit.Assert.assertEquals(Assert.java:117)
       at org.junit.Assert.assertEquals(Assert.java:146)
       at org.apache.hadoop.hdfs.tools.TestDFSAdmin.testAllDatanodesReconfig(TestDFSAdmin.java:1208)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
       at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
       at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
       at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
       at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
       at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
       at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
       at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
       at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
       at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
       at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
       at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
       at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
       at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
       at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
       at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
       at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
       at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
       at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
       at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
       at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
       at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
       at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
       at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
       at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
   ```




[GitHub] [hadoop] hadoop-yetus commented on pull request #4410: HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas

Posted by GitBox <gi...@apache.org>.
hadoop-yetus commented on PR #4410:
URL: https://github.com/apache/hadoop/pull/4410#issuecomment-1147898690

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |  13m 52s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  36m 55s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 44s |  |  trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  compile  |   1m 37s |  |  trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   1m 26s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 47s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 24s |  |  trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 47s |  |  trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   3m 40s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  22m 57s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 21s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 27s |  |  the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javac  |   1m 27s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 19s |  |  the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 19s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | -0 :warning: |  checkstyle  |   1m  1s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 3 new + 100 unchanged - 0 fixed = 103 total (was 100)  |
   | +1 :green_heart: |  mvnsite  |   1m 25s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 58s |  |  the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 30s |  |  the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   3m 24s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  22m 24s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 250m 34s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m 15s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 372m 10s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.tools.TestDFSAdmin |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4410 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux bda194b52c15 4.15.0-156-generic #163-Ubuntu SMP Thu Aug 19 23:31:58 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 73dd8a92c19cfd5989dcd0ca61a6f5dfea3d0a97 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/1/testReport/ |
   | Max. process+thread count | 3272 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/1/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




[GitHub] [hadoop] hadoop-yetus commented on pull request #4410: HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas

Posted by GitBox <gi...@apache.org>.
hadoop-yetus commented on PR #4410:
URL: https://github.com/apache/hadoop/pull/4410#issuecomment-1158158938

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 56s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  39m 25s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 39s |  |  trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  compile  |   1m 31s |  |  trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   1m 21s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 40s |  |  trunk passed  |
   | -1 :x: |  javadoc  |   1m 20s | [/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/artifact/out/branch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.txt) |  hadoop-hdfs in trunk failed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.  |
   | +1 :green_heart: |  javadoc  |   1m 44s |  |  trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   3m 43s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  25m 58s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 24s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 30s |  |  the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1  |
   | +1 :green_heart: |  javac  |   1m 30s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 19s |  |  the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 19s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | -0 :warning: |  checkstyle  |   1m  2s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 100 unchanged - 0 fixed = 101 total (was 100)  |
   | +1 :green_heart: |  mvnsite  |   1m 28s |  |  the patch passed  |
   | -1 :x: |  javadoc  |   1m  0s | [/patch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/artifact/out/patch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.txt) |  hadoop-hdfs in the patch failed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1.  |
   | +1 :green_heart: |  javadoc  |   1m 30s |  |  the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   3m 35s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  26m  0s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | +1 :green_heart: |  unit  | 381m 44s |  |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   1m  1s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 498m 26s |  |  |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/4410 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux efcbee072994 4.15.0-166-generic #174-Ubuntu SMP Wed Dec 8 19:07:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 9dd26601ec0cb25a1de4f772e6bff084141bbfb5 |
   | Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
   |  Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/testReport/ |
   | Max. process+thread count | 1965 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4410/2/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org


[GitHub] [hadoop] ashutoshcipher commented on pull request #4410: HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas

Posted by GitBox <gi...@apache.org>.
ashutoshcipher commented on PR #4410:
URL: https://github.com/apache/hadoop/pull/4410#issuecomment-1154595288

   Thanks @KevinWikant for the PR.
   
   We have seen this bug come up a few times in the last few months, and the PR seems to fix it.
   
   Although I am wondering whether there could be a case where a block shows up in both liveReplicas and decommissioning, which could lead to an unnecessary call to invalidateCorruptReplicas().
   
   The edge case that comes to mind is when we are counting the block replica on a decommissioning node as usable, but the very next moment the node finishes decommissioning.
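
   To make the discussion concrete, here is a hedged, standalone sketch of the "usable replicas" idea the PR title describes; this is illustrative only and not the actual BlockManager code, and the class and method names (Sketch, hasEnoughUsableReplicas) are hypothetical:

```java
// Hypothetical illustration of the HDFS-16064 idea: when deciding whether
// corrupt replicas can safely be invalidated, count replicas on
// decommissioning nodes as usable in addition to live replicas, so that
// decommissioning is not blocked by corrupt replicas that can never be
// invalidated. Names here are invented for the sketch, not Hadoop APIs.
public class Sketch {
    /**
     * Returns true if the block has enough usable replicas (live plus
     * decommissioning) to meet its expected redundancy, in which case
     * invalidating the corrupt replicas would be safe.
     */
    static boolean hasEnoughUsableReplicas(int liveReplicas,
                                           int decommissioningReplicas,
                                           int expectedRedundancy) {
        int usable = liveReplicas + decommissioningReplicas;
        return usable >= expectedRedundancy;
    }
}
```

   Under the old behavior, only liveReplicas would be compared against the expected redundancy, so a block whose only extra copies sat on decommissioning nodes could keep its corrupt replicas forever, producing the chicken-and-egg deadlock described in the PR.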




[GitHub] [hadoop] aajisaka commented on pull request #4410: HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas

Posted by GitBox <gi...@apache.org>.
aajisaka commented on PR #4410:
URL: https://github.com/apache/hadoop/pull/4410#issuecomment-1159895754

   Merged. Thank you @KevinWikant for your contribution and thank you @ashutoshcipher @ZanderXu for your review!




[GitHub] [hadoop] aajisaka merged pull request #4410: HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas

Posted by GitBox <gi...@apache.org>.
aajisaka merged PR #4410:
URL: https://github.com/apache/hadoop/pull/4410

