Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2022/08/03 11:05:26 UTC

[GitHub] [ozone] sodonnel opened a new pull request, #3649: HDDS-7081. EC: ReplicationManager - UnderRep handler should handle duplicate indexes

sodonnel opened a new pull request, #3649:
URL: https://github.com/apache/ozone/pull/3649

   ## What changes were proposed in this pull request?
   
   When there are two replicas online with the same index, e.g. due to decommissioning, over-replication or maintenance, and the container is under-replicated due to another missing index, an IllegalStateException can be thrown when collecting the source indexes:
   
   ```
   2022-08-02 10:54:26,939 WARN org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler: Exception while processing for creating the EC reconstruction container commands for #2.
   java.lang.IllegalStateException: Duplicate key 3 (attempted merging values ContainerReplica{containerID=#2, state=CLOSED, datanodeDetails=fb63a3c8-2e5b-432e-be63-274c41aab79f{ip: 172.27.124.131, host: quasar-onjdpu-5.quasar-onjdpu.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, placeOfBirth=2ff0ed60-a461-41a4-8fff-9da6cb4e52ad, sequenceId=0, keyCount=1, bytesUsed=34603008,replicaIndex= 3} and ContainerReplica{containerID=#2, state=CLOSED, datanodeDetails=2ff0ed60-a461-41a4-8fff-9da6cb4e52ad{ip: 172.27.193.4, host: quasar-onjdpu-3.quasar-onjdpu.root.hwx.site, ports: [REPLICATION=9886, RATIS=9858, RATIS_ADMIN=9857, RATIS_SERVER=9856, STANDALONE=9859], networkLocation: /default, certSerialId: null, persistedOpState: DECOMMISSIONING, persistedOpStateExpiryEpochSec: 0}, placeOfBirth=2ff0ed60-a461-41a4-8fff-9da6cb4e52ad, sequenceId=0, keyCount=1, bytesUsed=34603008,replicaIndex= 3})
           at java.base/java.util.stream.Collectors.duplicateKeyException(Collectors.java:133)
           at java.base/java.util.stream.Collectors.lambda$uniqKeysMapAccumulator$1(Collectors.java:180)
           at java.base/java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
           at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
           at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
           at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
           at java.base/java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1603)
           at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
           at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
           at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
           at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
           at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
           at org.apache.hadoop.hdds.scm.container.replication.ECUnderReplicationHandler.processAndCreateCommands(ECUnderReplicationHandler.java:151)
           at org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:366)
           at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.processContainer(UnderReplicatedProcessor.java:92)
           at org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.processAll(UnderReplicatedProcessor.java:76)
           at org.apache.hadoop.hdds.scm.ha.BackgroundSCMService.run(BackgroundSCMService.java:101)
           at java.base/java.lang.Thread.run(Thread.java:834)
   ```
   
   This exception then goes unhandled and causes the under-replication processing thread to exit.
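   The failing pattern can be reproduced with a minimal sketch (hypothetical `Replica` record and datanode names, not the actual Ozone classes): `Collectors.toMap` without a merge function throws `IllegalStateException` on duplicate keys, while supplying a merge function lets the collector keep a single replica per index.

   ```java
   import java.util.List;
   import java.util.Map;
   import java.util.stream.Collectors;

   public class DuplicateIndexDemo {
       // Stand-in for ContainerReplica: just an index and a datanode name.
       record Replica(int replicaIndex, String datanode) {}

       public static void main(String[] args) {
           // Two replicas share index 3, e.g. one IN_SERVICE and one DECOMMISSIONING.
           List<Replica> replicas = List.of(
               new Replica(1, "dn-a"),
               new Replica(3, "dn-b"),
               new Replica(3, "dn-c"));

           try {
               // Mirrors the failing pattern: toMap with no merge function
               // throws IllegalStateException on the duplicate index.
               replicas.stream().collect(
                   Collectors.toMap(Replica::replicaIndex, r -> r));
           } catch (IllegalStateException e) {
               System.out.println("toMap failed: " + e.getMessage());
           }

           // One possible fix: supply a merge function to keep a single
           // replica per index (here, the first one seen).
           Map<Integer, Replica> byIndex = replicas.stream().collect(
               Collectors.toMap(Replica::replicaIndex, r -> r, (a, b) -> a));
           System.out.println("indexes: " + byIndex.keySet());
       }
   }
   ```

   In the real handler the merge function would need to prefer the healthier source (e.g. an IN_SERVICE replica over a DECOMMISSIONING one); the sketch only shows why the uniqueness assumption breaks.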
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-7081
   
   ## How was this patch tested?
   
   Added a unit test to reproduce the issue and verify the fix.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel merged pull request #3649: HDDS-7081. EC: ReplicationManager - UnderRep handler should handle duplicate indexes

Posted by GitBox <gi...@apache.org>.
sodonnel merged PR #3649:
URL: https://github.com/apache/ozone/pull/3649




[GitHub] [ozone] umamaheswararao commented on a diff in pull request #3649: HDDS-7081. EC: ReplicationManager - UnderRep handler should handle duplicate indexes

Posted by GitBox <gi...@apache.org>.
umamaheswararao commented on code in PR #3649:
URL: https://github.com/apache/ozone/pull/3649#discussion_r936764238


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/OverReplicatedProcessor.java:
##########
@@ -73,7 +73,7 @@ public void processAll() {
       try {
         processContainer(overRep);
         processed++;
-      } catch (IOException e) {
+      } catch (Exception e) {

Review Comment:
   Should we catch Throwable to be safe? Unless the service itself is stopped, this thread must keep running.





[GitHub] [ozone] umamaheswararao commented on a diff in pull request #3649: HDDS-7081. EC: ReplicationManager - UnderRep handler should handle duplicate indexes

Posted by GitBox <gi...@apache.org>.
umamaheswararao commented on code in PR #3649:
URL: https://github.com/apache/ozone/pull/3649#discussion_r936950996


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/OverReplicatedProcessor.java:
##########
@@ -73,7 +73,7 @@ public void processAll() {
       try {
         processContainer(overRep);
         processed++;
-      } catch (IOException e) {
+      } catch (Exception e) {

Review Comment:
   I just looked at the RM threads in HDFS, and there we catch Throwable. I don't remember exactly how that evolved, but I recall at least one instance where the RM thread died while the NN kept running (in a bad way, of course).
   
   We need to make sure the process goes down after such an event.
   
   ```
   @Override
   public void run() {
     while (namesystem.isRunning()) {
       try {
         // Process recovery work only when active NN is out of safe mode.
         if (isPopulatingReplQueues()) {
           computeDatanodeWork();
           processPendingReconstructions();
           rescanPostponedMisreplicatedBlocks();
         }
         TimeUnit.MILLISECONDS.sleep(redundancyRecheckIntervalMs);
       } catch (Throwable t) {
         if (!namesystem.isRunning()) {
           LOG.info("Stopping RedundancyMonitor.");
           if (!(t instanceof InterruptedException)) {
             LOG.info("RedundancyMonitor received an exception"
                 + " while shutting down.", t);
           }
           break;
         } else if (!checkNSRunning && t instanceof InterruptedException) {
           LOG.info("Stopping RedundancyMonitor for testing.");
           break;
         }
         LOG.error("RedundancyMonitor thread received Runtime exception. ",
             t);
         terminate(1, t);
       }
     }
   }
   ```
   
   I am not saying we should copy this, but this code has survived and evolved through many similar issues in that system.





[GitHub] [ozone] adoroszlai commented on a diff in pull request #3649: HDDS-7081. EC: ReplicationManager - UnderRep handler should handle duplicate indexes

Posted by GitBox <gi...@apache.org>.
adoroszlai commented on code in PR #3649:
URL: https://github.com/apache/ozone/pull/3649#discussion_r936769674


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/OverReplicatedProcessor.java:
##########
@@ -73,7 +73,7 @@ public void processAll() {
       try {
         processContainer(overRep);
         processed++;
-      } catch (IOException e) {
+      } catch (Exception e) {

Review Comment:
   I don't think `OutOfMemoryError` and other errors should be caught, so I think `Exception` is fine.
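   To illustrate the distinction being discussed (a hypothetical processing loop, not the actual OverReplicatedProcessor code): `catch (Exception e)` contains runtime exceptions like the `IllegalStateException` above so the loop survives, while `Error` subclasses such as `OutOfMemoryError` still propagate, which is usually the desired behavior.

   ```java
   public class CatchScopeDemo {
       // Simulates per-container processing; container #2 hits the
       // duplicate-key failure and throws a RuntimeException.
       static void process(int id) {
           if (id == 2) {
               throw new IllegalStateException("Duplicate key for container #" + id);
           }
           System.out.println("processed container #" + id);
       }

       public static void main(String[] args) {
           int processed = 0;
           int failed = 0;
           for (int id = 1; id <= 3; id++) {
               try {
                   process(id);
                   processed++;
               } catch (Exception e) {
                   // RuntimeExceptions are contained here, so the loop keeps
                   // going; an Error (e.g. OutOfMemoryError) would still escape.
                   failed++;
               }
           }
           System.out.println(processed + " processed, " + failed + " failed");
       }
   }
   ```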





[GitHub] [ozone] sodonnel commented on a diff in pull request #3649: HDDS-7081. EC: ReplicationManager - UnderRep handler should handle duplicate indexes

Posted by GitBox <gi...@apache.org>.
sodonnel commented on code in PR #3649:
URL: https://github.com/apache/ozone/pull/3649#discussion_r936959285


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/OverReplicatedProcessor.java:
##########
@@ -73,7 +73,7 @@ public void processAll() {
       try {
         processContainer(overRep);
         processed++;
-      } catch (IOException e) {
+      } catch (Exception e) {

Review Comment:
   Fair enough, I will change this to Throwable.


