You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "sodonnel (via GitHub)" <gi...@apache.org> on 2023/03/29 21:01:26 UTC

[GitHub] [ozone] sodonnel opened a new pull request, #4496: HDDS-8309. ReplicationManager: Throttle EC Reconstruction commands

sodonnel opened a new pull request, #4496:
URL: https://github.com/apache/ozone/pull/4496

   ## What changes were proposed in this pull request?
   
   EC Reconstruction commands should be throttled in a similar way to replicate container commands, so that a limited number of commands can be sent to any given node.
   
   As reconstruction and replication share the same queue on the datanode, the datanode could be filled with a combination of replication and reconstruction commands, so the sendThrottledReplicationCommand method will also need adjusted to consider the number of reconstructions commands queue when checking the limit.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-8309
   
   ## How was this patch tested?
   
   New tests added


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] adoroszlai commented on pull request #4496: HDDS-8309. ReplicationManager: Throttle EC Reconstruction commands

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on PR #4496:
URL: https://github.com/apache/ozone/pull/4496#issuecomment-1489759250

   @sodonnel Seems like the change is causing some failing tests:
   
   ```
   org.apache.hadoop.ozone.container.TestECContainerRecovery
   org.apache.hadoop.ozone.container.TestContainerReplication
   ```
   
   and replication in `ozonesecure`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #4496: HDDS-8309. ReplicationManager: Throttle EC Reconstruction commands

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #4496:
URL: https://github.com/apache/ozone/pull/4496#discussion_r1152526165


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ECUnderReplicationHandler.java:
##########
@@ -315,14 +315,8 @@ private int processMissingIndexes(
                 sourceDatanodesWithIndex, selectedDatanodes,
                 int2byte(missingIndexes),
                 repConfig);
-        replicationManager.sendDatanodeCommand(reconstructionCommand,
-            container, selectedDatanodes.get(0));
-        // For each index that is going to be reconstructed with this command,
-        // adjust the replica count to reflect the pending operation.
-        for (int i = 0; i < missingIndexes.size(); i++) {

Review Comment:
   This code was not adjusting the global pending ops. It was only adjusting the local copy. I change I made only a week or two ago. I will restore it, as it has caused a failing test which luckily caught the fact it should not be removed!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #4496: HDDS-8309. ReplicationManager: Throttle EC Reconstruction commands

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #4496:
URL: https://github.com/apache/ozone/pull/4496#discussion_r1152482688


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ECUnderReplicationHandler.java:
##########
@@ -315,14 +315,8 @@ private int processMissingIndexes(
                 sourceDatanodesWithIndex, selectedDatanodes,
                 int2byte(missingIndexes),
                 repConfig);
-        replicationManager.sendDatanodeCommand(reconstructionCommand,
-            container, selectedDatanodes.get(0));
-        // For each index that is going to be reconstructed with this command,
-        // adjust the replica count to reflect the pending operation.
-        for (int i = 0; i < missingIndexes.size(); i++) {

Review Comment:
   This, I believe, was an existing bug in the handler. The pendingOps are adjusted via the `replicationManager.sendDatanodeCommand()`, so they were being adjust here and then again in RM. I am not sure if that would have caused a problem or not.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on pull request #4496: HDDS-8309. ReplicationManager: Throttle EC Reconstruction commands

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on PR #4496:
URL: https://github.com/apache/ozone/pull/4496#issuecomment-1490425015

   Create https://issues.apache.org/jira/browse/HDDS-8332 to track the improvement.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] adoroszlai commented on a diff in pull request #4496: HDDS-8309. ReplicationManager: Throttle EC Reconstruction commands

Posted by "adoroszlai (via GitHub)" <gi...@apache.org>.
adoroszlai commented on code in PR #4496:
URL: https://github.com/apache/ozone/pull/4496#discussion_r1153236969


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java:
##########
@@ -1160,16 +1183,31 @@ public void setMaintenanceReplicaMinimum(int replicaCount) {
         defaultValue = "20",
         tags = { SCM, DATANODE },
         description = "A limit to restrict the total number of replication " +
-            "commands queued on a datanode. Note this is intended to be a " +
-            " temporary config until we have a more dynamic way of limiting " +
-            "load."
+            "and reconstruction commands queued on a datanode. Note this is " +
+            "intended to be a temporary config until we have a more dynamic " +
+            "way of limiting load."
     )
     private int datanodeReplicationLimit = 20;
 
     public int getDatanodeReplicationLimit() {
       return datanodeReplicationLimit;
     }
 
+    @Config(key = "datanode.reconstruction.weight",
+        type = ConfigType.INT,
+        defaultValue = "3",
+        tags = { SCM, DATANODE },
+        description = "When counting the number of replication commands on a " +
+            "datanode, the number of reconstruction commands is multiplied " +
+            "by this weight to ensure reconstruction commands use more of " +
+            "the capacity, as they are more expensive to process."
+    )
+    private int reconstructionCommandWeight = 3;

Review Comment:
   I think the weight we should use might depend on the EC scheme (3-2 vs. 6-3 vs. 10-4), so I'm not sure we should have a config item for this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #4496: HDDS-8309. ReplicationManager: Throttle EC Reconstruction commands

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel commented on code in PR #4496:
URL: https://github.com/apache/ozone/pull/4496#discussion_r1153297201


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java:
##########
@@ -1160,16 +1183,31 @@ public void setMaintenanceReplicaMinimum(int replicaCount) {
         defaultValue = "20",
         tags = { SCM, DATANODE },
         description = "A limit to restrict the total number of replication " +
-            "commands queued on a datanode. Note this is intended to be a " +
-            " temporary config until we have a more dynamic way of limiting " +
-            "load."
+            "and reconstruction commands queued on a datanode. Note this is " +
+            "intended to be a temporary config until we have a more dynamic " +
+            "way of limiting load."
     )
     private int datanodeReplicationLimit = 20;
 
     public int getDatanodeReplicationLimit() {
       return datanodeReplicationLimit;
     }
 
+    @Config(key = "datanode.reconstruction.weight",
+        type = ConfigType.INT,
+        defaultValue = "3",
+        tags = { SCM, DATANODE },
+        description = "When counting the number of replication commands on a " +
+            "datanode, the number of reconstruction commands is multiplied " +
+            "by this weight to ensure reconstruction commands use more of " +
+            "the capacity, as they are more expensive to process."
+    )
+    private int reconstructionCommandWeight = 3;

Review Comment:
   Yea this is a good point, and warrants more thought. The question is what value should be used. Eg we could use the parity number, or some functions of the nodes involved, but we may still need a config setting to "tune" it. I will create another Jira under our RM improvements epic so we can consider how best to do this.
   
   Its difficult to pick sensible defaults for these sort of settings without testing on real hardware and data, and even then, the best setting probably depends on the hardware capabilities, network and how the data looks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel merged pull request #4496: HDDS-8309. ReplicationManager: Throttle EC Reconstruction commands

Posted by "sodonnel (via GitHub)" <gi...@apache.org>.
sodonnel merged PR #4496:
URL: https://github.com/apache/ozone/pull/4496


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org