You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2022/06/03 09:15:00 UTC

[GitHub] [ozone] szetszwo opened a new pull request, #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

szetszwo opened a new pull request, #3482:
URL: https://github.com/apache/ozone/pull/3482

   See https://issues.apache.org/jira/browse/HDDS-6829


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] szetszwo commented on a diff in pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
szetszwo commented on code in PR #3482:
URL: https://github.com/apache/ozone/pull/3482#discussion_r893997953


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -1450,9 +1537,8 @@ private void sendReplicateCommand(final ContainerInfo container,
     final ContainerID id = container.containerID();
     final ReplicateContainerCommand replicateCommand =
         new ReplicateContainerCommand(id.getId(), sources);
-    inflightReplication.computeIfAbsent(id, k -> new ArrayList<>());
     sendAndTrackDatanodeCommand(datanode, replicateCommand,
-        action -> inflightReplication.get(id).add(action));
+        action -> inflightReplication.add(id, action));

Review Comment:
   Sure, let's add some metric.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on code in PR #3482:
URL: https://github.com/apache/ozone/pull/3482#discussion_r893329852


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -1450,9 +1537,8 @@ private void sendReplicateCommand(final ContainerInfo container,
     final ContainerID id = container.containerID();
     final ReplicateContainerCommand replicateCommand =
         new ReplicateContainerCommand(id.getId(), sources);
-    inflightReplication.computeIfAbsent(id, k -> new ArrayList<>());
     sendAndTrackDatanodeCommand(datanode, replicateCommand,
-        action -> inflightReplication.get(id).add(action));
+        action -> inflightReplication.add(id, action));

Review Comment:
   I guess we should probably check the return status here, and increment some metric if we are skipping the replication for now?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on PR #3482:
URL: https://github.com/apache/ozone/pull/3482#issuecomment-1150467728

   > Are the any confs in the new ReplicationManager related to inflight replication and inflight deletion?
   
   No, not as yet. The plan is to use the currently queued command count on the DNs to limit the replication commands in the new RM. It will find all the under-replicated container, prioritize them by remaining redundancy and then schedule them over time. The plan is to have a per-dn limit, so faster DNs can accept more work perhaps. We have not yet worked out the low level details, but that is the high level thinking.
   
   The new replication manager is going to use a new class to track inflight operations, called ContainerReplicaPendingOps. That part is already committed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on code in PR #3482:
URL: https://github.com/apache/ozone/pull/3482#discussion_r893336729


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -102,6 +104,98 @@ public class LegacyReplicationManager {
   public static final Logger LOG =
       LoggerFactory.getLogger(LegacyReplicationManager.class);
 
+  enum Inflight {REPLICATION, DELETION}
+
+  static class InflightMap {
+    private final Map<ContainerID, List<InflightAction>> map
+        = new ConcurrentHashMap<>();
+    private final Inflight type;
+    private final int actionSizeLimit;
+
+    InflightMap(Inflight type, int actionSizeLimit) {
+      this.type = type;
+      this.actionSizeLimit = actionSizeLimit;
+    }
+
+    boolean isReplication() {
+      return type == Inflight.REPLICATION;
+    }
+
+    private List<InflightAction> get(ContainerID id) {
+      return map.get(id);
+    }
+
+    boolean containsKey(ContainerID id) {
+      return map.containsKey(id);
+    }
+
+    int size(ContainerID id) {
+      return Optional.ofNullable(map.get(id)).map(List::size).orElse(0);
+    }
+
+    int size() {
+      return map.size();
+    }
+
+    void clear() {
+      map.clear();
+    }
+
+    void iterate(ContainerID id, Function<InflightAction, Boolean> processor) {
+      for(;;) {
+        final List<InflightAction> actions = get(id);
+        if (actions == null) {
+          return;
+        }
+        synchronized (actions) {
+          if (get(id) != actions) {
+            continue; //actions is removed, retry
+          }
+          for (Iterator<InflightAction> i = actions.iterator(); i.hasNext(); ) {
+            final Boolean remove = processor.apply(i.next());
+            if (remove == Boolean.TRUE) {
+              i.remove();
+            }
+          }
+          map.computeIfPresent(id, (k, v) -> v == actions && v.isEmpty() ? null : v);
+        }
+      }
+    }
+
+    boolean add(ContainerID id, InflightAction a) {
+      for(;;) {
+        final List<InflightAction> actions = map.computeIfAbsent(id,
+            key -> new LinkedList<>());
+        synchronized (actions) {
+          if (get(id) != actions) {
+            continue; //actions is removed, retry
+          }
+          if (actions.size() >= actionSizeLimit) {

Review Comment:
   The limit here is based on the number of actions for an individual container I think? Do we not want to limit on the total number of inflight actions across all containers? As a crude estimate it would be `map.size()`, although each entry could potentially have several replications against it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] szetszwo commented on a diff in pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
szetszwo commented on code in PR #3482:
URL: https://github.com/apache/ozone/pull/3482#discussion_r896980706


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -102,6 +105,111 @@ public class LegacyReplicationManager {
   public static final Logger LOG =
       LoggerFactory.getLogger(LegacyReplicationManager.class);
 
+  static class InflightMap {
+    private final Map<ContainerID, List<InflightAction>> map
+        = new ConcurrentHashMap<>();
+    private final InflightType type;
+    private final int sizeLimit;
+    private final AtomicInteger inflightCount = new AtomicInteger();
+
+    InflightMap(InflightType type, int sizeLimit) {
+      this.type = type;
+      this.sizeLimit = sizeLimit > 0 ? sizeLimit : Integer.MAX_VALUE;
+    }
+
+    boolean isReplication() {
+      return type == InflightType.REPLICATION;
+    }
+
+    private List<InflightAction> get(ContainerID id) {
+      return map.get(id);
+    }
+
+    boolean containsKey(ContainerID id) {
+      return map.containsKey(id);
+    }
+
+    int inflightActionCount(ContainerID id) {
+      return Optional.ofNullable(map.get(id)).map(List::size).orElse(0);
+    }
+
+    int containerCount() {
+      return map.size();
+    }
+
+    boolean isFull() {
+      return inflightCount.get() >= sizeLimit;
+    }
+
+    void clear() {
+      map.clear();
+    }
+
+    void iterate(ContainerID id, Function<InflightAction, Boolean> processor) {
+      for (; ;) {
+        final List<InflightAction> actions = get(id);
+        if (actions == null) {
+          return;
+        }
+        synchronized (actions) {
+          if (get(id) != actions) {
+            continue; //actions is changed, retry
+          }
+          for (Iterator<InflightAction> i = actions.iterator(); i.hasNext();) {
+            final Boolean remove = processor.apply(i.next());
+            if (remove == Boolean.TRUE) {

Review Comment:
   @adoroszlai , thanks for pointing out the findbugs warning.  I kept checking all the highlighted Warnings like this https://github.com/apache/ozone/runs/6874432894?check_suite_focus=true#step:5:1669 but missed the real findbugs warning was not highlighted https://github.com/apache/ozone/runs/6874432894?check_suite_focus=true#step:5:1726
   
   And yes, windbags is findbugs after the auto spelling correction.  :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] szetszwo commented on pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
szetszwo commented on PR #3482:
URL: https://github.com/apache/ozone/pull/3482#issuecomment-1150429434

   @sodonnel , checked the code.  It actually is not hard to limit the inflight replications and the inflight deletions.  Since the new ReplicationManager is coming, we don't want to over engineer the legacy code.  Also, we should make sure the new confs added here can be used in the new ReplicationManager.
   
   Are the any confs in the new ReplicationManager related to inflight replication and inflight deletion?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] szetszwo commented on pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
szetszwo commented on PR #3482:
URL: https://github.com/apache/ozone/pull/3482#issuecomment-1150680410

   @sodonnel , please take a look of the current change.  If it is good, I will add some new tests for testing the inflight limits.  Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] szetszwo commented on pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
szetszwo commented on PR #3482:
URL: https://github.com/apache/ozone/pull/3482#issuecomment-1150600052

   > [Inflight limit](https://github.com/apache/ozone/pull/3482/commits/51f05cddf4f668ab0517a5cedc29a36ef9e1855c)
    
   Note that this is still work-in-progress.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] szetszwo commented on pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
szetszwo commented on PR #3482:
URL: https://github.com/apache/ozone/pull/3482#issuecomment-1155299401

   The windbags failure does not seem related to this.
   
   @sodonnel , could you review the change?  Thank you in advance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on PR #3482:
URL: https://github.com/apache/ozone/pull/3482#issuecomment-1148487383

   I have a couple of thoughts here:
   
   1. Our main concern is pending replications. The replication manager also processes over-replication, as well as handling containers which need to be closed, quasi-closed, unhealthy etc. Putting the limiter where it is in this change, delays all those things from happening too. It will also impact EC containers which are going to go through a new code path and hence will not face the same problem. 
   
   2. There is a replication Manager Report in place, and if we only process some of the containers, the report, which is only updated after each full replication manager run will not have the correct numbers.
   
   3. This solution is a temporary measure until we get the new replication manager ready, and it only really affects things inside the LegacyReplicationManager. I think it would be better if we confined this solution to within the LegacyReplicationManager, and will allow us to develop the new version freely.
   
   I think we would be better placing a limit on the pending in-flight replication tasks, rather than a limit on the number of containers processed. That way the replication manager will still process over replication and all the other health check tasks, but we can skip scheduling a replication for under-replication if there are too many pending already. The report can also be populated fully, with over / under replicated counts, even if all the under replication tasks are not scheduled.
   
   It would also be good to count up how many were skipped on each iteration if possible so it can be logged to give some insight into what is happening. It might be slightly tricky to do this the way things are currently structured, so that would be nice to have.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] szetszwo commented on a diff in pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
szetszwo commented on code in PR #3482:
URL: https://github.com/apache/ozone/pull/3482#discussion_r894007739


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -102,6 +104,98 @@ public class LegacyReplicationManager {
   public static final Logger LOG =
       LoggerFactory.getLogger(LegacyReplicationManager.class);
 
+  enum Inflight {REPLICATION, DELETION}
+
+  static class InflightMap {
+    private final Map<ContainerID, List<InflightAction>> map
+        = new ConcurrentHashMap<>();
+    private final Inflight type;
+    private final int actionSizeLimit;
+
+    InflightMap(Inflight type, int actionSizeLimit) {
+      this.type = type;
+      this.actionSizeLimit = actionSizeLimit;
+    }
+
+    boolean isReplication() {
+      return type == Inflight.REPLICATION;
+    }
+
+    private List<InflightAction> get(ContainerID id) {
+      return map.get(id);
+    }
+
+    boolean containsKey(ContainerID id) {
+      return map.containsKey(id);
+    }
+
+    int size(ContainerID id) {
+      return Optional.ofNullable(map.get(id)).map(List::size).orElse(0);
+    }
+
+    int size() {
+      return map.size();
+    }
+
+    void clear() {
+      map.clear();
+    }
+
+    void iterate(ContainerID id, Function<InflightAction, Boolean> processor) {
+      for(;;) {
+        final List<InflightAction> actions = get(id);
+        if (actions == null) {
+          return;
+        }
+        synchronized (actions) {
+          if (get(id) != actions) {
+            continue; //actions is removed, retry
+          }
+          for (Iterator<InflightAction> i = actions.iterator(); i.hasNext(); ) {
+            final Boolean remove = processor.apply(i.next());
+            if (remove == Boolean.TRUE) {
+              i.remove();
+            }
+          }
+          map.computeIfPresent(id, (k, v) -> v == actions && v.isEmpty() ? null : v);
+        }
+      }
+    }
+
+    boolean add(ContainerID id, InflightAction a) {
+      for(;;) {
+        final List<InflightAction> actions = map.computeIfAbsent(id,
+            key -> new LinkedList<>());
+        synchronized (actions) {
+          if (get(id) != actions) {
+            continue; //actions is removed, retry
+          }
+          if (actions.size() >= actionSizeLimit) {

Review Comment:
   Sure.  Let me see how to do it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] szetszwo commented on a diff in pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
szetszwo commented on code in PR #3482:
URL: https://github.com/apache/ozone/pull/3482#discussion_r896982436


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -102,6 +105,111 @@ public class LegacyReplicationManager {
   public static final Logger LOG =
       LoggerFactory.getLogger(LegacyReplicationManager.class);
 
+  static class InflightMap {
+    private final Map<ContainerID, List<InflightAction>> map
+        = new ConcurrentHashMap<>();
+    private final InflightType type;
+    private final int sizeLimit;
+    private final AtomicInteger inflightCount = new AtomicInteger();
+
+    InflightMap(InflightType type, int sizeLimit) {
+      this.type = type;
+      this.sizeLimit = sizeLimit > 0 ? sizeLimit : Integer.MAX_VALUE;
+    }
+
+    boolean isReplication() {
+      return type == InflightType.REPLICATION;
+    }
+
+    private List<InflightAction> get(ContainerID id) {
+      return map.get(id);
+    }
+
+    boolean containsKey(ContainerID id) {
+      return map.containsKey(id);
+    }
+
+    int inflightActionCount(ContainerID id) {
+      return Optional.ofNullable(map.get(id)).map(List::size).orElse(0);
+    }
+
+    int containerCount() {
+      return map.size();
+    }
+
+    boolean isFull() {
+      return inflightCount.get() >= sizeLimit;
+    }
+
+    void clear() {
+      map.clear();
+    }
+
+    void iterate(ContainerID id, Function<InflightAction, Boolean> processor) {
+      for (; ;) {
+        final List<InflightAction> actions = get(id);
+        if (actions == null) {
+          return;
+        }
+        synchronized (actions) {
+          if (get(id) != actions) {
+            continue; //actions is changed, retry
+          }
+          for (Iterator<InflightAction> i = actions.iterator(); i.hasNext();) {
+            final Boolean remove = processor.apply(i.next());
+            if (remove == Boolean.TRUE) {

Review Comment:
   Also, Predicate sounds great!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] szetszwo merged pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
szetszwo merged PR #3482:
URL: https://github.com/apache/ozone/pull/3482


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] szetszwo commented on pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
szetszwo commented on PR #3482:
URL: https://github.com/apache/ozone/pull/3482#issuecomment-1150507012

   Then, I would suggest to have the following confs:
   ```
       @Config(key = "container.max",
           type = ConfigType.INT,
           defaultValue = "0", // 0 means unlimited.
           tags = {SCM, OZONE},
           description = "This property is used to limit the maximum number " +
               "of containers to process in each loop.";
       )
       private int containerMax = 0;
   
       @Config(key = "container.inflight.replication.limit",
           type = ConfigType.INT,
           defaultValue = "0", // 0 means unlimited.
           tags = {SCM, OZONE},
           description = "This property is used to limit the maximum number " +
               "of inflight replication in each containers."
       )
       private int containerInflightReplicationLimit = 0;
   
       @Config(key = "container.inflight.deletion.limit",
           type = ConfigType.INT,
           defaultValue = "0", // 0 means unlimited.
           tags = {SCM, OZONE},
           description = "This property is used to limit the maximum number " +
               "of inflight deletion in each containers."
       )
       private int containerInflightDeletionLimit = 0;
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] adoroszlai commented on a diff in pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
adoroszlai commented on code in PR #3482:
URL: https://github.com/apache/ozone/pull/3482#discussion_r896936558


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -102,6 +105,111 @@ public class LegacyReplicationManager {
   public static final Logger LOG =
       LoggerFactory.getLogger(LegacyReplicationManager.class);
 
+  static class InflightMap {
+    private final Map<ContainerID, List<InflightAction>> map
+        = new ConcurrentHashMap<>();
+    private final InflightType type;
+    private final int sizeLimit;
+    private final AtomicInteger inflightCount = new AtomicInteger();
+
+    InflightMap(InflightType type, int sizeLimit) {
+      this.type = type;
+      this.sizeLimit = sizeLimit > 0 ? sizeLimit : Integer.MAX_VALUE;
+    }
+
+    boolean isReplication() {
+      return type == InflightType.REPLICATION;
+    }
+
+    private List<InflightAction> get(ContainerID id) {
+      return map.get(id);
+    }
+
+    boolean containsKey(ContainerID id) {
+      return map.containsKey(id);
+    }
+
+    int inflightActionCount(ContainerID id) {
+      return Optional.ofNullable(map.get(id)).map(List::size).orElse(0);
+    }
+
+    int containerCount() {
+      return map.size();
+    }
+
+    boolean isFull() {
+      return inflightCount.get() >= sizeLimit;
+    }
+
+    void clear() {
+      map.clear();
+    }
+
+    void iterate(ContainerID id, Function<InflightAction, Boolean> processor) {
+      for (; ;) {
+        final List<InflightAction> actions = get(id);
+        if (actions == null) {
+          return;
+        }
+        synchronized (actions) {
+          if (get(id) != actions) {
+            continue; //actions is changed, retry
+          }
+          for (Iterator<InflightAction> i = actions.iterator(); i.hasNext();) {
+            final Boolean remove = processor.apply(i.next());
+            if (remove == Boolean.TRUE) {

Review Comment:
   > The windbags failure does not seem related to this.
   
   `M B RC: Suspicious comparison of Boolean references in org.apache.hadoop.hdds.scm.container.replication.LegacyReplicationManager$InflightMap.iterate(ContainerID, Function)  At LegacyReplicationManager.java:[line 160]`
   
   I think it is related, assuming windbags == findbugs. :)
   
   Maybe I'm missing something, but can we use `Predicate<InflightAction>` instead of `Function<InflightAction, Boolean>`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] szetszwo commented on a diff in pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
szetszwo commented on code in PR #3482:
URL: https://github.com/apache/ozone/pull/3482#discussion_r897320708


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -102,6 +105,111 @@ public class LegacyReplicationManager {
   public static final Logger LOG =
       LoggerFactory.getLogger(LegacyReplicationManager.class);
 
+  static class InflightMap {
+    private final Map<ContainerID, List<InflightAction>> map
+        = new ConcurrentHashMap<>();
+    private final InflightType type;
+    private final int sizeLimit;
+    private final AtomicInteger inflightCount = new AtomicInteger();
+
+    InflightMap(InflightType type, int sizeLimit) {
+      this.type = type;
+      this.sizeLimit = sizeLimit > 0 ? sizeLimit : Integer.MAX_VALUE;
+    }
+
+    boolean isReplication() {
+      return type == InflightType.REPLICATION;
+    }
+
+    private List<InflightAction> get(ContainerID id) {
+      return map.get(id);
+    }
+
+    boolean containsKey(ContainerID id) {
+      return map.containsKey(id);
+    }
+
+    int inflightActionCount(ContainerID id) {
+      return Optional.ofNullable(map.get(id)).map(List::size).orElse(0);
+    }
+
+    int containerCount() {
+      return map.size();
+    }
+
+    boolean isFull() {
+      return inflightCount.get() >= sizeLimit;
+    }
+
+    void clear() {
+      map.clear();
+    }
+
+    void iterate(ContainerID id, Function<InflightAction, Boolean> processor) {
+      for (; ;) {
+        final List<InflightAction> actions = get(id);
+        if (actions == null) {
+          return;
+        }
+        synchronized (actions) {
+          if (get(id) != actions) {
+            continue; //actions is changed, retry
+          }
+          for (Iterator<InflightAction> i = actions.iterator(); i.hasNext();) {
+            final Boolean remove = processor.apply(i.next());
+            if (remove == Boolean.TRUE) {

Review Comment:
   Good to know.  Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] adoroszlai commented on a diff in pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
adoroszlai commented on code in PR #3482:
URL: https://github.com/apache/ozone/pull/3482#discussion_r896997515


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -102,6 +105,111 @@ public class LegacyReplicationManager {
   public static final Logger LOG =
       LoggerFactory.getLogger(LegacyReplicationManager.class);
 
+  static class InflightMap {
+    private final Map<ContainerID, List<InflightAction>> map
+        = new ConcurrentHashMap<>();
+    private final InflightType type;
+    private final int sizeLimit;
+    private final AtomicInteger inflightCount = new AtomicInteger();
+
+    InflightMap(InflightType type, int sizeLimit) {
+      this.type = type;
+      this.sizeLimit = sizeLimit > 0 ? sizeLimit : Integer.MAX_VALUE;
+    }
+
+    boolean isReplication() {
+      return type == InflightType.REPLICATION;
+    }
+
+    private List<InflightAction> get(ContainerID id) {
+      return map.get(id);
+    }
+
+    boolean containsKey(ContainerID id) {
+      return map.containsKey(id);
+    }
+
+    int inflightActionCount(ContainerID id) {
+      return Optional.ofNullable(map.get(id)).map(List::size).orElse(0);
+    }
+
+    int containerCount() {
+      return map.size();
+    }
+
+    boolean isFull() {
+      return inflightCount.get() >= sizeLimit;
+    }
+
+    void clear() {
+      map.clear();
+    }
+
+    void iterate(ContainerID id, Function<InflightAction, Boolean> processor) {
+      for (; ;) {
+        final List<InflightAction> actions = get(id);
+        if (actions == null) {
+          return;
+        }
+        synchronized (actions) {
+          if (get(id) != actions) {
+            continue; //actions is changed, retry
+          }
+          for (Iterator<InflightAction> i = actions.iterator(); i.hasNext();) {
+            final Boolean remove = processor.apply(i.next());
+            if (remove == Boolean.TRUE) {

Review Comment:
   The workflow has a "summary of failures" section which greps for the real findbugs problems, to avoid the need for checking the complete output manually:
   
   https://github.com/apache/ozone/runs/6874432894#step:6:8



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on code in PR #3482:
URL: https://github.com/apache/ozone/pull/3482#discussion_r893353018


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -102,6 +104,98 @@ public class LegacyReplicationManager {
   public static final Logger LOG =
       LoggerFactory.getLogger(LegacyReplicationManager.class);
 
+  enum Inflight {REPLICATION, DELETION}
+
+  static class InflightMap {
+    private final Map<ContainerID, List<InflightAction>> map
+        = new ConcurrentHashMap<>();
+    private final Inflight type;
+    private final int actionSizeLimit;
+
+    InflightMap(Inflight type, int actionSizeLimit) {
+      this.type = type;
+      this.actionSizeLimit = actionSizeLimit;
+    }
+
+    boolean isReplication() {
+      return type == Inflight.REPLICATION;
+    }
+
+    private List<InflightAction> get(ContainerID id) {
+      return map.get(id);
+    }
+
+    boolean containsKey(ContainerID id) {
+      return map.containsKey(id);
+    }
+
+    int size(ContainerID id) {
+      return Optional.ofNullable(map.get(id)).map(List::size).orElse(0);
+    }
+
+    int size() {
+      return map.size();
+    }
+
+    void clear() {
+      map.clear();
+    }
+
+    void iterate(ContainerID id, Function<InflightAction, Boolean> processor) {
+      for(;;) {
+        final List<InflightAction> actions = get(id);
+        if (actions == null) {
+          return;
+        }
+        synchronized (actions) {
+          if (get(id) != actions) {
+            continue; //actions is removed, retry
+          }
+          for (Iterator<InflightAction> i = actions.iterator(); i.hasNext(); ) {
+            final Boolean remove = processor.apply(i.next());
+            if (remove == Boolean.TRUE) {
+              i.remove();
+            }
+          }
+          map.computeIfPresent(id, (k, v) -> v == actions && v.isEmpty() ? null : v);
+        }
+      }
+    }
+
+    boolean add(ContainerID id, InflightAction a) {
+      for(;;) {
+        final List<InflightAction> actions = map.computeIfAbsent(id,
+            key -> new LinkedList<>());
+        synchronized (actions) {
+          if (get(id) != actions) {
+            continue; //actions is removed, retry
+          }
+          if (actions.size() >= actionSizeLimit) {

Review Comment:
   I also wonder if we would be better checking the size in `handleUnderReplicatedContainer()` and skipping it before doing all the work to find a new target etc. If there is no capacity to schedule a replica, they we ,ay as well skip the work too.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
sodonnel commented on code in PR #3482:
URL: https://github.com/apache/ozone/pull/3482#discussion_r893325377


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java:
##########
@@ -219,6 +227,89 @@ public synchronized void stop() {
     }
   }
 
+  /** A stream of {@link ContainerInfo} for supporting circular iterations. */

Review Comment:
   If we limit the inflight replications, I don't think we need this change any longer to limit the number of containers we process. We should always just process all containers and then skip adding replications there are no space for in the queue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] szetszwo commented on pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
szetszwo commented on PR #3482:
URL: https://github.com/apache/ozone/pull/3482#issuecomment-1149457124

   @sodonnel , thanks for the comments!  Let me see how to change the code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] szetszwo commented on a diff in pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
szetszwo commented on code in PR #3482:
URL: https://github.com/apache/ozone/pull/3482#discussion_r893988177


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ReplicationManager.java:
##########
@@ -219,6 +227,89 @@ public synchronized void stop() {
     }
   }
 
+  /** A stream of {@link ContainerInfo} for supporting circular iterations. */

Review Comment:
   Sure.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] szetszwo commented on a diff in pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
szetszwo commented on code in PR #3482:
URL: https://github.com/apache/ozone/pull/3482#discussion_r896980706


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/LegacyReplicationManager.java:
##########
@@ -102,6 +105,111 @@ public class LegacyReplicationManager {
   public static final Logger LOG =
       LoggerFactory.getLogger(LegacyReplicationManager.class);
 
+  static class InflightMap {
+    private final Map<ContainerID, List<InflightAction>> map
+        = new ConcurrentHashMap<>();
+    private final InflightType type;
+    private final int sizeLimit;
+    private final AtomicInteger inflightCount = new AtomicInteger();
+
+    InflightMap(InflightType type, int sizeLimit) {
+      this.type = type;
+      this.sizeLimit = sizeLimit > 0 ? sizeLimit : Integer.MAX_VALUE;
+    }
+
+    boolean isReplication() {
+      return type == InflightType.REPLICATION;
+    }
+
+    private List<InflightAction> get(ContainerID id) {
+      return map.get(id);
+    }
+
+    boolean containsKey(ContainerID id) {
+      return map.containsKey(id);
+    }
+
+    int inflightActionCount(ContainerID id) {
+      return Optional.ofNullable(map.get(id)).map(List::size).orElse(0);
+    }
+
+    int containerCount() {
+      return map.size();
+    }
+
+    boolean isFull() {
+      return inflightCount.get() >= sizeLimit;
+    }
+
+    void clear() {
+      map.clear();
+    }
+
+    void iterate(ContainerID id, Function<InflightAction, Boolean> processor) {
+      for (; ;) {
+        final List<InflightAction> actions = get(id);
+        if (actions == null) {
+          return;
+        }
+        synchronized (actions) {
+          if (get(id) != actions) {
+            continue; //actions is changed, retry
+          }
+          for (Iterator<InflightAction> i = actions.iterator(); i.hasNext();) {
+            final Boolean remove = processor.apply(i.next());
+            if (remove == Boolean.TRUE) {

Review Comment:
   @adoroszlai , thanks for pointing out the findbugs warning.  I kept checking all the highlighted Warnings like this https://github.com/apache/ozone/runs/6874432894?check_suite_focus=true#step:5:1669 but missed the real findbugs warning which was not highlighted https://github.com/apache/ozone/runs/6874432894?check_suite_focus=true#step:5:1726
   
   And yes, windbags is findbugs after the auto spelling correction.  :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] szetszwo commented on pull request #3482: HDDS-6829. Limit the no of inflight replication tasks in SCM.

Posted by GitBox <gi...@apache.org>.
szetszwo commented on PR #3482:
URL: https://github.com/apache/ozone/pull/3482#issuecomment-1157854593

   Thanks @sodonnel and @adoroszlai for reviewing this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org