Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2022/10/14 11:08:33 UTC

[GitHub] [ozone] swamirishi opened a new pull request, #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

swamirishi opened a new pull request, #3836:
URL: https://github.com/apache/ozone/pull/3836

   ## What changes were proposed in this pull request?
   Currently, as part of the basic reconstruction implementation, we relaxed the placement policy satisfaction checks in several places, such as HealthChecks, UnderReplication and OverReplication handling. This Jira tracks enabling the placement policy satisfaction checks.
   
   ## What is the link to the Apache JIRA
   https://issues.apache.org/jira/browse/HDDS-6966
   
   ## How was this patch tested?
   Unit Tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] sodonnel commented on a diff in pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
sodonnel commented on code in PR #3836:
URL: https://github.com/apache/ozone/pull/3836#discussion_r1007837096


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ECUnderReplicationHandler.java:
##########
@@ -353,8 +386,77 @@ private void processMaintenanceOnlyIndexes(
         DatanodeDetails target = iterator.next();
         commands.put(target, replicateCommand);
         additionalMaintenanceCopiesNeeded -= 1;
+        maintIndexes.remove(replica.getReplicaIndex());
+        createdIndexes.add(replica.getReplicaIndex());
+      }
+    }
+  }
+
+  /**
+   * Function processes Replicas to fix placement policy issues.
+   * @param replicas
+   * @param deletionInFlight
+   * @param container
+   * @param excludedNodes
+   * @param sources
+   * @param createdIndexes
+   * @param placementStatus
+   * @throws IOException
+   */
+  private Map<DatanodeDetails, SCMCommand<?>> processMisreplication(
+          Set<ContainerReplica> replicas,
+          List<DatanodeDetails> deletionInFlight, ContainerInfo container,
+          List<DatanodeDetails> excludedNodes,
+          Map<Integer, Pair<ContainerReplica, NodeStatus>> sources,
+          Set<Integer> createdIndexes, ContainerPlacementStatus placementStatus)
+          throws IOException {
+    if (placementStatus.isPolicySatisfied() ||
+            placementStatus.misReplicationCount() <= createdIndexes.size()) {
+      return Collections.emptyMap();
+    }
+    Map<DatanodeDetails, SCMCommand<?>> commands = new HashMap<>();
+    List<Integer> sortedReplicaIdx =

Review Comment:
   The placementPolicy gives us the new target, but I don't understand what this code is doing to select which replica index it is going to copy onto that new target.
   
   Say we have EC 3-2, and the current layout looks like:
   
   ```
   index: 1 Rack: 1
   Index:2 Rack: 2 **
   Index:3 Rack: 2 **
   Index:4 Rack: 4
   Index:5 Rack: 5
   ```
   When we go to the placement policy, it says "here is a new target on rack 3".
   
   That means we need to make a new copy of index 2 or 3 onto rack 3, and then remove the original index 2 or 3, but that is handled later by the over-rep handler. Copying index 1, 4 or 5 to Rack 3 does not improve the placement, as we are still short by a rack.
   
   How does the logic here address that? You seem to be sorting the DNs by duplicate index count first, and then using that below to pick which node to make a new copy from.
   
   





[GitHub] [ozone] swamirishi commented on pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
swamirishi commented on PR #3836:
URL: https://github.com/apache/ozone/pull/3836#issuecomment-1307026924

   @sodonnel I have addressed the review comments. 




[GitHub] [ozone] sodonnel commented on a diff in pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
sodonnel commented on code in PR #3836:
URL: https://github.com/apache/ozone/pull/3836#discussion_r1001691313


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ECUnderReplicationHandler.java:
##########
@@ -353,8 +386,77 @@ private void processMaintenanceOnlyIndexes(
         DatanodeDetails target = iterator.next();
         commands.put(target, replicateCommand);
         additionalMaintenanceCopiesNeeded -= 1;
+        maintIndexes.remove(replica.getReplicaIndex());
+        createdIndexes.add(replica.getReplicaIndex());
+      }
+    }
+  }
+
+  /**
+   * Function processes Replicas to fix placement policy issues.
+   * @param replicas
+   * @param deletionInFlight
+   * @param container
+   * @param excludedNodes
+   * @param sources
+   * @param createdIndexes
+   * @param placementStatus
+   * @throws IOException
+   */
+  private Map<DatanodeDetails, SCMCommand<?>> processMisreplication(
+          Set<ContainerReplica> replicas,
+          List<DatanodeDetails> deletionInFlight, ContainerInfo container,
+          List<DatanodeDetails> excludedNodes,
+          Map<Integer, Pair<ContainerReplica, NodeStatus>> sources,
+          Set<Integer> createdIndexes, ContainerPlacementStatus placementStatus)
+          throws IOException {
+    if (placementStatus.isPolicySatisfied() ||
+            placementStatus.misReplicationCount() <= createdIndexes.size()) {
+      return Collections.emptyMap();
+    }
+    Map<DatanodeDetails, SCMCommand<?>> commands = new HashMap<>();
+    List<Integer> sortedReplicaIdx =

Review Comment:
   I'm not sure what this is doing. Are we sorting the replicas so that if we have, say, 2 copies of index 3, it will be first in the list?
   
   To fix mis-replication, we must have some indexes that reside on the same rack. So if we find a new node on a new rack, a candidate index to move is one from a rack hosting multiple indexes. So I think we need to find any indexes that reside on the same rack and then move one of them to the new target.
   
   Note there could be 2 or more on the same rack.





[GitHub] [ozone] sodonnel commented on pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
sodonnel commented on PR #3836:
URL: https://github.com/apache/ozone/pull/3836#issuecomment-1372169774

   Closing this PR, as we have broken this task down into several smaller tasks which have been committed over time.




[GitHub] [ozone] swamirishi commented on pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
swamirishi commented on PR #3836:
URL: https://github.com/apache/ozone/pull/3836#issuecomment-1307935897

   > ```
   > Rack 1: 1,2
   > Rack 2: 3,4
   > Rack 3: 1
   > Rack 4: 1
   > ```
   > 
   > In this example the over rep handler should remove 2 "1" replicas, from rack1 and then either 3 or 4, leaving:
   > 
   > ```
   > Rack 1: 2
   > Rack 2: 3,4
   > Rack 3: 1
   > Rack 4:
   > ```
   > 
   > However you are correct, in that making an extra copy of "1" doesn't do much good.
   > 
   > I think we need to take a step back here. What does it mean to be mis-replicated?
   > 
   > It means that the replicas are not spread across enough racks. If there are fewer racks on the cluster than the replicaNum, then it is also fine for there to be 2 replicas per rack, for example.
   > 
   > Assuming there are plenty of racks, for a container to be mis-replicated when all the replicas are present, there must be some racks hosting more than 1 replica. You can further extend this, and say, there must be some racks hosting more than 1 replica, where the replica is not also on another rack.
   > 
   > ```
   > R1: 1, 2
   > R2: 1, 2
   > R3: 3
   > R4: 4
   > R5: 5
   > ```
   > 
   > Above is not mis-replicated, it is simply over-replicated.
   > 
   > A significant complication in the solution to this problem is that a container can be both over-replicated and mis-replicated at the same time. If we remove the over-replication part, then the problem becomes simpler, as we can then move any random replica from a rack with more than 1 index.
   > 
   > One idea that is worth thinking about: what if we changed the ECReplicationCheckHandler to return health states in this order:
   > 
   > underReplicated, overReplicated, misReplicated
   > 
   > If a container is both over and mis replicated, rather than its state being mis-replicated (actually under-replicated due to mis-replication), it will return as over-replicated. Once the over-replication gets fixed, it will be re-processed and come out as mis-replicated.
   > 
   > Of course, fixing the mis-replication will cause it to go over-replicated again, but I feel this over + mis-replicated combination will be a relatively rare occurrence in practice.
   > 
   > Alternatively, I wonder if an algorithm like this will work even with over-replication:
   > 
   > ```
   > for each rack_group
   >   if replicas_on_rack > 1
   >     for each replica_on_rack
   >       if (another_copy_of_replica_exists)
   >         continue // do nothing as we don't need to copy it
   >       else
   >         copy_to_new_rack
   >       end if
   >     end
   > end
   > ```
   > 
   > I think this would handle these scenarios:
   > 
   > ```
   > R1: 1, 2
   > R2: 2, 3
   > R3: 1
   > R4: 4
   > R5: 5
   > 
   > R1: 1, 2, 4
   > R2: 2, 3
   > R3: 1
   > R4:
   > R5: 5
   > 
   > R1: 1, 1, 2
   > R2: 2, 3
   > R3: 
   > R4: 4
   > R5: 5
   > ```
   > 
   > If the above works, then we just need 2 maps:
   > 
   > replica_index -> count_of_occurrences; rack -> List<ContainerReplica>
   > 
   > EDIT:
   > 
   > ```
   > R1: 1, 1, 2, 2
   > R2: 
   > R3: 3
   > R4: 4
   > R5: 5
   > ```
   > 
   > This scenario would not work with the above algorithm, as both 1 and 2 have 2 occurrences, so we would skip them both. So we would have to say has_another_occurrence_on_a_different_rack, but then we also have to consider the distinct list on each rack. So if we had:
   > 
   > replica_index -> rack_count; rack -> Set<ReplicaIndex>
   > 
   > ```
   > for each rack_group
   >   if replicas_on_rack > 1
   >     max_to_move = replicas_on_rack - 1
   >     moved_from_rack = 0
   >     for each replica_on_rack
   >       if (replica_on_multiple_racks)
   >         continue // do nothing as we don't need to copy it
   >       else
   >         copy_to_new_rack
   >         increment replica_index_to_rack_count_map
   >         moved_from_rack += 1
   >       end if
   >       if (moved_from_rack >= max_to_move)
   >         break;
   >     end
   >   end
   > end
   > ```
   
   The current algorithm I have implemented takes all of the following edge cases into consideration.
   **ECPlacementManager.getPlacementGroupReplicaIndexMap:**
   The algorithm starts by grouping the replicas by placement group (Map<PlacementGroup, Set<Integer>>).
   **ECPlacementManager.getMisreplicatedIndexes(Map<PlacementGroup, Set<Integer>>)**
   This function starts by building the reverse map, i.e. each replica index's spread across the different placement groups (Map<Integer, Set<PlacementGroup>>).
   
   Now we create two sorted sets, a ConcurrentSkipListSet<PlacementGroup> and a ConcurrentSkipListSet<Integer>, for the placement groups and the replica indexes respectively, each ordered by the size of the corresponding value set.
   
   We pop out any placement group holding <= 1 replica index (this is why we keep a sorted ConcurrentSkipListSet<PlacementGroup>). The replica index coming out of it is not misreplicated, so we remove that index from consideration, i.e. remove it from all the placement groups in the map. We repeat this until no placement group contains <= 1 replica index.
   
   Now we pop out the replica index spread across the fewest placement groups, consider it misreplicated, and likewise remove it from consideration & from all the placement groups.
   This loop runs until there are no more indexes left to process.
   The function returns the indexes to be copied to fix the misreplication.
   
   The existing OverreplicationHandler can fix the over-replication after this.
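   
   Roughly, as a sketch of that loop (reconstructed from this description and the dry run below, so the names, types and tie-breaking here are illustrative rather than the exact patch code):
   
   ```java
   import java.util.ArrayList;
   import java.util.Collections;
   import java.util.Comparator;
   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Map;
   import java.util.Set;
   
   final class MisreplicationSketch {
     // <G> stands in for PlacementGroup. Note: mutates the map passed in.
     static <G> List<Integer> getMisreplicatedIndexes(
         Map<G, Set<Integer>> groupToIndexes) {
       // Reverse map: replica index -> placement groups hosting it.
       Map<Integer, Set<G>> indexToGroups = new HashMap<>();
       for (Map.Entry<G, Set<Integer>> e : groupToIndexes.entrySet()) {
         for (int idx : e.getValue()) {
           indexToGroups.computeIfAbsent(idx, k -> new HashSet<>())
               .add(e.getKey());
         }
       }
       List<Integer> misreplicated = new ArrayList<>();
       while (!indexToGroups.isEmpty()) {
         // Always process the placement group hosting the fewest indexes.
         G smallest = Collections.min(groupToIndexes.keySet(),
             Comparator.comparingInt(g -> groupToIndexes.get(g).size()));
         Set<Integer> hosted = groupToIndexes.remove(smallest);
         if (hosted.size() <= 1) {
           // A group left with a single index: that index is well placed,
           // so retire both the group and the index from consideration.
           for (int idx : hosted) {
             indexToGroups.remove(idx);
             groupToIndexes.values().forEach(s -> s.remove(idx));
           }
         } else {
           // Over-full group: the hosted index spread across the fewest
           // groups is the cheapest to re-create elsewhere, so mark it
           // misreplicated and retire it from consideration.
           int victim = Collections.min(hosted,
               Comparator.comparingInt(i -> indexToGroups.get(i).size()));
           misreplicated.add(victim);
           indexToGroups.remove(victim);
           hosted.remove(victim);
           groupToIndexes.values().forEach(s -> s.remove(victim));
           // Keep the group; it may still hold other misplaced indexes.
           groupToIndexes.put(smallest, hosted);
         }
       }
       return misreplicated;
     }
   }
   ```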
   
   Let us consider the case & do a dry run:
   PlacementGroupReplicaMap
   Rack 1: 1,2
   Rack 2: 3,4
   Rack 3: 5,3
   Rack 4: 1
   
   Here misreplicationCount = 1; over-replicated index = 1, with 2 replicas
   Reverse Replica Map:
   1 : Rack 1, Rack 4
   2: Rack 1
   3: Rack 2, Rack 3
   4: Rack 2
   5: Rack 3
   
   MisreplicatedIndexesList = []
   ConcurrentSkipListSet<PlacementGroup> = Rack4, Rack1, Rack2, Rack3
   ConcurrentSkipListSet<Index> = 2,4,5,1,3
   
   **Iteration 1:**
   Rack 4 Size = 1, Contains Replica Index = 1
   Remove Rack 4 & Replica Index 1 from all Placement Groups
   then:
   PlacementGroupReplicaMap
   Rack 1: 2
   Rack 2: 3,4
   Rack 3: 5,3
   Rack 4: 
   Reverse Replica Map:
   2: Rack 1
   3: Rack 2, Rack 3
   4: Rack 2
   5: Rack 3
   ConcurrentSkipListSet<PlacementGroup> = Rack1, Rack2, Rack3
   ConcurrentSkipListSet<Index> = 2,4,5,3
   MisreplicatedIndexesList = []
   
   **Iteration 2:**
   Rack 1 Size = 1, Contains Replica Index = 2
   Remove Rack 1 & Replica Index 2 from all Placement Groups
   then:
   PlacementGroupReplicaMap
   Rack 1:
   Rack 2: 3,4
   Rack 3: 5,3
   Rack 4: 
   Reverse Replica Map: 
   1:
   2:
   3: Rack 2, Rack 3
   4: Rack 2
   5: Rack 3
   ConcurrentSkipListSet<PlacementGroup> = Rack2, Rack3
   ConcurrentSkipListSet<Index> = 4,5,3
   MisreplicatedIndexesList = []
   
   
   **Iteration 3:**
   Rack 2 Size = 2, Contains Replica Indexes = 3,4
   As size > 1, pop out from the replica index skip list, which gives replica index 4; add it to MisreplicatedIndexesList & remove replica index 4 from all placement groups
   then:
   PlacementGroupReplicaMap
   Rack 1:
   Rack 2: 3
   Rack 3: 5,3
   Rack 4: 
   Reverse Replica Map: 
   1:
   2:
   3: Rack 2, Rack 3
   4: 
   5: Rack 3
   ConcurrentSkipListSet<PlacementGroup> = Rack2, Rack3
   ConcurrentSkipListSet<Index> = 5,3
   MisreplicatedIndexesList = [4]
   
   **Iteration 4:**
   Rack 2 Size = 1, Contains Replica Index = 3
   Remove Rack 2 & Replica Index 3 from all Placement Groups
   then:
   **PlacementGroupReplicaMap**
   Rack 1:
   Rack 2: 
   Rack 3: 5
   Rack 4: 
   **Reverse Replica Map:** 
   1:
   2:
   3: 
   4: 
   5: Rack 3
   **ConcurrentSkipListSet<PlacementGroup>** = Rack3
   **ConcurrentSkipListSet<Index>** = 5
   **MisreplicatedIndexesList** = [4]
   
   **Iteration 5:**
   Rack 3 Size = 1, Contains Replica Index = 5
   Remove Rack 3 & Replica Index 5 from all Placement Groups
   then:
   **PlacementGroupReplicaMap**
   Rack 1:
   Rack 2: 
   Rack 3: 
   Rack 4: 
   **Reverse Replica Map:** 
   1:
   2:
   3: 
   4: 
   5:
   **ConcurrentSkipListSet<PlacementGroup>** =
   **ConcurrentSkipListSet<Index>** =
   **MisreplicatedIndexesList** = [4]
   
   The loop ends since ConcurrentSkipListSet<Index> is empty.
   The function returns MisreplicatedIndexesList.
   The first misreplicationCount elements from this list are taken to be copied to different racks.
   Thus here:
   Replica Index 4 would be copied to Rack 5
   Thus making:
   Rack 1: 1,2
   Rack 2: 3,4
   Rack 3: 5,3
   Rack 4: 1
   Rack 5: 4
   
   The over-replication handler can then delete replica 1 from Rack 1, replica 3 from Rack 3 & replica 4 from Rack 2, making the spread:
   Rack 1: 2
   Rack 2: 3
   Rack 3: 5
   Rack 4: 1
   Rack 5: 4




[GitHub] [ozone] swamirishi commented on a diff in pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
swamirishi commented on code in PR #3836:
URL: https://github.com/apache/ozone/pull/3836#discussion_r1004565273


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ECUnderReplicationHandler.java:
##########
@@ -353,8 +386,77 @@ private void processMaintenanceOnlyIndexes(
         DatanodeDetails target = iterator.next();
         commands.put(target, replicateCommand);
         additionalMaintenanceCopiesNeeded -= 1;
+        maintIndexes.remove(replica.getReplicaIndex());
+        createdIndexes.add(replica.getReplicaIndex());
+      }
+    }
+  }
+
+  /**
+   * Function processes Replicas to fix placement policy issues.
+   * @param replicas
+   * @param deletionInFlight
+   * @param container
+   * @param excludedNodes
+   * @param sources
+   * @param createdIndexes
+   * @param placementStatus
+   * @throws IOException
+   */
+  private Map<DatanodeDetails, SCMCommand<?>> processMisreplication(
+          Set<ContainerReplica> replicas,
+          List<DatanodeDetails> deletionInFlight, ContainerInfo container,
+          List<DatanodeDetails> excludedNodes,
+          Map<Integer, Pair<ContainerReplica, NodeStatus>> sources,
+          Set<Integer> createdIndexes, ContainerPlacementStatus placementStatus)
+          throws IOException {
+    if (placementStatus.isPolicySatisfied() ||
+            placementStatus.misReplicationCount() <= createdIndexes.size()) {
+      return Collections.emptyMap();
+    }
+    Map<DatanodeDetails, SCMCommand<?>> commands = new HashMap<>();
+    List<Integer> sortedReplicaIdx =
+            getSourcesStream(replicas, deletionInFlight)
+                    .collect(Collectors.groupingBy(
+                            r -> r.getLeft().getReplicaIndex(),
+                            Collectors.counting()))
+                    .entrySet().stream()
+                    .sorted(Comparator.comparingLong(Map.Entry::getValue))
+                    .map(Map.Entry::getKey).collect(Collectors.toList());
+    int additionalCopiesForPlacementStatus =
+            placementStatus.misReplicationCount() - createdIndexes.size();
+    List<DatanodeDetails> targets = getTargetDatanodes(excludedNodes, container,

Review Comment:
   This is taken care of by the rack scatter policy, which would fail if the selection doesn't conform to the policy.





[GitHub] [ozone] sodonnel commented on pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
sodonnel commented on PR #3836:
URL: https://github.com/apache/ozone/pull/3836#issuecomment-1307236300

   The code to group the replicas and select those that can be used for an additional copy still seems overly complex.
   
   Really, what we need is a list of containerReplicas where more than one replica is on the same "group". The order of them doesn't matter, and we need to be sure we don't try to move them all from a group, as one should always stay.
   
   I think this code does it, producing a list of containerReplicas where any of them are good to "move":
   
   ```
       Map<Node, List<Pair<ContainerReplica, Node>>> byGroup =
           getSourcesStream(replicas, deletionInFlight).map(Pair::getKey)
               .map(cr -> Pair.of(cr, ((Node) containerPlacement.getPlacementGroup(cr.getDatanodeDetails()))))
               .collect(Collectors.groupingBy(Pair::getRight));
   
       List<ContainerReplica> candidates = new ArrayList<>();
       for (List<Pair<ContainerReplica, Node>> p : byGroup.values()) {
         if (p.size() > 1) {
           // Only groups with more than one entry can help fix mis-replication.
           for (int i = 1; i < p.size(); i++) {
             // Skip the first entry, as we should not move all replicas in a
             // group, only the excess ones. Ideally we would skip a random one,
             // but this is simple and the sets should not be very large anyway.
             candidates.add(p.get(i).getLeft());
           }
         }
       }
   ```
   
   We can discuss this later and see what we can conclude.




[GitHub] [ozone] swamirishi commented on pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
swamirishi commented on PR #3836:
URL: https://github.com/apache/ozone/pull/3836#issuecomment-1307707255

   > The code to group the replicas and select those that can be used for an additional copy still seems overly complex.
   > 
   > Really, what we need is a list of containerReplicas where more than one replica is on the same "group". The order of them doesn't matter, and we need to be sure we don't try to move them all from a group, as one should always stay.
   > 
   > I think this code does it, producing a list of containerReplicas where any of them are good to "move":
   > 
   > ```
   >     Map<Node, List<Pair<ContainerReplica, Node>>> byGroup =
   >         getSourcesStream(replicas, deletionInFlight).map(Pair::getKey)
   >             .map(cr -> Pair.of(cr, ((Node) containerPlacement.getPlacementGroup(cr.getDatanodeDetails()))))
   >             .collect(Collectors.groupingBy(Pair::getRight));
   > 
   >     List<ContainerReplica> candidates = new ArrayList<>();
   >     for (List<Pair<ContainerReplica, Node>> p : byGroup.values()) {
   >       if (p.size() > 1) {
   >         // Only groups with more than one entry can help fix mis-replication.
   >         for (int i = 1; i < p.size(); i++) {
   >           // Skip the first entry, as we should not move all replicas in a
   >           // group, only the excess ones. Ideally we would skip a random one,
   >           // but this is simple and the sets should not be very large anyway.
   >           candidates.add(p.get(i).getLeft());
   >         }
   >       }
   >     }
   > ```
   > 
   > We can discuss this later and see what we can conclude.
   
   Choosing a random replica index might not work.
   Consider:
   Rack 1: 1,2
   Rack 2: 3,4
   Rack 3: 1
   Copying replica index 1 to another rack (Rack 4) will create the following:
   Rack 1: 1,2
   Rack 2: 3,4
   Rack 3: 1
   Rack 4: 1
   Now the container is over-replicated:
   The over-replication handler cannot delete any of the replicas, so this container will always be over-replicated & would be stuck in an infinite loop.
   @sodonnel Does this make sense?




[GitHub] [ozone] sodonnel commented on a diff in pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
sodonnel commented on code in PR #3836:
URL: https://github.com/apache/ozone/pull/3836#discussion_r1001687051


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ECUnderReplicationHandler.java:
##########
@@ -353,8 +386,77 @@ private void processMaintenanceOnlyIndexes(
         DatanodeDetails target = iterator.next();
         commands.put(target, replicateCommand);
         additionalMaintenanceCopiesNeeded -= 1;
+        maintIndexes.remove(replica.getReplicaIndex());
+        createdIndexes.add(replica.getReplicaIndex());
+      }
+    }
+  }
+
+  /**
+   * Function processes Replicas to fix placement policy issues.
+   * @param replicas
+   * @param deletionInFlight
+   * @param container
+   * @param excludedNodes
+   * @param sources
+   * @param createdIndexes
+   * @param placementStatus
+   * @throws IOException
+   */
+  private Map<DatanodeDetails, SCMCommand<?>> processMisreplication(
+          Set<ContainerReplica> replicas,
+          List<DatanodeDetails> deletionInFlight, ContainerInfo container,
+          List<DatanodeDetails> excludedNodes,
+          Map<Integer, Pair<ContainerReplica, NodeStatus>> sources,
+          Set<Integer> createdIndexes, ContainerPlacementStatus placementStatus)
+          throws IOException {
+    if (placementStatus.isPolicySatisfied() ||
+            placementStatus.misReplicationCount() <= createdIndexes.size()) {
+      return Collections.emptyMap();
+    }
+    Map<DatanodeDetails, SCMCommand<?>> commands = new HashMap<>();
+    List<Integer> sortedReplicaIdx =
+            getSourcesStream(replicas, deletionInFlight)
+                    .collect(Collectors.groupingBy(
+                            r -> r.getLeft().getReplicaIndex(),
+                            Collectors.counting()))
+                    .entrySet().stream()
+                    .sorted(Comparator.comparingLong(Map.Entry::getValue))
+                    .map(Map.Entry::getKey).collect(Collectors.toList());
+    int additionalCopiesForPlacementStatus =
+            placementStatus.misReplicationCount() - createdIndexes.size();
+    List<DatanodeDetails> targets = getTargetDatanodes(excludedNodes, container,

Review Comment:
   Is it possible that this call will give some target that does not improve the placement? E.g. for some reason we require 5 racks, but the container is on 4. Then this call returns a new node, but it duplicates one of the racks we are already on, so the placement is still bad. The rack scatter policy is (I think) best-effort at picking racks - it will give a node that does not meet the policy if it cannot find anything better.
   
   So we probably need a check to see if the target improves the situation before scheduling the commands to do the copy.
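   
   As a rough sketch, that guard could look like this (the `validateContainerPlacement` call is the existing placement-policy API for scoring a node list; the method and variable names here are illustrative, not the patch's code):
   
   ```java
   // Only worth scheduling the copy if adding `target` to the nodes the
   // container is already on actually lowers the mis-replication count.
   private boolean improvesPlacement(PlacementPolicy policy,
       List<DatanodeDetails> currentNodes, DatanodeDetails target,
       int requiredNodes, ContainerPlacementStatus currentStatus) {
     List<DatanodeDetails> proposed = new ArrayList<>(currentNodes);
     proposed.add(target);
     ContainerPlacementStatus afterCopy =
         policy.validateContainerPlacement(proposed, requiredNodes);
     return afterCopy.misReplicationCount() < currentStatus.misReplicationCount();
   }
   ```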





[GitHub] [ozone] sodonnel commented on pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
sodonnel commented on PR #3836:
URL: https://github.com/apache/ozone/pull/3836#issuecomment-1307839545

   ```
   Rack 1: 1,2
   Rack 2: 3,4
   Rack 3: 1
   Rack 4: 1
   ```
   In this example the over rep handler should remove 2 "1" replicas, from rack1 and then either 3 or 4, leaving:
   
   ```
   Rack 1: 2
   Rack 2: 3,4
   Rack 3: 1
   Rack 4:
   ```
   
   However you are correct, in that making an extra copy of "1" doesn't do much good.
   
   I think we need to take a step back here. What does it mean to be mis-replicated?
   
   It means that the replicas are not spread across enough racks. If there are fewer racks on the cluster than the replicaNum, then it is also fine for there to be 2 replicas per rack, for example.
   
   Assuming there are plenty of racks, for a container to be mis-replicated when all the replicas are present, there must be some racks hosting more than 1 replica. You can further extend this, and say, there must be some racks hosting more than 1 replica, where the replica is not also on another rack.
   
   ```
   R1: 1, 2
   R2: 1, 2
   R3: 3
   R4: 4
   R5: 5
   ```
   
   Above is not mis-replicated, it is simply over-replicated.
   
   A significant complication in the solution to this problem is that a container can be both over-replicated and mis-replicated at the same time. If we remove the over-replication part, then the problem becomes simpler, as we can then move any random replica from a rack with more than 1 index.
   
   One idea that is worth thinking about: what if we changed the ECReplicationCheckHandler to return health states in this order:
   
   underReplicated
   overReplicated
   misReplicated
   
   If a container is both over and mis replicated, rather than its state being mis-replicated (actually under-replicated due to mis-replication), it will return as over-replicated. Once the over-replication gets fixed, it will be re-processed and come out as mis-replicated.
   
   Of course, fixing the mis-replication will cause it to go over-replicated again, but I feel this over + mis-replicated combination will be a relatively rare occurrence in practice.
   
   Alternatively, I wonder if an algorithm like this will work even with over-replication:
   
   ```
   for each rack_group
     if replicas_on_rack > 1
       for each replica_on_rack
         if (another_copy_of_replica_exists)
           continue // do nothing as we don't need to copy it
         else
           copy_to_new_rack
         end if
       end
   end
   ```
   
   I think this would handle these scenarios:
   
   ```
   R1: 1, 2
   R2: 2, 3
   R3: 1
   R4: 4
   R5: 5
   
   R1: 1, 2, 4
   R2: 2, 3
   R3: 1
   R4:
   R5: 5
   
   R1: 1, 1, 2
   R2: 2, 3
   R3: 
   R4: 4
   R5: 5
   ```
   
   If the above works, then we just need 2 maps:
   
   replica_index -> count_of_occurrences
   rack -> List<ContainerReplica>
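   
   For illustration, a direct Java rendering of that first pseudocode might look like the following, built from the two maps just listed (the method name, map key types and class wrapper are illustrative only):
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;
   
   import org.apache.hadoop.hdds.scm.container.ContainerReplica;
   
   final class CopyCandidateSketch {
     // indexCount: replica_index -> count_of_occurrences
     // byRack:     rack          -> List<ContainerReplica> hosted on it
     static List<ContainerReplica> candidatesToCopy(
         Map<Integer, Long> indexCount,
         Map<String, List<ContainerReplica>> byRack) {
       List<ContainerReplica> toCopy = new ArrayList<>();
       for (List<ContainerReplica> onRack : byRack.values()) {
         if (onRack.size() <= 1) {
           continue; // a single replica on a rack cannot cause mis-replication
         }
         for (ContainerReplica replica : onRack) {
           // another_copy_of_replica_exists: indexes that already have a
           // second copy are left for the over-replication handler.
           if (indexCount.getOrDefault(replica.getReplicaIndex(), 0L) > 1) {
             continue;
           }
           toCopy.add(replica); // copy_to_new_rack for this index
         }
       }
       return toCopy;
     }
   }
   ```
   
   (As the EDIT in the quoted copy of this message further up notes, this simple form skips every index that already has a second copy, so it misses a layout like `R1: 1, 1, 2, 2`; the refined pseudocode there adds a per-rack move budget and tracks rack counts per index instead of raw occurrence counts.)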




[GitHub] [ozone] swamirishi commented on pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
swamirishi commented on PR #3836:
URL: https://github.com/apache/ozone/pull/3836#issuecomment-1309339708

   > Do you believe the algorithm I described will also work? I think it would be implemented much more simply, as I think the current implementation is hard to follow and will be difficult to understand when we come back to it in the future.
   
   This algorithm might not work when all of the indexes are over-replicated, which might be far-fetched but is plausible; if another rack is then added, things might get stuck in an infinite loop.
   Basically, this would fail if there are not enough singly-replicated indexes in comparison to the misreplication count.




[GitHub] [ozone] swamirishi commented on a diff in pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
swamirishi commented on code in PR #3836:
URL: https://github.com/apache/ozone/pull/3836#discussion_r1007143243


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ECUnderReplicationHandler.java:
##########
@@ -353,8 +386,77 @@ private void processMaintenanceOnlyIndexes(
         DatanodeDetails target = iterator.next();
         commands.put(target, replicateCommand);
         additionalMaintenanceCopiesNeeded -= 1;
+        maintIndexes.remove(replica.getReplicaIndex());
+        createdIndexes.add(replica.getReplicaIndex());
+      }
+    }
+  }
+
+  /**
+   * Function processes Replicas to fix placement policy issues.
+   * @param replicas
+   * @param deletionInFlight
+   * @param container
+   * @param excludedNodes
+   * @param sources
+   * @param createdIndexes
+   * @param placementStatus
+   * @throws IOException
+   */
+  private Map<DatanodeDetails, SCMCommand<?>> processMisreplication(
+          Set<ContainerReplica> replicas,
+          List<DatanodeDetails> deletionInFlight, ContainerInfo container,
+          List<DatanodeDetails> excludedNodes,
+          Map<Integer, Pair<ContainerReplica, NodeStatus>> sources,
+          Set<Integer> createdIndexes, ContainerPlacementStatus placementStatus)
+          throws IOException {
+    if (placementStatus.isPolicySatisfied() ||
+            placementStatus.misReplicationCount() <= createdIndexes.size()) {
+      return Collections.emptyMap();
+    }
+    Map<DatanodeDetails, SCMCommand<?>> commands = new HashMap<>();
+    List<Integer> sortedReplicaIdx =

Review Comment:
   We shouldn't handle rack counts in the ReplicationManager; we have offloaded this requirement to the placement policy, where we can choose nodes based on the nodes already used & ask for additional replicas.





[GitHub] [ozone] sodonnel closed pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
sodonnel closed pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling
URL: https://github.com/apache/ozone/pull/3836




[GitHub] [ozone] sodonnel commented on a diff in pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
sodonnel commented on code in PR #3836:
URL: https://github.com/apache/ozone/pull/3836#discussion_r1001670645


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/health/ECReplicationCheckHandler.java:
##########
@@ -119,7 +132,7 @@ public ContainerHealthResult checkHealth(ContainerCheckRequest request) {
       return new ContainerHealthResult.UnderReplicatedHealthResult(
           container, remainingRedundancy, dueToDecommission,
           replicaCount.isSufficientlyReplicated(true),
-          replicaCount.isUnrecoverable());
+          replicaCount.isUnrecoverable(), placementStatus);

Review Comment:
   In the RatisReplicationCheckHandler, I added two fields to the `UnderReplicatedHealthResult`, for `isMisReplicated` and `isMisReplicatedAfterPending` - it would be better if we could use those rather than adding the placementStatus to the `UnderReplicatedHealthResult` object.
   
   At the very least we should be consistent in the approach between the Ratis handler and this one, as they both use the same `UnderReplicatedHealthResult` object, and we could later have common code that processes mis-replication from both Ratis and EC together, as fixing it usually just involves moving containers to a new rack.
   
   Also, note that `RatisReplicationCheckHandler#handle()` increments a counter on the report object for each mis-replicated container, so we need to add that logic here too.
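   
   Roughly, the EC check could then do something like this (a sketch only - the setter names are guessed from the field names above, and `placementStatusAfterPending` is an assumed variable, so treat none of this as the actual class's API):
   
   ```java
   ContainerHealthResult.UnderReplicatedHealthResult result =
       new ContainerHealthResult.UnderReplicatedHealthResult(
           container, remainingRedundancy, dueToDecommission,
           replicaCount.isSufficientlyReplicated(true),
           replicaCount.isUnrecoverable());
   // Carry mis-replication as two booleans, as the Ratis handler does,
   // instead of attaching the whole placementStatus object.
   result.setMisReplicated(!placementStatus.isPolicySatisfied());
   result.setMisReplicatedAfterPending(
       !placementStatusAfterPending.isPolicySatisfied());
   return result;
   ```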





[GitHub] [ozone] aswinshakil commented on pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
aswinshakil commented on PR #3836:
URL: https://github.com/apache/ozone/pull/3836#issuecomment-1285940376

   Looks like the builds and many tests are failing; can you take a look?




[GitHub] [ozone] sodonnel commented on a diff in pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
sodonnel commented on code in PR #3836:
URL: https://github.com/apache/ozone/pull/3836#discussion_r1010318479


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/replication/ECUnderReplicationHandler.java:
##########
@@ -407,35 +408,49 @@ private Map<DatanodeDetails, SCMCommand<?>> processMisreplication(
           List<DatanodeDetails> deletionInFlight, ContainerInfo container,
           List<DatanodeDetails> excludedNodes,
           Map<Integer, Pair<ContainerReplica, NodeStatus>> sources,
-          Set<Integer> createdIndexes,
+          Set<Integer> excludedIndexes,
           ContainerHealthResult.UnderReplicatedHealthResult
                   containerHealthResult)
           throws IOException {
-    if (containerHealthResult.isSufficientlyReplicatedAfterPending() ||
+    if (containerHealthResult.isSufficientlyReplicatedAfterPending() &&
             containerHealthResult.getMisreplicationCountAfterPending()
-                    <= createdIndexes.size()) {
+                    <= excludedIndexes.size()) {
       return Collections.emptyMap();
     }
     Map<DatanodeDetails, SCMCommand<?>> commands = new HashMap<>();
+
     List<Integer> sortedReplicaIdx =
             getSourcesStream(replicas, deletionInFlight)
-                    .collect(Collectors.groupingBy(
-                            r -> r.getLeft().getReplicaIndex(),
-                            Collectors.counting()))
-                    .entrySet().stream()
-                    .sorted(Comparator.comparingLong(Map.Entry::getValue))
-                    .map(Map.Entry::getKey).collect(Collectors.toList());
+            .map(Pair::getKey).collect(Collectors.groupingBy(r ->

Review Comment:
   I've got no idea what this code is doing, and I have stared at it too long already. It needs some comments to explain it at least.





[GitHub] [ozone] sodonnel commented on pull request #3836: HDDS-6966. EC: Handle the placement policy satisfaction in HealthChecks handling

Posted by GitBox <gi...@apache.org>.
sodonnel commented on PR #3836:
URL: https://github.com/apache/ozone/pull/3836#issuecomment-1307964605

   Do you believe the algorithm I described will also work? I think it would be much simpler to implement, and the current implementation is hard to follow and will be difficult to understand when we come back to it in the future.

