You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/09/18 11:36:42 UTC

[GitHub] [druid] kfaraz opened a new pull request, #13114: Fix over-replication caused by balancing when inventory is not updated yet

kfaraz opened a new pull request, #13114:
URL: https://github.com/apache/druid/pull/13114

   Fixes item (1) in #12881 
   
   This PR must be merged after https://github.com/apache/druid/pull/13074
   
   ### Description
   
   The current implementation of segment balancing always often leads to over-replication.
   
   This can happen in the following (fairly common) scenario:
   - Segment is moving from server A to B for balancing
   - Load succeeds on B, callback is invoked to finish the move
   - If inventory view shows the segment loaded on B, drop on A is invoked
   - If inventory view for B is not updated yet, the drop on A is not invoked
   
   This is a frequent occurrence but under normal circumstances, it is not noticeable
   because load rules quickly drop the over-replicated segment from A. But if load rules get
   stuck for some reason (e.g. trying to reach target replication levels on a different tier),
   the number of over-replicated segments keeps increasing, thereby overloading historicals.
   
   ![over_replication](https://user-images.githubusercontent.com/18635897/183566987-6b3a2be9-05cc-4f70-9e4f-3da9796e2710.png)
   
   ### Changes
   1. For http loading, do not check the server inventory to determine if the drop should be invoked.
   The HTTP response status is enough to determine load success or failure.
   2. For zk-based loading,
       - continue to check the inventory because behaviour of zk loading is a little more error prone
       - remove path check in the callback as it is redundant given that the inventory is already being checked
       - some minor cleanup in the peon
   3. Fix the corresponding test in `SegmentLoadingNegativeTest`
   
   <hr>
   
   This PR has:
   - [ ] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on a diff in pull request #13114: Fix over-replication caused by balancing when inventory is not updated yet

Posted by GitBox <gi...@apache.org>.
kfaraz commented on code in PR #13114:
URL: https://github.com/apache/druid/pull/13114#discussion_r976032416


##########
server/src/test/java/org/apache/druid/server/coordinator/simulate/SegmentLoadingNegativeTest.java:
##########
@@ -217,44 +217,4 @@ public void testLoadOfFullyReplicatedSegmentIsNotCancelled()
     );
   }
 
-  /**
-   * Correct behaviour: Balancing should never cause over-replication, even when
-   * the inventory view is not updated.
-   * <p>
-   * Fix Apache #12881 to fix this test.
-   */
-  @Test
-  public void testBalancingWithStaleInventoryCausesOverReplication()

Review Comment:
   Moved this test to `SegmentLoadingTest.testBalancingWithoutSyncedInventory()`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on a diff in pull request #13114: Fix over-replication caused by balancing when inventory is not updated yet

Posted by GitBox <gi...@apache.org>.
kfaraz commented on code in PR #13114:
URL: https://github.com/apache/druid/pull/13114#discussion_r976032416


##########
server/src/test/java/org/apache/druid/server/coordinator/simulate/SegmentLoadingNegativeTest.java:
##########
@@ -217,44 +217,4 @@ public void testLoadOfFullyReplicatedSegmentIsNotCancelled()
     );
   }
 
-  /**
-   * Correct behaviour: Balancing should never cause over-replication, even when
-   * the inventory view is not updated.
-   * <p>
-   * Fix Apache #12881 to fix this test.
-   */
-  @Test
-  public void testBalancingWithStaleInventoryCausesOverReplication()

Review Comment:
   Moved this test to `SegmentLoadingTest.testBalancingWithStaleViewDoesNotOverReplicate()`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on a diff in pull request #13114: Fix over-replication caused by balancing when inventory is not updated yet

Posted by GitBox <gi...@apache.org>.
kfaraz commented on code in PR #13114:
URL: https://github.com/apache/druid/pull/13114#discussion_r983097339


##########
server/src/main/java/org/apache/druid/server/coordinator/duty/BalanceSegments.java:
##########
@@ -293,26 +292,18 @@ protected boolean moveSegment(
         new ServerHolder(toServer, toPeon).getAvailableSize() > segmentToMove.getSize()) {
       log.debug("Moving [%s] from [%s] to [%s]", segmentId, fromServer.getName(), toServer.getName());
 
-      LoadPeonCallback callback = null;
+      ConcurrentMap<SegmentId, BalancerSegmentHolder> movingSegments =
+          currentlyMovingSegments.get(toServer.getTier());
+      movingSegments.put(segmentId, segment);
+      final LoadPeonCallback callback = moveSuccess -> movingSegments.remove(segmentId);
       try {
-        ConcurrentMap<SegmentId, BalancerSegmentHolder> movingSegments =
-            currentlyMovingSegments.get(toServer.getTier());
-        movingSegments.put(segmentId, segment);
-        callback = () -> movingSegments.remove(segmentId);
-        coordinator.moveSegment(
-            params,
-            fromServer,
-            toServer,
-            segmentToMove,
-            callback
-        );
+        coordinator
+            .moveSegment(params, fromServer, toServer, segmentToMove, callback);
         return true;
       }
       catch (Exception e) {
-        log.makeAlert(e, StringUtils.format("[%s] : Moving exception", segmentId)).emit();
-        if (callback != null) {
-          callback.execute();
-        }
+        log.makeAlert(e, "[%s] : Moving exception", segmentId).emit();
+        callback.execute(false);

Review Comment:
   Yes, just a style change for readability.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on a diff in pull request #13114: Fix over-replication caused by balancing when inventory is not updated yet

Posted by GitBox <gi...@apache.org>.
kfaraz commented on code in PR #13114:
URL: https://github.com/apache/druid/pull/13114#discussion_r976032416


##########
server/src/test/java/org/apache/druid/server/coordinator/simulate/SegmentLoadingNegativeTest.java:
##########
@@ -217,44 +217,4 @@ public void testLoadOfFullyReplicatedSegmentIsNotCancelled()
     );
   }
 
-  /**
-   * Correct behaviour: Balancing should never cause over-replication, even when
-   * the inventory view is not updated.
-   * <p>
-   * Fix Apache #12881 to fix this test.
-   */
-  @Test
-  public void testBalancingWithStaleInventoryCausesOverReplication()

Review Comment:
   Moved this test to `SegmentLoadingTest.testBalancingWithStaleInventoryDoesNotOverReplicate()`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 merged pull request #13114: Fix over-replication caused by balancing when inventory is not updated yet

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 merged PR #13114:
URL: https://github.com/apache/druid/pull/13114


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] imply-cheddar commented on a diff in pull request #13114: Fix over-replication caused by balancing when inventory is not updated yet

Posted by GitBox <gi...@apache.org>.
imply-cheddar commented on code in PR #13114:
URL: https://github.com/apache/druid/pull/13114#discussion_r983078672


##########
server/src/main/java/org/apache/druid/server/coordinator/duty/BalanceSegments.java:
##########
@@ -293,26 +292,18 @@ protected boolean moveSegment(
         new ServerHolder(toServer, toPeon).getAvailableSize() > segmentToMove.getSize()) {
       log.debug("Moving [%s] from [%s] to [%s]", segmentId, fromServer.getName(), toServer.getName());
 
-      LoadPeonCallback callback = null;
+      ConcurrentMap<SegmentId, BalancerSegmentHolder> movingSegments =
+          currentlyMovingSegments.get(toServer.getTier());
+      movingSegments.put(segmentId, segment);
+      final LoadPeonCallback callback = moveSuccess -> movingSegments.remove(segmentId);
       try {
-        ConcurrentMap<SegmentId, BalancerSegmentHolder> movingSegments =
-            currentlyMovingSegments.get(toServer.getTier());
-        movingSegments.put(segmentId, segment);
-        callback = () -> movingSegments.remove(segmentId);
-        coordinator.moveSegment(
-            params,
-            fromServer,
-            toServer,
-            segmentToMove,
-            callback
-        );
+        coordinator
+            .moveSegment(params, fromServer, toServer, segmentToMove, callback);
         return true;
       }
       catch (Exception e) {
-        log.makeAlert(e, StringUtils.format("[%s] : Moving exception", segmentId)).emit();
-        if (callback != null) {
-          callback.execute();
-        }
+        log.makeAlert(e, "[%s] : Moving exception", segmentId).emit();
+        callback.execute(false);

Review Comment:
   Are these changes fixing something or just style?  It looks like just style trying to remove the null check from the callback by moving the work of building it outside of the `try` is that correct? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on pull request #13114: Fix over-replication caused by balancing when inventory is not updated yet

Posted by GitBox <gi...@apache.org>.
kfaraz commented on PR #13114:
URL: https://github.com/apache/druid/pull/13114#issuecomment-1261806662

   Thanks a lot for the review, @imply-cheddar !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org