You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2019/02/28 13:37:24 UTC

[GitHub] leventov opened a new issue #7159: On "balancing burst, then wait total loading" pattern and replication/loading cap in Coordinator balancing

leventov opened a new issue #7159: On "balancing burst, then wait total loading" pattern and replication/loading cap in Coordinator balancing
URL: https://github.com/apache/incubator-druid/issues/7159

Druid's segment balancing logic is split between two classes: `DruidCoordinatorBalancer` and `LoadRule`. They both avoid skip segment loading and balancing until all their actions from the previous non-skipped run (resulted in a _balancing burst_) are completely settled down (per tier), i. e. there are no loading segments on any of the historical nodes in some tier.

The "hard" reason for obeying this pattern in `LoadRule` is that `SegmentReplicantLookup` is effectively immutable an reconstructed in the beginning of a Coordinator's cycle (see code in `CoordinatorHistoricalManagerRunnable`).

The reason for obeying this pattern in `DruidCoordinatorBalancer` is not documented anywhere, but as far as I can tell, that's done to make balacing decisions more optimal: we quickly fill up loading queues of the "universally best" historical nodes (e. g. because they've just entered the cluster and are almost empty yet) and are forced to move segments to less optimal nodes. (This is also probably the primary criterion for choosing `maxSegmentsToMove` configuration value, although the documentation is silent about this.) This "soft" reason is implicitly in play in `LoadRule`, too.

The problem is that `LoadRule` and `DruidCoordinatorBalancer` use independent mechanism to implement the "burst - wait" pattern: `LoadRule` uses `ReplicationThrottler` (implicitly in [`canCreateReplicant()`](https://github.com/apache/incubator-druid/blob/e432965c13ef7799598d1ec2f88eb665b8fceaf5/server/src/main/java/org/apache/druid/server/coordinator/ReplicationThrottler.java#L88) method), `DruidCoordinatorBalancer` uses it's own `currentlyMovingSegments` field.

As you can notice from `ReplicationThrottler`'s name, it's also responsible for capping the number of segments that can be moved in `LoadRule` during a single burst, via `replicationThrottleLimit` configuration. In other words, `replicationThrottleLimit` has exactly the same purpose as `maxSegmentsToMove`. But they don't mention each other and they effectively add up to each other.

I see the solution in gathering all balancing logic to a single class, deprecate either `replicationThrottleLimit` or `maxSegmentsToMove` in favor of the other one as the unified limit for balancing bursts.

Also, there is a problem with waiting until exactly zero remaining loading segments from the previous burst in a tier: struggling loaders might make pauses between the bursts longer and thus the effective balacing and loading throughput lower. Instead, we may initiate the next burst when only 10-20% of segments are remaining unloaded from the previous burst. We don't want to forgo the "burst - wait" pattern completely and make just so many balancing and loading decisions on each Coordinator's run to add up to the unified limit because this may keep the cluster always close to the threshold of suboptimality of balancing decisions and thus making balancing decisions less optimal overall.

To be able to initiate the next burst when the previous burst is not complete `SegmentReplicantLookup` should be made concurrently updatable. Instead of recreating it before each Coordinator's run, it's an ever-living concurrent data structure.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org