Posted to issues@hbase.apache.org by "Clara Xiong (Jira)" <ji...@apache.org> on 2022/08/12 20:02:00 UTC

[jira] [Comment Edited] (HBASE-25625) StochasticBalancer CostFunctions needs a better way to evaluate region count distribution

    [ https://issues.apache.org/jira/browse/HBASE-25625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579115#comment-17579115 ] 

Clara Xiong edited comment on HBASE-25625 at 8/12/22 8:01 PM:
--------------------------------------------------------------

[~bbeaudreault] The standard deviation solution, as [~dmanning] pointed out and as we simulated, didn't cover all cases better than linear deviation for balancing decisions. This case is more about triggering. We use a lower threshold (0.001) for our larger (500) clusters, which has worked well for us so far. We might want to add a shortcut that triggers rebalancing if any node has >= 2-fold (or any reasonable threshold) of the average load, just like the shortcut of triggering on an empty node, as a safeguard, instead of trying to find a one-size-fits-all heuristic. What do you think? Or do you have a better proposal?
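
A rough sketch of the shortcut suggested above, for illustration only. It is not HBase code: the function names and the `fold` parameter are hypothetical, and "load" is assumed here to be a per-server region count.

```python
from statistics import mean


def overloaded_servers(region_counts, fold=2.0):
    """Return indexes of servers whose load is >= fold times the average.

    Hypothetical helper: region_counts[i] is the load on server i, and
    fold=2.0 mirrors the ">= 2-fold" threshold floated in the comment.
    """
    if not region_counts:
        return []
    avg = mean(region_counts)
    if avg == 0:
        return []  # nothing is deployed, so nothing can be overloaded
    return [i for i, count in enumerate(region_counts) if count >= fold * avg]


def should_trigger_rebalance(region_counts, fold=2.0):
    """Safeguard check: trigger if at least one server is overloaded."""
    return bool(overloaded_servers(region_counts, fold))


# A 500-server cluster where a single server carries 4x the typical load.
hot_spot = [10] * 499 + [40]
print(should_trigger_rebalance(hot_spot))
print(overloaded_servers(hot_spot))
```

Such a check would sit alongside the cost threshold rather than replace it: the threshold still decides the common case, and the shortcut only catches outliers that the normalized cost is too small to flag.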


was (Author: claraxiong):
[~bbeaudreault] The standard deviation solution, as [~dmanning] pointed out and as we simulated, didn't cover all cases better than linear deviation for balancing decisions. This case is more about triggering. We use a lower threshold (0.001) for our larger(500) clusters. We might want to add a shortcut to trigger rebalancing if any node has >=2 fold(or any reasonable threshold) of average load, just like the shortcut of triggering by empty node, as a safeguard, instead of trying to find the one-fits-all heuristics. What do you think?

> StochasticBalancer CostFunctions needs a better way to evaluate region count distribution
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-25625
>                 URL: https://issues.apache.org/jira/browse/HBASE-25625
>             Project: HBase
>          Issue Type: Improvement
>          Components: Balancer, master
>            Reporter: Clara Xiong
>            Assignee: Clara Xiong
>            Priority: Major
>         Attachments: image-2021-10-05-17-17-50-944.png
>
>
> Currently CostFunctions including RegionCountSkewCostFunctions, PrimaryRegionCountSkewCostFunctions and all load cost functions calculate the unevenness of the distribution by summing the deviation per region server. This simple implementation works when the cluster is small. But when the cluster gets larger, with more region servers and regions, it doesn't work well with hot spots or a small number of unbalanced servers. The proposal is to use the standard deviation of the count per region server to capture the existence of a small portion of region servers with overwhelming load/allocation.
> TableSkewCostFunction uses the sum, over all tables, of the maximum per-server deviation in region count as the measure of unevenness. It doesn't work in a very common operational scenario. Say we have 100 regions on 50 nodes, two on each. We add 50 new nodes and they have 0 each. The max deviation from the mean is 1, compared to 99 in the worst-case scenario of 100 regions on a single server. The normalized cost is 1/99 = 0.010 < default threshold of 0.05, so the balancer wouldn't move any regions. The proposal is to use the standard deviation of the count per region server to detect this scenario, generating a cost of 3.1/31 = 0.1 in this case.
> Patch is in test and will follow shortly.
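
The arithmetic in the description above can be checked with a short sketch. This is an illustration under assumptions, not the balancer's implementation: the function names are made up, and both costs are normalized by the worst case of every region sitting on one server. The intermediate values therefore differ from the 3.1/31 quoted in the description, so the scaling used there is presumably different, but the ratio also comes out at about 0.1.

```python
from statistics import pstdev


def max_deviation_cost(counts):
    """Max deviation from the mean, scaled by the all-on-one-server case."""
    total, n = sum(counts), len(counts)
    avg = total / n
    worst = total - avg  # deviation of the single loaded server
    if worst == 0:
        return 0.0
    return max(abs(c - avg) for c in counts) / worst


def stdev_cost(counts):
    """Population standard deviation, scaled by the same worst case."""
    total, n = sum(counts), len(counts)
    worst = pstdev([total] + [0] * (n - 1))
    if worst == 0:
        return 0.0
    return pstdev(counts) / worst


# 100 regions, two on each of 50 old nodes, plus 50 empty new nodes.
counts = [2] * 50 + [0] * 50
threshold = 0.05  # default threshold quoted in the description

print(round(max_deviation_cost(counts), 3), max_deviation_cost(counts) > threshold)
print(round(stdev_cost(counts), 3), stdev_cost(counts) > threshold)
```

With these assumptions the max-deviation cost stays below the 0.05 threshold while the standard-deviation cost exceeds it, which is the behavior the description argues for.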



--
This message was sent by Atlassian Jira
(v8.20.10#820010)