Posted to issues@hbase.apache.org by "Ted Yu (JIRA)" <ji...@apache.org> on 2017/02/04 03:33:51 UTC

[jira] [Comment Edited] (HBASE-17565) StochasticLoadBalancer may incorrectly skip balancing due to skewed multiplier sum

    [ https://issues.apache.org/jira/browse/HBASE-17565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15852542#comment-15852542 ] 

Ted Yu edited comment on HBASE-17565 at 2/4/17 3:33 AM:
--------------------------------------------------------

bq. This is not need to change to 1.01f .

With certain cluster states, total == sumMultiplier, so (total / sumMultiplier) < minCostNeedBalance evaluates to false, which makes testNeedBalance() fail.
Since the underlying bug in needsBalance() is fixed, the above change is needed for the test to pass.
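
A minimal sketch of that comparison follows. It is an illustration only, not the HBase source: the class NeedsBalanceSketch and the method skipsBalancing are hypothetical names, and it assumes the test's threshold was 1.0f before the change (the discussion here only mentions the new value, 1.01f).

```java
public class NeedsBalanceSketch {
  // Hypothetical helper mirroring the comparison described above:
  // balancing is skipped when (total / sumMultiplier) < minCostNeedBalance.
  static boolean skipsBalancing(double total, double sumMultiplier, double minCostNeedBalance) {
    return (total / sumMultiplier) < minCostNeedBalance;
  }

  public static void main(String[] args) {
    // When total == sumMultiplier, the ratio is exactly 1.0.
    double total = 5.0;          // illustrative value, not taken from a real cluster
    double sumMultiplier = 5.0;  // illustrative value, not taken from a real cluster

    // Assumed old threshold: 1.0 < 1.0 is false, so balancing is not skipped.
    System.out.println(skipsBalancing(total, sumMultiplier, 1.0f));   // false
    // New threshold: 1.0 < 1.01 is true, so balancing is skipped as the test expects.
    System.out.println(skipsBalancing(total, sumMultiplier, 1.01f));  // true
  }
}
```

The strict less-than is what makes the boundary case matter: a ratio exactly equal to the threshold does not count as balanced.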

bq. when the cost is zero, then we don't need consider the multiplier

As I commented above, we should consider the aggregate effect of cost multiplied by multiplier, not just the cost itself.

bq. we should make the default multiplier as a small value?

The large multiplier for read replica was obtained through trial and error when developing the read replica feature.
I think we should leave it as is.


was (Author: yuzhihong@gmail.com):
bq. This is not need to change to 1.01f .

With certain cluster state, total == sumMultiplier, leading to testNeedBalance() failure.
Since the underlying bug in needsBalance() is fixed, the above change is needed for test to pass.

bq. when the cost is zero, then we don't need consider the multiplier

As I commented above, we should consider the aggregate effect of cost multiplied by multiplier, not just the cost itself.

bq. we should make the default multiplier as a small value?

The large multiplier for read replica was obtained through trial and error when developing read replica feature.
I think we should leave it as is.

> StochasticLoadBalancer may incorrectly skip balancing due to skewed multiplier sum
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-17565
>                 URL: https://issues.apache.org/jira/browse/HBASE-17565
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>            Priority: Critical
>             Fix For: 2.0.0, 1.4.0
>
>         Attachments: 17565.v1.txt, 17565.v2.txt, 17565.v3.txt
>
>
> I was investigating why a 6 node cluster kept skipping balancing requests.
> Here were the region counts on the servers:
> 449, 448, 447, 449, 453, 0
> {code}
> 2017-01-26 22:04:47,145 INFO  [RpcServer.deafult.FPBQ.Fifo.handler=1,queue=0,port=16000] balancer.StochasticLoadBalancer: Skipping load balancing because balanced cluster; total cost is 127.0171157050385, sum multiplier is 111087.0 min cost which need balance is 0.05
> {code}
> The big multiplier sum caught my eye. Here is what additional debug logging showed:
> {code}
> 2017-01-27 23:25:31,749 DEBUG [RpcServer.deafult.FPBQ.Fifo.handler=9,queue=0,port=16000] balancer.StochasticLoadBalancer: class org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer$RegionReplicaHostCostFunction with multiplier 100000.0
> 2017-01-27 23:25:31,749 DEBUG [RpcServer.deafult.FPBQ.Fifo.handler=9,queue=0,port=16000] balancer.StochasticLoadBalancer: class org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer$RegionReplicaRackCostFunction with multiplier 10000.0
> {code}
> Note, however, that no table in the cluster used read replicas.
> I can think of two ways of fixing this situation:
> 1. If there is no read replica in the cluster, ignore the multipliers for the above two functions.
> 2. When cost() returned by the CostFunction is 0 (or very very close to 0.0), ignore the multiplier.
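
The arithmetic behind the log line, and the effect of option 2, can be sketched as below. This is a hedged illustration, not the HBase implementation or any of the attached patches: MultiplierSumSketch, sumMultiplier, and EPSILON are hypothetical names, the ratio check is inferred from the "Skipping load balancing" message, the remaining 1087.0 of multiplier weight is lumped into a single made-up entry, and the two replica cost functions are assumed to return 0 because no table uses read replicas.

```java
public class MultiplierSumSketch {
  // Hypothetical cutoff for "very close to 0.0".
  static final double EPSILON = 1e-9;

  // Sums the multipliers; with ignoreZeroCost set, functions whose cost is
  // (nearly) zero do not contribute, which is the idea behind option 2.
  static double sumMultiplier(double[] multipliers, double[] costs, boolean ignoreZeroCost) {
    double sum = 0.0;
    for (int i = 0; i < multipliers.length; i++) {
      if (ignoreZeroCost && Math.abs(costs[i]) < EPSILON) {
        continue;
      }
      sum += multipliers[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    double total = 127.0171157050385;   // "total cost" from the log line
    double minCostNeedBalance = 0.05;   // "min cost which need balance" from the log line

    // Replica host and rack multipliers come from the debug log; the third
    // entry stands in for all other cost functions combined (assumption).
    double[] multipliers = {100000.0, 10000.0, 1087.0};
    // Assumed costs: 0 for both replica functions, and a nonzero cost for the rest.
    double[] costs = {0.0, 0.0, total / 1087.0};

    double all = sumMultiplier(multipliers, costs, false);      // 111087.0, as logged
    double nonZero = sumMultiplier(multipliers, costs, true);   // 1087.0

    // About 0.00114 < 0.05, so the cluster is reported as balanced.
    System.out.println(total / all < minCostNeedBalance);       // true
    // About 0.117 >= 0.05, so balancing would proceed.
    System.out.println(total / nonZero < minCostNeedBalance);   // false
  }
}
```

Under these assumptions, the two replica multipliers account for 110000.0 of the 111087.0 denominator while contributing nothing to the total cost, which is why the ratio falls far below the threshold even though one server holds no regions.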


