Posted to issues@hbase.apache.org by "Charlie Qiangeng Xu (JIRA)" <ji...@apache.org> on 2016/11/07 17:21:58 UTC

[jira] [Comment Edited] (HBASE-17039) SimpleLoadBalancer schedules large amount of invalid region moves

    [ https://issues.apache.org/jira/browse/HBASE-17039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15644764#comment-15644764 ] 

Charlie Qiangeng Xu edited comment on HBASE-17039 at 11/7/16 5:21 PM:
----------------------------------------------------------------------

I skimmed through the change history for this part of the code,
and the block causing the problem appears to trace back to HBASE-7060.
The issue described in that JIRA is now handled by other parts of the current SimpleLoadBalancer logic,
so the block in question is no longer necessary and is in fact harmful.
[~yuzhihong@gmail.com], it seems you were involved in that JIRA. Would you be interested in taking a look at this one?


was (Author: xharlie):
I skimmed through the change history for this part of the code,
and the block causing the problem appears to trace back to HBASE-7060.
The problem mentioned in that JIRA is now handled by other parts of the current balancer logic,
so the block in question now only causes problems.
[~yuzhihong@gmail.com], it seems you were involved in that JIRA, any interest to take a look at this one?

> SimpleLoadBalancer schedules large amount of invalid region moves
> -----------------------------------------------------------------
>
>                 Key: HBASE-17039
>                 URL: https://issues.apache.org/jira/browse/HBASE-17039
>             Project: HBase
>          Issue Type: Bug
>          Components: Balancer
>    Affects Versions: 2.0.0, 1.2.3, 1.1.7
>            Reporter: Charlie Qiangeng Xu
>            Assignee: Charlie Qiangeng Xu
>         Attachments: HBASE-17039.patch
>
>
> After growing one of our clusters to 1600 nodes, we observed a large number of invalid region moves (more than 30k) fired by the balance chore. We simulated the problem and printed out the balance plan, and found that many servers holding two regions of a given table (we use the by-table strategy) sent both regions to two other servers that held zero regions.
> In the SimpleLoadBalancer's balanceCluster function,
> the code block that determines the underLoadedServers might have a problem:
> {code}
>       if (load >= min && load > 0) {
>         continue; // look for other servers which haven't reached min
>       }
>       int regionsToPut = min - load;
>       if (regionsToPut == 0)
>       {
>         regionsToPut = 1;
>       }
> {code}
> If min is zero, a server whose load is zero (and therefore equal to min) is still marked as underloaded, which causes the behavior described above.
> Since we grew the cluster to more than 1600 nodes, many tables that have only 1000 regions now hit this issue.
> After fixing this, the balance plan went back to normal.
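
To make the min == 0 case above concrete, here is a minimal, self-contained sketch. It is not the HBase source: the class `UnderloadSketch` and both method names are hypothetical, `regionsToPutQuoted` only mirrors the quoted block, and `regionsToPutAdjusted` is one possible adjustment (an assumption; the attached patch may do something different). It also assumes min is computed by integer division of a table's region count by the server count.

```java
// Hypothetical, simplified model of the underloaded-server check quoted above.
// This is a sketch for illustration, not the actual SimpleLoadBalancer code.
final class UnderloadSketch {

    // Mirrors the quoted block: returns how many regions a server holding
    // `load` regions would be asked to receive, or 0 if the server is skipped.
    static int regionsToPutQuoted(int load, int min) {
        if (load >= min && load > 0) {
            return 0; // already at or above min, not underloaded
        }
        int regionsToPut = min - load;
        if (regionsToPut == 0) {
            // Only reachable when load == min == 0: the empty server is
            // still treated as needing one region.
            regionsToPut = 1;
        }
        return regionsToPut;
    }

    // One possible adjustment (an assumption, not necessarily the patch):
    // a server counts as underloaded only when it is strictly below min.
    static int regionsToPutAdjusted(int load, int min) {
        if (load >= min) {
            return 0;
        }
        return min - load;
    }

    public static void main(String[] args) {
        int regions = 1000;          // regions of one table, as in the report
        int servers = 1600;          // cluster size, as in the report
        int min = regions / servers; // assumed integer division, giving 0

        System.out.println("min = " + min);
        System.out.println("quoted check, empty server:   "
                + regionsToPutQuoted(0, min));
        System.out.println("adjusted check, empty server: "
                + regionsToPutAdjusted(0, min));
    }
}
```

With 1000 regions on 1600 servers, min is 0, so the quoted check asks every empty server to take one region while the adjusted check asks for none. When min is greater than zero the two checks agree.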


