Posted to issues@hbase.apache.org by "Allan Yang (JIRA)" <ji...@apache.org> on 2018/09/30 05:10:00 UTC
[jira] [Commented] (HBASE-21260) The whole balancer plans might be aborted if there are more than one plans to move a same region

    [ https://issues.apache.org/jira/browse/HBASE-21260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16633240#comment-16633240 ] 

Allan Yang commented on HBASE-21260:
------------------------------------

Good find. I believe this issue was introduced in HBASE-20881 by [~Apache9]. Before branch-2, moving one region to different locations at the same time would not cause any exception (the moves would execute sequentially and unassign the region from the right location).

> The whole balancer plans might be aborted if there are more than one plans to move a same region 
> -------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-21260
>                 URL: https://issues.apache.org/jira/browse/HBASE-21260
>             Project: HBase
>          Issue Type: Bug
>          Components: Balancer, master
>    Affects Versions: 2.1.0, 2.0.0
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>
> In SimpleLoadBalancer, plans are first generated from the average number of regions per server for a table. Each server is randomly assigned either floor(average) or ceiling(average) regions (if the average is not an integer). Afterwards, however, the balanceOverall method might generate new plans for some regions of the table to balance server loads across the whole cluster. As a result, a single call of balance can produce more than one plan to move the same region.
> Currently, branch-2 uses async procedures to execute balancer plans. But moving the same region concurrently causes the balance method to fail, and once one plan encounters an exception, none of the subsequent plans are executed.
> We have encountered this problem in practice; the logs are as follows:
> {color:#205081}2018-09-26,12:12:38,224 INFO [master/c4-hadoop-tst-ct15:52900.Chore.1] org.apache.hadoop.hbase.master.HMaster: Balancer plans size is 3757, the balance interval is 79 ms, and the max number regions in transition is 25
> 2018-09-26,12:12:38,224 INFO [master/c4-hadoop-tst-ct15:52900.Chore.1] org.apache.hadoop.hbase.master.HMaster: balance hri=1588230740, source=c4-hadoop-tst-st99.bj,52900,1537522783781, destination=c4-hadoop-tst-st28.bj,52900,1537520009497
> 2018-09-26,12:12:38,325 INFO [master/c4-hadoop-tst-ct15:52900.Chore.1] org.apache.hadoop.hbase.master.HMaster: balance hri=1588230740, source=c4-hadoop-tst-st99.bj,52900,1537522783781, destination=c4-hadoop-tst-st29.bj,52900,1537522784188
> 2018-09-26,12:12:38,325 INFO [PEWorker-16] org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: pid=119197, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=hbase:meta, region=1588230740, REOPEN/MOVE checking lock on 1588230740
> 2018-09-26,12:12:38,325 ERROR [master/c4-hadoop-tst-ct15:52900.Chore.1] org.apache.hadoop.hbase.master.balancer.BalancerChore: Failed to balance.
> org.apache.hadoop.hbase.HBaseIOException: rit=OPEN, location=c4-hadoop-tst-st99.bj,52900,1537522783781, table=hbase:meta, region=1588230740 is currently in transition
>         at org.apache.hadoop.hbase.master.assignment.AssignmentManager.preTransitCheck(AssignmentManager.java:536)
>         at org.apache.hadoop.hbase.master.assignment.AssignmentManager.createMoveRegionProcedure(AssignmentManager.java:592)
>         at org.apache.hadoop.hbase.master.assignment.AssignmentManager.moveAsync(AssignmentManager.java:609)
>         at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1707)
>         at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1622)
>         at org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:49)
>         at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:186)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:111)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745){color}
> This is a serious problem because it often occurs when new RSs start or old RSs fail over. What's more, there is no effective way to bring the cluster's balance back to normal.
> But the solution to this problem may be simple. We can catch exceptions when executing a plan and then just skip that plan, so that a failed plan does not affect later plans in the list. Subsequent calls of balance can pick up the failed and skipped plans.
>  
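
The catch-and-skip idea in the description above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual HBASE-21260 patch and not HBase code: MovePlan, RegionMover, executePlans and rejectingDuplicates are hypothetical names, the short server names are abbreviations, and the toy mover merely mimics the "is currently in transition" rejection seen in the log.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SkipFailedPlansSketch {

    /** Hypothetical stand-in for a balancer plan (not HBase's own plan class). */
    static final class MovePlan {
        final String region;
        final String source;
        final String destination;

        MovePlan(String region, String source, String destination) {
            this.region = region;
            this.source = source;
            this.destination = destination;
        }
    }

    /** Hypothetical stand-in for the master's asynchronous move call. */
    interface RegionMover {
        void moveAsync(MovePlan plan) throws IOException;
    }

    /**
     * Submits every plan in order. A plan that throws is recorded and skipped,
     * so it cannot abort the plans that come after it. Returns the skipped plans.
     */
    static List<MovePlan> executePlans(List<MovePlan> plans, RegionMover mover) {
        List<MovePlan> skipped = new ArrayList<>();
        for (MovePlan plan : plans) {
            try {
                mover.moveAsync(plan);
            } catch (IOException e) {
                // Skip this plan only; a later balance run may regenerate it.
                skipped.add(plan);
            }
        }
        return skipped;
    }

    /**
     * Toy mover (assumption): rejects a region that already has a move in
     * flight, and records the accepted plans in 'submitted'.
     */
    static RegionMover rejectingDuplicates(Set<String> inTransition, List<MovePlan> submitted) {
        return plan -> {
            if (!inTransition.add(plan.region)) {
                throw new IOException("region=" + plan.region + " is currently in transition");
            }
            submitted.add(plan);
        };
    }

    public static void main(String[] args) {
        List<MovePlan> submitted = new ArrayList<>();
        RegionMover mover = rejectingDuplicates(new HashSet<>(), submitted);
        // Two plans for the same region, as in the log, followed by an unrelated plan.
        List<MovePlan> plans = Arrays.asList(
            new MovePlan("1588230740", "st99", "st28"),
            new MovePlan("1588230740", "st99", "st29"),
            new MovePlan("region-b", "st99", "st28"));
        List<MovePlan> skipped = executePlans(plans, mover);
        System.out.println("submitted=" + submitted.size() + ", skipped=" + skipped.size());
    }
}
```

With this shape, the second plan for region 1588230740 is skipped while the unrelated plan after it is still submitted, which is the behavior the description asks for. An alternative would be to drop duplicate plans for the same region before submitting anything; the sketch does not attempt that.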



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)