You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2010/06/23 09:45:49 UTC
[jira] Commented: (HBASE-2617) Load balancer falls into pathological state if one server under average - slop; endless churn

    [ https://issues.apache.org/jira/browse/HBASE-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881601#action_12881601 ] 

stack commented on HBASE-2617:
------------------------------

Forgot to apply to trunk.  Just did.  Attached patch.

> Load balancer falls into pathological state if one server under average - slop; endless churn
> ---------------------------------------------------------------------------------------------
>
>                 Key: HBASE-2617
>                 URL: https://issues.apache.org/jira/browse/HBASE-2617
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.20.5
>
>         Attachments: 2617-v2-trunk.txt, 2617-v2.txt, 2617.txt
>
>
> I'm looking at a 0.20.4 cluster of 80 fast machines.  It runs fine for a while and then falls over into crazy balancing churn (My view on logs is sporadic but have a log before me where this is happening).  Things I see that seem to be of 0.20.4 particularly:
> + We don't reach an equilibrium or at least it takes so long, its as though it wasn't every going to happen
> + Master log filled w/ open, close of one or two regions  usually the same ones over and over (The Regions that are candidates to close are provided by the RS.  They are ordered by hash of their name.  We return the top ten from this Set every time.  So, we always close the same regions all the time even if we just opened it)
> + Often, we'll tell an RS to close a region.  It will do the job.  In 0.20.4 we made it so if RS has any work at all for the master, that we return immediately rather than wait for the reporting period to elaspse.   So, on these fast machines, it can be back near immediately if it just opened another some other region, say.  It can get assigned the region it just closed.  Seems to happen frequently enough.
> For example, look at the below extract featuring a single regions life.  Its opened and closed 5 times in about 1/2 a second:
> {code}
> 2010-05-25 11:01:05,488 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo: ProcessRegionClose of GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,489 INFO org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,490 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region GenericMetaStore,139757491,1274779304880 to a025.example.com,60020,1274744064673
> 2010-05-25 11:01:05,510 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from a025.example.com,60020,1274744064673; 1 of 1
> 2010-05-25 11:01:05,510 INFO org.apache.hadoop.hbase.master.RegionServerOperation: GenericMetaStore,139757491,1274779304880 open on 10.209.32.189:60020
> 2010-05-25 11:01:05,511 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated row GenericMetaStore,139757491,1274779304880 in region .META.,,1 with startcode=1274744064673, server=10.209.32.189:60020
> 2010-05-25 11:01:05,548 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close region GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,552 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from a025.example.com,60020,1274744064673; 1 of 2
> 2010-05-25 11:01:05,552 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo: ProcessRegionClose of GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,552 INFO org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,556 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region GenericMetaStore,139757491,1274779304880 to a028.example.com,60020,1274747560769
> 2010-05-25 11:01:05,578 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from a028.example.com,60020,1274747560769; 1 of 1
> 2010-05-25 11:01:05,578 INFO org.apache.hadoop.hbase.master.RegionServerOperation: GenericMetaStore,139757491,1274779304880 open on 10.209.32.185:60020
> 2010-05-25 11:01:05,579 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated row GenericMetaStore,139757491,1274779304880 in region .META.,,1 with startcode=1274747560769, server=10.209.32.185:60020
> 2010-05-25 11:01:05,599 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close region GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,605 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from a028.example.com,60020,1274747560769; 1 of 2
> 2010-05-25 11:01:05,605 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo: ProcessRegionClose of GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,606 INFO org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,607 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region GenericMetaStore,139757491,1274779304880 to sjc1c104.example.com,60020,1274747062601
> 2010-05-25 11:01:05,640 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from sjc1c104.example.com,60020,1274747062601; 1 of 1
> 2010-05-25 11:01:05,640 INFO org.apache.hadoop.hbase.master.RegionServerOperation: GenericMetaStore,139757491,1274779304880 open on 10.209.42.181:60020
> 2010-05-25 11:01:05,641 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated row GenericMetaStore,139757491,1274779304880 in region .META.,,1 with startcode=1274747062601, server=10.209.42.181:60020
> 2010-05-25 11:01:05,723 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close region GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,729 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from sjc1c104.example.com,60020,1274747062601; 1 of 4
> 2010-05-25 11:01:05,729 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo: ProcessRegionClose of GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,730 INFO org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,731 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region GenericMetaStore,139757491,1274779304880 to sjc1c091.example.com,60020,1274747056415
> 2010-05-25 11:01:05,751 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from sjc1c091.example.com,60020,1274747056415; 1 of 1
> 2010-05-25 11:01:05,752 INFO org.apache.hadoop.hbase.master.RegionServerOperation: GenericMetaStore,139757491,1274779304880 open on 10.209.42.238:60020
> 2010-05-25 11:01:05,752 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated row GenericMetaStore,139757491,1274779304880 in region .META.,,1 with startcode=1274747056415, server=10.209.42.238:60020
> 2010-05-25 11:01:05,775 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close region GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,780 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_CLOSE: GenericMetaStore,139757491,1274779304880 from sjc1c091.example.com,60020,1274747056415; 1 of 2
> 2010-05-25 11:01:05,780 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo: ProcessRegionClose of GenericMetaStore,139757491,1274779304880, false, reassign: true
> 2010-05-25 11:01:05,780 INFO org.apache.hadoop.hbase.master.ProcessRegionClose$1: region set as unassigned: GenericMetaStore,139757491,1274779304880
> 2010-05-25 11:01:05,808 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region GenericMetaStore,139757491,1274779304880 to sjc1a003.example.com,60020,1274747057557
> 2010-05-25 11:01:05,828 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_OPEN: GenericMetaStore,139757491,1274779304880 from sjc1a003.example.com,60020,1274747057557; 1 of 1
> 2010-05-25 11:01:05,828 INFO org.apache.hadoop.hbase.master.RegionServerOperation: GenericMetaStore,139757491,1274779304880 open on 10.209.32.148:60020
> 2010-05-25 11:01:05,829 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated row GenericMetaStore,139757491,1274779304880 in region .META.,,1 with startcode=1274747057557, server=10.209.32.148:60020
> {code}
> The culprit seems to be the code that wants to bring up underloaded regionservers up to average.  That and something about the lightly loaded servers math that is off.
> I'm marking this as a blocker.  I'm not sure why its not more common.  There were some issues on this cluster regards disks filling but though such an event may have provoked the issue, we should have evened out eventually.
> Making this a blocker on 0.20.5.   Need to fix it for this user at least.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.