You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2011/08/31 01:02:09 UTC

[jira] [Created] (HBASE-4306) Race between CatalogJanitor and LoadBalancer

Race between CatalogJanitor and LoadBalancer
--------------------------------------------

                 Key: HBASE-4306
                 URL: https://issues.apache.org/jira/browse/HBASE-4306
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.90.4
            Reporter: Jean-Daniel Cryans
            Priority: Blocker
             Fix For: 0.90.5


It is possible for the LoadBalancer to try to assign an offline/split region while it is waiting to be CatalogJanitor'ed. It goes like this:

{quote}
2011-08-25 00:32:07,137 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: parent: Daughters; d1, d2 from sv4r22s16,60020,1314211225331
...
(cleaning never happens or whatever)
...
2011-08-29 13:45:14,561 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=parent, src=sv4r22s16,60020,1314211225331, dest=sv4r19s17,60020,1314218170402
2011-08-29 13:45:14,561 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region parent (offlining)
2011-08-29 13:45:14,588 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server serverName=sv4r22s16,60020,1314211225331, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) returned org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Received close for parent but we are not serving it for parent
{quote}

Here it took 4 days of balancing to finally get to try to balance the parent (that was never deleted because of HBASE-4238), but it can also happen if the balancer decides to balance the parent just before it's cleaned. The end effect is that the balancer will be disabled _forever_ until that's fixed.

The culprit here is that the master keeps the region "online" until AssignmentManager.regionOffline is called by the CJ, which means it's still treated like any other region although it's offline.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4306) Race between CatalogJanitor and LoadBalancer

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103864#comment-13103864 ] 

stack commented on HBASE-4306:
------------------------------

When a region splits, handleSplitReport is called on master.  It calls AM.regionOffline so the split parent region should be cleared from AM.regions.  It should not be in set to balance.

HBASE-4238 being fixed should at least change this from being a blocker to something less?

> Race between CatalogJanitor and LoadBalancer
> --------------------------------------------
>
>                 Key: HBASE-4306
>                 URL: https://issues.apache.org/jira/browse/HBASE-4306
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.92.0, 0.90.5
>
>
> It is possible for the LoadBalancer to try to assign an offline/split region while it is waiting to be CatalogJanitor'ed. It goes like this:
> {quote}
> 2011-08-25 00:32:07,137 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: parent: Daughters; d1, d2 from sv4r22s16,60020,1314211225331
> ...
> (cleaning never happens or whatever)
> ...
> 2011-08-29 13:45:14,561 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=parent, src=sv4r22s16,60020,1314211225331, dest=sv4r19s17,60020,1314218170402
> 2011-08-29 13:45:14,561 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region parent (offlining)
> 2011-08-29 13:45:14,588 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server serverName=sv4r22s16,60020,1314211225331, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) returned org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Received close for parent but we are not serving it for parent
> {quote}
> Here it took 4 days of balancing to finally get to try to balance the parent (that was never deleted because of HBASE-4238), but it can also happen if the balancer decides to balance the parent just before it's cleaned. The end effect is that the balancer will be disabled _forever_ until that's fixed.
> The culprit here is that the master keeps the region "online" until AssignmentManager.regionOffline is called by the CJ, which means it's still treated like any other region although it's offline.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4306) Race between CatalogJanitor and LoadBalancer

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4306:
-------------------------

         Priority: Minor  (was: Blocker)
    Fix Version/s:     (was: 0.90.5)
                       (was: 0.92.0)

> Race between CatalogJanitor and LoadBalancer
> --------------------------------------------
>
>                 Key: HBASE-4306
>                 URL: https://issues.apache.org/jira/browse/HBASE-4306
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: Jean-Daniel Cryans
>            Priority: Minor
>
> It is possible for the LoadBalancer to try to assign an offline/split region while it is waiting to be CatalogJanitor'ed. It goes like this:
> {quote}
> 2011-08-25 00:32:07,137 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: parent: Daughters; d1, d2 from sv4r22s16,60020,1314211225331
> ...
> (cleaning never happens or whatever)
> ...
> 2011-08-29 13:45:14,561 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=parent, src=sv4r22s16,60020,1314211225331, dest=sv4r19s17,60020,1314218170402
> 2011-08-29 13:45:14,561 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region parent (offlining)
> 2011-08-29 13:45:14,588 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server serverName=sv4r22s16,60020,1314211225331, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) returned org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Received close for parent but we are not serving it for parent
> {quote}
> Here it took 4 days of balancing to finally get to try to balance the parent (that was never deleted because of HBASE-4238), but it can also happen if the balancer decides to balance the parent just before it's cleaned. The end effect is that the balancer will be disabled _forever_ until that's fixed.
> The culprit here is that the master keeps the region "online" until AssignmentManager.regionOffline is called by the CJ, which means it's still treated like any other region although it's offline.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-4306) Race between CatalogJanitor and LoadBalancer

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4306:
-------------------------

    Fix Version/s: 0.92.0

> Race between CatalogJanitor and LoadBalancer
> --------------------------------------------
>
>                 Key: HBASE-4306
>                 URL: https://issues.apache.org/jira/browse/HBASE-4306
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.92.0, 0.90.5
>
>
> It is possible for the LoadBalancer to try to assign an offline/split region while it is waiting to be CatalogJanitor'ed. It goes like this:
> {quote}
> 2011-08-25 00:32:07,137 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: parent: Daughters; d1, d2 from sv4r22s16,60020,1314211225331
> ...
> (cleaning never happens or whatever)
> ...
> 2011-08-29 13:45:14,561 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=parent, src=sv4r22s16,60020,1314211225331, dest=sv4r19s17,60020,1314218170402
> 2011-08-29 13:45:14,561 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region parent (offlining)
> 2011-08-29 13:45:14,588 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server serverName=sv4r22s16,60020,1314211225331, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) returned org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Received close for parent but we are not serving it for parent
> {quote}
> Here it took 4 days of balancing to finally get to try to balance the parent (that was never deleted because of HBASE-4238), but it can also happen if the balancer decides to balance the parent just before it's cleaned. The end effect is that the balancer will be disabled _forever_ until that's fixed.
> The culprit here is that the master keeps the region "online" until AssignmentManager.regionOffline is called by the CJ, which means it's still treated like any other region although it's offline.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4306) Race between CatalogJanitor and LoadBalancer

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103873#comment-13103873 ] 

stack commented on HBASE-4306:
------------------------------

Chatting with J-D.  Something else must be going on here if the parent region is in the set of regions to balance, the split message must have been missed.

Changing this from blocker to major.  Removing as necessary fix on 0.92. and 0.90.5 till we learn more.

> Race between CatalogJanitor and LoadBalancer
> --------------------------------------------
>
>                 Key: HBASE-4306
>                 URL: https://issues.apache.org/jira/browse/HBASE-4306
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.4
>            Reporter: Jean-Daniel Cryans
>            Priority: Blocker
>
> It is possible for the LoadBalancer to try to assign an offline/split region while it is waiting to be CatalogJanitor'ed. It goes like this:
> {quote}
> 2011-08-25 00:32:07,137 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: parent: Daughters; d1, d2 from sv4r22s16,60020,1314211225331
> ...
> (cleaning never happens or whatever)
> ...
> 2011-08-29 13:45:14,561 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=parent, src=sv4r22s16,60020,1314211225331, dest=sv4r19s17,60020,1314218170402
> 2011-08-29 13:45:14,561 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region parent (offlining)
> 2011-08-29 13:45:14,588 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server serverName=sv4r22s16,60020,1314211225331, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) returned org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Received close for parent but we are not serving it for parent
> {quote}
> Here it took 4 days of balancing to finally get to try to balance the parent (that was never deleted because of HBASE-4238), but it can also happen if the balancer decides to balance the parent just before it's cleaned. The end effect is that the balancer will be disabled _forever_ until that's fixed.
> The culprit here is that the master keeps the region "online" until AssignmentManager.regionOffline is called by the CJ, which means it's still treated like any other region although it's offline.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira