You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hbase.apache.org by "Jieshan Bean (JIRA)" <ji...@apache.org> on 2011/07/06 05:40:17 UTC

[jira] [Created] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
------------------------------------------------------------------------------------------------------------------------

                 Key: HBASE-4064
                 URL: https://issues.apache.org/jira/browse/HBASE-4064
             Project: HBase
          Issue Type: Bug
          Components: master
    Affects Versions: 0.90.3
            Reporter: Jieshan Bean
             Fix For: 0.90.4


1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:

2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
.....

2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.


 public void unassign(HRegionInfo region, boolean force) {
    LOG.debug("Starting unassignment of region " +
      region.getRegionNameAsString() + " (offlining)");
    synchronized (this.regions) {
      // Check if this region is currently assigned
      if (!regions.containsKey(region)) {
        LOG.debug("Attempted to unassign region " +
          region.getRegionNameAsString() + " but it is not " +
          "currently assigned anywhere");
        return;
      }
    }
    String encodedName = region.getEncodedName();
    // Grab the state of this region and synchronize on it
    RegionState state;
    long stamp = -1;
    synchronized (regionsInTransition) {
      state = regionsInTransition.get(encodedName);
      if (state == null) {
        state = new RegionState(region, RegionState.State.PENDING_CLOSE);
        stamp =state.getStamp();               
        regionsInTransition.put(encodedName, state);
      } else if (force && state.isPendingClose()) {
        LOG.debug("Attempting to unassign region " +
            region.getRegionNameAsString() + " which is already pending close "
            + "but forcing an additional close");
        state.update(RegionState.State.PENDING_CLOSE);
      } else {
        LOG.debug("Attempting to unassign region " +
          region.getRegionNameAsString() + " but it is " +
          "already in transition (" + state.getState() + ")");
        return;
      }
    }
    
    // Send CLOSE RPC
    HServerInfo server = null;
    synchronized (this.regions) {
      server = regions.get(region);
    }
}

2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 

3  The following scenario shows how the above problem 2 happened:

(1)A table have a lot of regions, more than 70000 in my test. 
(2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070311#comment-13070311 ] 

Ted Yu commented on HBASE-4064:
-------------------------------

How did you come up with the duration below ?
{code}
+          Thread.sleep(5 * 1000);
{code}
The above code is inside a while loop. Why is it needed ?

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069098#comment-13069098 ] 

Jean-Daniel Cryans commented on HBASE-4064:
-------------------------------------------

This work should really be done for 0.92, not 0.90, because in its current state the master is _bound_ to have race conditions.

As I have commented in other jiras, the TimeoutMonitor acts out of band with regards to the other threads. While most of the code in the master assumes that every operation is atomic on a single region, the TM doesn't. It can figure out that a RIT expired and then only process it minutes later because of how that code was written. Here's how it works:

 # iterate on all RIT
 ## if the timed out
 ### Put the region in the assign/unassign list (depends on last state)
 ### also depending on state, do some ZK operation (which can take a few tens of ms)
 # Process all the regions to unassign one by one
 # Process all the regions to assign one by one

Here the issue is that at any time a region can change state while the TM runs. Let's say the TM saw region x timing out on PENDING_CLOSE, but then the millisecond later the master receives the notification that the region was finally closed. What happens is that the region will be forced unassigned in the loop at the end (while the region might already be reopened elsewhere). All of this is made even worse by the fact that ZK calls, in aggregate, really take a lot of time. Plus during the assign and unassign loops you need to talk to region servers.

The window for races is so large basically anything can happen.

One reason those operations are slow is that the master is suffering for a lot of lock contention, for reference see HBASE-3789. This is only in 0.92, and that's one more reason to focus our work there.

Something we did late in 0.90 to make it less prone to races is to bump the timeout itself, as I was mentioning in HBASE-3669.

If anything else should be done in 0.90, it would be to disable the TimeoutMonitor as it does more harm than good.

In 0.92 it should be completely redone to work atomically on regions, instead of doing its own thing. 

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064451#comment-13064451 ] 

Ted Yu commented on HBASE-4064:
-------------------------------

@Jieshan:
I think you're suggesting adding sequence Id to region open/close process so that an out-of-order process can abort its operation.

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13079927#comment-13079927 ] 

ramkrishna.s.vasudevan commented on HBASE-4064:
-----------------------------------------------

As JD said timeout monitor has to be refactored, but additionally before changing the state to CLOSING again we could have checked what is the current state of the node. 
Correct me if am wrong.

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061597#comment-13061597 ] 

stack commented on HBASE-4064:
------------------------------

Jieshan.  Please follow the conventions used by the code around you.  See how methods in the class add opening parenthesis on same line as function name rather than on line that follows, that we used two spaces for tabs, and that we have a space around operators -- on both sides, not just on one.

So, how is this 'fix' working?  If timestamp is not -1, we'll remove the region from the in-memory regions in transition?  It could be not -1 if we get a NotServingRegionException when we go to close the region.  That seems good that we cleanup the in-memory RIT if we get an exception (though something else must be going on if we went to close a region on a server that was not carrying it).  Same for the case of a generic Throwable.  Though, are we sure we want to give up on the closing completely?  What if a transient error?  Are there cases where we'd want to retry the close?

When would the state be -1?

Thanks Jieshan.

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-4064-v1.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066300#comment-13066300 ] 

Ted Yu commented on HBASE-4064:
-------------------------------

+1 on the patch v2.

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070106#comment-13070106 ] 

stack commented on HBASE-4064:
------------------------------

bq.  4.When master receives the watcher event, It removes the region from RIT and then remove from regions collection. There is a short window when diable table can't finish in a period(). The region may be unssigned again. My patch try to fix above case. remove regions collection firstly and disable thread can't get a processing region.

What is 'period()' in the above Gao?  Where is it?  Where is the code that does the 'unassign again'?  I'm trying to understand why in particular your patch addresses.  Do you think your patch solves the problem in spite of the race windows described above by J-D?

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "steven zhuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292777#comment-13292777 ] 

steven zhuang commented on HBASE-4064:
--------------------------------------

hi, Ramkrishna, thanks for you reply, I am now using HBase0.92, and no, I cannot upgrade the cluster at present.
actually my problem is a slightly different from this. I sort of closed the ROOT region under certain condition, later the cluster cannot be restarted, saying unsigning ROOT region because it's not assigned anywhere.
We fixed it by clear the root region node from the zookeeper's unsigned nodes list.
thanks anyway.

                
> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.7
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064400#comment-13064400 ] 

Jieshan Bean commented on HBASE-4064:
-------------------------------------

I have happened to several problems due to the two concorrent open/close process exected at the same time. The prev thread change the zk node state, but the latter one found the zk node was not the expect state, so the operation was fail.
If we use a timestamp to identify the a open/close operation flow. So if a the prev flow was timeout, the TimeoutMonitor will restart a new flow and update the timestamp. So once the prev one executed unexpectly, it should check the timestamp first, if it has outdated, it should skip the operation.
is it make any sense?

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070304#comment-13070304 ] 

gaojinchao commented on HBASE-4064:
-----------------------------------

!disableflow.png!

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "Jonathan Hsieh (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13412363#comment-13412363 ] 

Jonathan Hsieh commented on HBASE-4064:
---------------------------------------

There was some recent activity here, anyone planning on finishing this guy? (its been bumped a few times considering bumping it for the 0.90.7 release).                
                
> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.7
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069442#comment-13069442 ] 

gaojinchao commented on HBASE-4064:
-----------------------------------

@J-D Thanks for your replay. :)

I got it. In my case, The race is between the disable threads and ClosedRegionHandler threads.

 1.Disable thread get region from regions collection (reference getRegionsOfTable)

 2.Thread pool gets region and sends request to region server. at the same time puts region into RIT(regionsInTransition),   it indicates that the region is processing.

 3.Region server finishs closing region and changes the zk state, notifies the master.

 4.When master receives the watcher event, It removes the region from RIT and then remove from regions collection.
   There is a short window when diable table can't finish in a period(). The region may be unssigned again.

My patch try to fix above case. remove regions collection firstly and disable thread can't get a processing region.


I found the issue yestertay, Enable threads is also a race condition.  
(I changed the period for 1 minutes because of reproducing the issue). It seems pool couldn't finish but a new enable process starts. we need a sleep time when a enable period finishes

The master logs:
2011-07-22 13:33:27,806 INFO org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 regions of which 2156 are online.

2011-07-22 13:34:28,646 INFO org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 regions of which 982 are online.
2011-07-22 13:34:31,079 WARN org.apache.hadoop.hbase.master.AssignmentManager: gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,080 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x31502ef4f00000 Creating (or updating) unassigned node for c9b1c97ac6c00033ceb1890e45e66229 with OFFLINE state
2011-07-22 13:34:31,104 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. state=OFFLINE, ts=1311312871080
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. so generated a random one; hri=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229., src=, dest=C4C2.site,60020,1311310281335; 3 (online=3, exclude=null) available servers
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. to C4C2.site,60020,1311310281335
2011-07-22 13:34:31,122 WARN org.apache.hadoop.hbase.master.AssignmentManager: gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,123 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state trying to OFFLINE; ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. state=PENDING_OPEN, ts=1311312871121
java.lang.IllegalStateException
	at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1081)
	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1036)
	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:864)
	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:844)
	at org.apache.hadoop.hbase.master.handler.EnableTableHandler$BulkEnabler$1.run(EnableTableHandler.java:154)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
2011-07-22 13:34:31,125 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
2011-07-22 13:34:31,482 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:60000-0x31502ef4f00000 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/c9b1c97ac6c00033ceb1890e45e66229
2011-07-22 13:34:31,482 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x31502ef4f00000 Unable to get data of znode /hbase/unassigned/c9b1c97ac6c00033ceb1890e45e66229
 


> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068812#comment-13068812 ] 

stack commented on HBASE-4064:
------------------------------

And what did you see. Do you have log snippets that demonstrate the fix working?  Thanks Gao.

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "steven zhuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13397332#comment-13397332 ] 

steven zhuang commented on HBASE-4064:
--------------------------------------

hi, Ramkrishna, sorry for the late response. 

This turn of event came because we wanted to update the hbase jar to enable our table-balancer without turning the whole cluster off. 

Here is what I did:
  1. turn off first half RS of the cluster, update the jars
  2. turn on the first half RS, (used "hbase-daemon.sh stop " on the RS which holds the META region coz it won't go down )
  3. turn off the second half RS of the cluster, update the jars,
  4. turn on the second half RS
At this time, I found the ROOT region is in transition, and status was CLOSING although the RS hold it was down . So I tried to restart the whole cluster, the cluster cannot be started now, coz the RS which master believes holds the ROOT region has no region served, master can only give out "NotServingRegionException". I closed all the RS, the started the master again, only to find the master trying to reach the same "ROOT-holding" RS which was down then, and gave "ServerNotRunningYetException" after dozens of retries.

I guess I restarted the "half cluster" too soon after I closed it, so root region was still in transition when the original RS restarted, and later I force closed the whole cluster, but status in ZK cluster stayed. so later when I restarted the cluster, the master tried to close the ROOT region instead of start it, but that time, the region is not on any RS, master then fall into a (dead?) loop.

the log files is all messed up, with splitlog and other crappy exception log, and is way too big(even just the grepped part) to upload, so hope my description helps.
                
> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.7
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064347#comment-13064347 ] 

Jieshan Bean commented on HBASE-4064:
-------------------------------------

I'm trying to make the change about the patch.. But I think it should be more considerate..
I try me best to commit the new patch today.
Thanks stack.

For the code format, sorry:(, I made the same mistake again, for our code format is not exact the same.

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-4064-v1.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070306#comment-13070306 ] 

gaojinchao commented on HBASE-4064:
-----------------------------------

The patch can't solve J-D issue. But it is improvement for disable table.

I make a flow chart(A ->B ->C->D). We can find there is a window between Remove region from RIT and Remove region from region clellections. So my patch want to change the positon.


> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jieshan Bean updated HBASE-4064:
--------------------------------

    Attachment: HBASE-4064-v1.patch

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-4064-v1.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4064:
-------------------------

    Fix Version/s:     (was: 0.90.4)
                   0.90.5

Moving it out of 0.90.4 in the meantime.

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076000#comment-13076000 ] 

gaojinchao commented on HBASE-4064:
-----------------------------------

Do we need fix this issue? If it need I will test it. or I will close the issue ?

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064720#comment-13064720 ] 

stack commented on HBASE-4064:
------------------------------

And it makes sense to me (Isn't there one already integral to zk?)

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066240#comment-13066240 ] 

Ted Yu commented on HBASE-4064:
-------------------------------

@Jinchao:
Please generate the patch at root level, not under src directory

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293328#comment-13293328 ] 

ramkrishna.s.vasudevan commented on HBASE-4064:
-----------------------------------------------

@Steven
Can you provide us some logs to see the problem.  Because its important we solve such issues.  May be it exists in other versions also?
Thanks Steve.
                
> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.7
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292731#comment-13292731 ] 

ramkrishna.s.vasudevan commented on HBASE-4064:
-----------------------------------------------

@Steven
Which version of Hbase you are using?
In the latest versions HBase-0.94 and above we are trying to solve these type of issues.  Can you try with those versions?
Thanks.
                
> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.7
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070310#comment-13070310 ] 

gaojinchao commented on HBASE-4064:
-----------------------------------

I have made a patch, but I don't verify now. I want to review whether is reasonable firstly. then do it.

In my cluster I had changed the parameter(hbase.bulk.assignment.waiton.empty.rit) to avoid this issue.


> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaojinchao updated HBASE-4064:
------------------------------

    Attachment:     (was: HBASE-4064_branch90V2.patch)

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068843#comment-13068843 ] 

gaojinchao commented on HBASE-4064:
-----------------------------------

I reproduced the issue by the above code. and modifed for below code, the issue can't reproduce.(about 2000 region)
So I think V2 is good..
  public void regionOffline(final HRegionInfo regionInfo) {

    // remove the region plan as well just in case.
   clearRegionPlan(regionInfo);
   setOffline(regionInfo);
       
   try{
      Thread.sleep(1000);
    }catch(Throwable e){
      ;
    }
   synchronized(this.regionsInTransition) {
      if (this.regionsInTransition.remove(regionInfo.getEncodedName()) != null) {
        this.regionsInTransition.notifyAll();
      }
    }

  }
I don't collect the logs. So I produce it again. the logs:

//disable table start.
hbase(main):002:0> disable 'ufdr5'
11/07/21 15:07:25 DEBUG zookeeper.ZKUtil: hconnection-0x314b823b0a0005 Set watcher on existing znode /hbase/root-region-server

hmaster logs:
2011-07-21 15:10:42,300 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a. state=CLOSED, ts=1311232060203
2011-07-21 15:10:42,300 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 32cb070f97e3b7b0d261f427a0fc110a has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown
2011-07-21 15:12:25,672 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 32cb070f97e3b7b0d261f427a0fc110a
2011-07-21 15:12:25,672 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a.
2011-07-21 15:12:25,672 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x314b823b0a0000 Deleting existing unassigned node for 32cb070f97e3b7b0d261f427a0fc110a that is in expected state RS_ZK_REGION_CLOSED
2011-07-21 15:12:25,673 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x314b823b0a0000 Retrieved 107 byte(s) of data from znode /hbase/unassigned/32cb070f97e3b7b0d261f427a0fc110a; data=region=ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a., server=C4C2.site,60020,1311231851946, state=RS_ZK_REGION_CLOSED
2011-07-21 15:12:25,688 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:60000-0x314b823b0a0000 Received ZooKeeper Event, type=NodeDeleted, state=SyncConnected, path=/hbase/unassigned/32cb070f97e3b7b0d261f427a0fc110a
2011-07-21 15:12:25,688 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x314b823b0a0000 Successfully deleted unassigned node for region 32cb070f97e3b7b0d261f427a0fc110a in expected state RS_ZK_REGION_CLOSED
2011-07-21 15:12:26,383 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a. (offlining)
2011-07-21 15:12:26,391 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server serverName=C4C2.site,60020,1311231851946, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) returned org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Received close for ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a. but we are not serving it for 32cb070f97e3b7b0d261f427a0fc110a
2011-07-21 15:15:32,300 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a. state=PENDING_CLOSE, ts=1311232346385
2011-07-21 15:15:32,301 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a.
2011-07-21 15:15:32,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x314b823b0a0000 Set watcher on existing znode /hbase/unassigned/32cb070f97e3b7b0d261f427a0fc110a
2011-07-21 15:15:32,303 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a. (offlining)
2011-07-21 15:15:32,303 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a. but it is not currently assigned anywhere
2011-07-21 15:15:42,300 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a. state=PENDING_CLOSE, ts=1311232346385
2011-07-21 15:15:42,300 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a.
2011-07-21 15:15:42,301 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:60000-0x314b823b0a0000 Set watcher on existing znode /hbase/unassigned/32cb070f97e3b7b0d261f427a0fc110a
2011-07-21 15:15:42,303 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region ufdr5,0670286138,1311057525904.32cb070f97e3b7b0d261f427a0fc110a. (offlining) 

The log after modified code 

//I do this several times
hbase(main):004:0> disable 'ufdr5'
11/07/21 15:42:52 DEBUG zookeeper.ZKUtil: hconnection-0x314b967b920003 Set watcher on existing znode /hbase/root-region-server

2011-07-21 15:50:21,443 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:60000-0x1314b9678830000 Received ZooKeeper Event, type=NodeChildrenChanged, state=SyncConnected, path=/hbase/unassigned
2011-07-21 15:50:21,571 INFO org.apache.hadoop.hbase.master.handler.DisableTableHandler: Disabled table is done=true
2011-07-21 15:51:26,908 DEBUG org.apache.hadoop.hbase.master.LoadBalancer: Server information: C4C3.site,60020,1311233178997=0, C4C2.site,60020,1311233178981=1, C4C1.site,60020,1311233179039=1
2011-07-21 15:51:26,908 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing.  servers=3 regions=2 average=0.6666667 mostloaded=1 leastloaded=0
2011-07-21 15:51:27,402 DEBUG org.apache.hadoop.hbase.master.CatalogJanitor: Scanned 2156 catalog row(s) and gc'd 0 unreferenced parent region(s)



> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069100#comment-13069100 ] 

stack commented on HBASE-4064:
------------------------------

Yah!  J-D is back!

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066305#comment-13066305 ] 

stack commented on HBASE-4064:
------------------------------

Gao v2 looks good to me too; do all tests pass and how do we know it works?  How did you test it?

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068796#comment-13068796 ] 

gaojinchao commented on HBASE-4064:
-----------------------------------

Hi, I verified the issue by adding a sleep in regionOffline. I think V2 is ok.

below code:
 public void regionOffline(final HRegionInfo regionInfo) {
    synchronized(this.regionsInTransition) {
      if (this.regionsInTransition.remove(regionInfo.getEncodedName()) != null) {
        this.regionsInTransition.notifyAll();
      }
    }
    try{
      Thread.sleep(1000);
    }catch(Throwable e){
      ;
    }
    // remove the region plan as well just in case.
    clearRegionPlan(regionInfo);
    setOffline(regionInfo);
  }

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "Jieshan Bean (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064453#comment-13064453 ] 

Jieshan Bean commented on HBASE-4064:
-------------------------------------

yes. is that ok?

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaojinchao updated HBASE-4064:
------------------------------

    Attachment: HBASE-4064_branch90V2.patch

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaojinchao updated HBASE-4064:
------------------------------

    Attachment: HBASE-4064_branch90V2.patch

I try to make a patch. if the region is in RIT, It shouldn't be unsigned again. So it seems changing the code position can solve this issue. 
ALL test passed, Please review and give some suggesion.

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066387#comment-13066387 ] 

gaojinchao commented on HBASE-4064:
-----------------------------------

@Stack:
I will reproduce and verify it after finishing review. Because it may spend a lot of time.


> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "gaojinchao (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139860#comment-13139860 ] 

gaojinchao commented on HBASE-4064:
-----------------------------------

I found enable table issue was fixed in HBASE-4395. 
I think this issue should use the same way to fix.
                
> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064454#comment-13064454 ] 

Ted Yu commented on HBASE-4064:
-------------------------------

It makes sense to me.

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068876#comment-13068876 ] 

gaojinchao commented on HBASE-4064:
-----------------------------------

Please don't merge the patch, I found other issue and need dig whether is relation to the patch. Thanks.


> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaojinchao updated HBASE-4064:
------------------------------

    Attachment: disableflow.png

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082212#comment-13082212 ] 

gaojinchao commented on HBASE-4064:
-----------------------------------

I will study the code for trunk and confirm that have fixed.

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "gaojinchao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070331#comment-13070331 ] 

gaojinchao commented on HBASE-4064:
-----------------------------------

Master may be crashed because of pool shutdown is asynchronous. 

The master show :
2011-07-22 13:33:27,806 INFO org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 regions of which 2156 are online.

2011-07-22 13:34:28,646 INFO org.apache.hadoop.hbase.master.handler.EnableTableHandler: Table has 2156 regions of which 982 are online.
2011-07-22 13:34:31,079 WARN org.apache.hadoop.hbase.master.AssignmentManager: gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,080 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x31502ef4f00000 Creating (or updating) unassigned node for c9b1c97ac6c00033ceb1890e45e66229 with OFFLINE state
2011-07-22 13:34:31,104 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. state=OFFLINE, ts=1311312871080
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. so generated a random one; hri=ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229., src=, dest=C4C2.site,60020,1311310281335; 3 (online=3, exclude=null) available servers
2011-07-22 13:34:31,121 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. to C4C2.site,60020,1311310281335
2011-07-22 13:34:31,122 WARN org.apache.hadoop.hbase.master.AssignmentManager: gjc:xxx ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229.
2011-07-22 13:34:31,123 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state trying to OFFLINE; ufdr5,0590386138,1311057525896.c9b1c97ac6c00033ceb1890e45e66229. state=PENDING_OPEN, ts=1311312871121
java.lang.IllegalStateException
at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1081)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1036)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:864)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:844)
at java.lang.Thread.run(Thread.java:662)
2011-07-22 13:34:31,125 INFO org.apache.hadoop.hbase.master.HMaster: Aborting


> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.5
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "steven zhuang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292724#comment-13292724 ] 

steven zhuang commented on HBASE-4064:
--------------------------------------

any one still work around this issue? Pls tell me how to fix this. Thanks!
                
> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.7
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064344#comment-13064344 ] 

stack commented on HBASE-4064:
------------------------------

Any progress on this one Jieshan?

> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-4064-v1.patch
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4064) Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."

Posted by "ramkrishna.s.vasudevan (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-4064:
------------------------------------------

    Fix Version/s:     (was: 0.90.6)
                   0.90.7
    
> Two concurrent unassigning of the same region caused the endless loop of "Region has been PENDING_CLOSE for too long..."
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4064
>                 URL: https://issues.apache.org/jira/browse/HBASE-4064
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>             Fix For: 0.90.7
>
>         Attachments: HBASE-4064-v1.patch, HBASE-4064_branch90V2.patch, disableflow.png
>
>
> 1. If there is a "rubbish" RegionState object with "PENDING_CLOSE" in regionsInTransition(The RegionState was remained by some exception which should be removed, that's why I called it as "rubbish" object), but the region is not currently assigned anywhere, TimeoutMonitor will fall into an endless loop:
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:31,207 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:31,215 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out:  test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301
> 2011-06-27 10:32:41,164 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f.
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining)
> 2011-06-27 10:32:41,172 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere
> .....
> 2  In the following scenario, two concurrent unassigning call of the same region may lead to the above problem:
> the first unassign call send rpc call success, the master watched the event of "RS_ZK_REGION_CLOSED", process this event, will create a ClosedRegionHandler to remove the state of the region in master.eg.
> while ClosedRegionHandler is running in  "hbase.master.executor.closeregion.threads" thread (A), another unassign call of same region run in another thread(B).
> while thread B  run "if (!regions.containsKey(region))", this.regions have the region info, now  cpu switch to thread A.
> The thread A will remove the region from the sets of "this.regions" and "regionsInTransition", then switch to thread B. the thread B run continue, will throw an exception with the msg of "Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f", but without removing the new-adding RegionState from "regionsInTransition",and it can not be removed for ever.
>  public void unassign(HRegionInfo region, boolean force) {
>     LOG.debug("Starting unassignment of region " +
>       region.getRegionNameAsString() + " (offlining)");
>     synchronized (this.regions) {
>       // Check if this region is currently assigned
>       if (!regions.containsKey(region)) {
>         LOG.debug("Attempted to unassign region " +
>           region.getRegionNameAsString() + " but it is not " +
>           "currently assigned anywhere");
>         return;
>       }
>     }
>     String encodedName = region.getEncodedName();
>     // Grab the state of this region and synchronize on it
>     RegionState state;
>     long stamp = -1;
>     synchronized (regionsInTransition) {
>       state = regionsInTransition.get(encodedName);
>       if (state == null) {
>         state = new RegionState(region, RegionState.State.PENDING_CLOSE);
>         stamp =state.getStamp();               
>         regionsInTransition.put(encodedName, state);
>       } else if (force && state.isPendingClose()) {
>         LOG.debug("Attempting to unassign region " +
>             region.getRegionNameAsString() + " which is already pending close "
>             + "but forcing an additional close");
>         state.update(RegionState.State.PENDING_CLOSE);
>       } else {
>         LOG.debug("Attempting to unassign region " +
>           region.getRegionNameAsString() + " but it is " +
>           "already in transition (" + state.getState() + ")");
>         return;
>       }
>     }
>     
>     // Send CLOSE RPC
>     HServerInfo server = null;
>     synchronized (this.regions) {
>       server = regions.get(region);
>     }
> }
> 2011-06-27 10:20:59,583 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:20:59,585 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=158-1-91-101,20020,1308983865292, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/9a6e26d40293663a79523c58315b930f (region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f., server=158-1-91-101,20020,1308983865292, state=RS_ZK_REGION_CLOSED) 
> 2011-06-27 10:22:15,299 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=158-1-91-101,20020,1308983865292, region=9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=CLOSED, ts=1309141335247 
> 2011-06-27 10:25:22,636 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region 9a6e26d40293663a79523c58315b930f has been CLOSED for too long, waiting on queued ClosedRegionHandler to run or server shutdown 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:25:55,022 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Deleting existing unassigned node for 9a6e26d40293663a79523c58315b930f that is in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,101 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x230ba8b85230036 Successfully deleted unassigned node for region 9a6e26d40293663a79523c58315b930f in expected state RS_ZK_REGION_CLOSED 
> 2011-06-27 10:25:55,301 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:25:55,302 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 9a6e26d40293663a79523c58315b930f 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. state=PENDING_CLOSE, ts=1309141555301 
> 2011-06-27 10:32:21,326 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. 
> 2011-06-27 10:32:21,438 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. (offlining) 
> 2011-06-27 10:32:21,441 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Attempted to unassign region test2,070712,1308971310309.9a6e26d40293663a79523c58315b930f. but it is not currently assigned anywhere 
> 3  The following scenario shows how the above problem 2 happened:
> (1)A table have a lot of regions, more than 70000 in my test. 
> (2)Disable the table, if 'BulkDisabler.waitUntilDone' timeout, 'DisableTableHandler.process' will create another BulkDisabler object and start its thread pool. The region which was still online will call AssignmentManager.unassign again. so the same region "AssignmentManager.unassign" could be called concurrentlly more than 1. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira