You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@hbase.apache.org by "Ming Ma (JIRA)" <ji...@apache.org> on 2011/09/22 02:21:26 UTC

[jira] [Created] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
----------------------------------------------------------------------------------

                 Key: HBASE-4455
                 URL: https://issues.apache.org/jira/browse/HBASE-4455
             Project: HBase
          Issue Type: Bug
            Reporter: Ming Ma


Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.

1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.




T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
n-server and set watcher; sea-lab-3,60020,1316380037898

T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656


T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898

T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656


2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113135#comment-13113135 ] 

Ted Yu commented on HBASE-4455:
-------------------------------

All unit tests passed.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112264#comment-13112264 ] 

jiraposter@reviews.apache.org commented on HBASE-4455:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2007/
-----------------------------------------------------------

Review request for hbase.


Summary
-------

1. Add more logging.
2. Clean up CatalogTracker. waitForMeta waits for "timeout" value. When waitForMetaServerConnectionDefault is called by MetaNodeTracker, the timeout value is large. So it doesn't retry in case .ROOT. is updated; add the proper implementation for CatalogTracker.verifyMetaRegionLocation
4. Check for the latest -ROOT- and .META. region location during the handling of server shutdown.
5. Right after assigning the -ROOT- or .META. in ServerShutdownHandler, don't block and wait for .META. availability. Resubmit another ServerShutdownHandler for regular regions.


This addresses bug HBASE-4455.
    https://issues.apache.org/jira/browse/HBASE-4455


Diffs
-----

  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1172205 

Diff: https://reviews.apache.org/r/2007/diff


Testing
-------

Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. The program can run for couple hours until it stops. -ROOT- and .META. are available during that time.


Thanks,

Ming



> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113887#comment-13113887 ] 

jiraposter@reviews.apache.org commented on HBASE-4455:
------------------------------------------------------

bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > Great stuff!  I have some questions throughout but seems like this will make everything more resilient to root/meta servers failing.  Is the general approach to always verify / always check rather than relying on cached locations or values?
bq.  > 
bq.  > Have you thought about any ways that we could add some better unit tests around this stuff?  There's a TestRollingRestart that is obviously not good enough :)

The repro of such bug depends on timing of events. Initially I thought perhaps we can inject timeout into various places in the code. At this point, it is easier to just do the testing on a small cluster and eventually the bug will appear. Perhaps something we can work on later.

bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java, line 309
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45299#file45299line309>
bq.  >
bq.  >     why log the cached META server here?  didn't we just verify that it was not valid?

Fixed.

bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java, line 310
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45299#file45299line310>
bq.  >
bq.  >     why log the cached meta location here?  it might be confusing since it doesn't log that we just found this meta location was invalid

Fixed.

bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line 2532
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45300#file45300line2532>
bq.  >
bq.  >     add another * here, so: /**
bq.  >     
bq.  >     that ensure this gets picked up as javadoc

Fixed.

bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line 2558
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45300#file45300line2558>
bq.  >
bq.  >     this looks like a random debug statement, what does matchZK, sn: server mean?

Fixed.

bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java, lines 62-69
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45309#file45309line62>
bq.  >
bq.  >     why this change?  should this be rolled into the ZKNodeTracker rather than overriding the getData() behavior?

Fixed.

bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java, line 90
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45309#file45309line90>
bq.  >
bq.  >     it seems like you're covering up for bugs in the underlying ZKNodeTracker... can we fix that instead?  or if it's a matter of returning a cached value or not, can we just add a boolean flag for refresh/nocache?

Fixed.

bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java, line 294
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45299#file45299line294>
bq.  >
bq.  >     so we always verify the connection now?

Before the fix, all callers set it to "true". So there is no behavior change.

bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java, line 368
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45299#file45299line368>
bq.  >
bq.  >     why do we have two hard-coded timeouts in this area of code? :)
bq.  >     
bq.  >     this code seems to always sleep 500ms at a time unless you set timeout=0 and then it loops every 50ms?  that doesn't seem right... i could set timeout to 100ms and it would sleep 500ms.  sleeping 50ms every time would be better but there's probably a solution with less overhead (doing remote read queries every 50ms in a loop)
bq.  >     
bq.  >     could we just notifyAll() on metaAvailable whenever we relocate root?

Choose the min(500ms, timeout) at this point, given we might do more code cleanup around RootRegionTracker later. 

bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java, line 215
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45304#file45304line215>
bq.  >
bq.  >     i'm also a bit confused by this.  couldn't we just increase the thread pool size to 2? :)

Added more explanation in the ServerShutdownHandler.java about the scenario.

bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line 2533
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45300#file45300line2533>
bq.  >
bq.  >     what about this method is specific to the shutdown server?  this seems specific about regions in transition.  if we only use it in the context of servers being shut down then maybe name it accordingly?  it does seem like a generally useful method though and just related to ZK (could put it in a ZK util class?)

This method uses states in ZK and AssignmentManager. So it seems better to keep it in AssignmentManager. Keep the name given it might be useful outside shutdown scenario.

bq.  On 2011-09-23 08:17:29, Jonathan Gray wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java, line 144
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45306#file45306line144>
bq.  >
bq.  >     is this normal?  should it be a warn?  maybe a comment on why this would happen

This could happen when the RS shutdowns. When RS shutdowns, setClosedState will try to transition from CLOSING state to CLOSED. That will fail given the original state is OPENED instead of CLOSING.

Normally when AssignmentManager tries to close a region, it will first set the node to CLOSING before RPC call to RS. In that scenario, setClosedState will return successful.

- Ming

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2007/#review2037
-----------------------------------------------------------

On 2011-09-24 01:50:02, Ming Ma wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2007/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-09-24 01:50:02)
bq.  
bq.  
bq.  Review request for hbase.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  1. Add more logging.
bq.  2. Clean up CatalogTracker. waitForMeta waits for "timeout" value. When waitForMetaServerConnectionDefault is called by MetaNodeTracker, the timeout value is large. So it doesn't retry in case .ROOT. is updated; add the proper implementation for CatalogTracker.verifyMetaRegionLocation
bq.  4. Check for the latest -ROOT- and .META. region location during the handling of server shutdown.
bq.  5. Right after assigning the -ROOT- or .META. in ServerShutdownHandler, don't block and wait for .META. availability. Resubmit another ServerShutdownHandler for regular regions.
bq.  
bq.  
bq.  This addresses bug HBASE-4455.
bq.      https://issues.apache.org/jira/browse/HBASE-4455
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZooKeeperNodeTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ClusterStatusTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/MasterAddressTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1172205 
bq.  
bq.  Diff: https://reviews.apache.org/r/2007/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. The program can run for couple hours until it stops. -ROOT- and .META. are available during that time.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Ming
bq.  
bq.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113052#comment-13113052 ] 

jiraposter@reviews.apache.org commented on HBASE-4455:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2007/
-----------------------------------------------------------

(Updated 2011-09-23 00:15:11.441661)


Review request for hbase.


Changes
-------

Thanks folks for the review. I have fixed all the issues raised. The fix for the test failure is to make sure the deadserver-in-progress count is correct by calling this.deadServers.add(serverName) before MetaServerShutdownHandler schedules ServerShutdownHandler.

 


Summary
-------

1. Add more logging.
2. Clean up CatalogTracker. waitForMeta waits for "timeout" value. When waitForMetaServerConnectionDefault is called by MetaNodeTracker, the timeout value is large. So it doesn't retry in case .ROOT. is updated; add the proper implementation for CatalogTracker.verifyMetaRegionLocation
4. Check for the latest -ROOT- and .META. region location during the handling of server shutdown.
5. Right after assigning the -ROOT- or .META. in ServerShutdownHandler, don't block and wait for .META. availability. Resubmit another ServerShutdownHandler for regular regions.


This addresses bug HBASE-4455.
    https://issues.apache.org/jira/browse/HBASE-4455


Diffs (updated)
-----

  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1172205 

Diff: https://reviews.apache.org/r/2007/diff


Testing
-------

Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. The program can run for couple hours until it stops. -ROOT- and .META. are available during that time.


Thanks,

Ming



> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Resolved] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "Ming Ma (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma resolved HBASE-4455.
----------------------------

    Resolution: Fixed
    
> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "Ming Ma (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma reassigned HBASE-4455:
------------------------------

    Assignee: Ming Ma

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112345#comment-13112345 ] 

jiraposter@reviews.apache.org commented on HBASE-4455:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2007/#review2018
-----------------------------------------------------------

Nice work.
Debug log should be reduced.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4548>

    Indentation.
    This should be LOG.debug()

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4549>

    A few spaces before timeout would be nice.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4550>

    Should be 'carrying .META.'

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/2007/#comment4551>

    There should be a verb in this sentence.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/2007/#comment4552>

    This line can be wrapped to the line above.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/2007/#comment4553>

    If variable is named matchZK, it would be symmetrical with matchAM.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/2007/#comment4554>

    addressFromAM should be used here.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
<https://reviews.apache.org/r/2007/#comment4555>

    Personally I prefer enum over boolean.
    enum is more readable and expandable.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
<https://reviews.apache.org/r/2007/#comment4556>

    Server is repeated. And a few places below.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
<https://reviews.apache.org/r/2007/#comment4557>

    the is repeated.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
<https://reviews.apache.org/r/2007/#comment4558>

    Good.

- Ted

On 2011-09-22 00:38:16, Ming Ma wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2007/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-09-22 00:38:16)
bq.  
bq.  
bq.  Review request for hbase.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  1. Add more logging.
bq.  2. Clean up CatalogTracker. waitForMeta waits for "timeout" value. When waitForMetaServerConnectionDefault is called by MetaNodeTracker, the timeout value is large. So it doesn't retry in case .ROOT. is updated; add the proper implementation for CatalogTracker.verifyMetaRegionLocation
bq.  4. Check for the latest -ROOT- and .META. region location during the handling of server shutdown.
bq.  5. Right after assigning the -ROOT- or .META. in ServerShutdownHandler, don't block and wait for .META. availability. Resubmit another ServerShutdownHandler for regular regions.
bq.  
bq.  
bq.  This addresses bug HBASE-4455.
bq.      https://issues.apache.org/jira/browse/HBASE-4455
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1172205 
bq.  
bq.  Diff: https://reviews.apache.org/r/2007/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. The program can run for couple hours until it stops. -ROOT- and .META. are available during that time.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Ming
bq.  
bq.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113869#comment-13113869 ] 

jiraposter@reviews.apache.org commented on HBASE-4455:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2007/
-----------------------------------------------------------

(Updated 2011-09-24 01:50:02.986731)


Review request for hbase.


Changes
-------

Thanks folks for the review. I have fixes most of them. The below comments explain why the rests aren't fixed and some general questions asked.

1. I wondered if RootRegionTracker and MetaNodeTracker are really needed. Instead of waiting for ZK notification, checking with ZK directly should be ok. This won't have much impact on performance given most of the time where there isn't much regions movement. For now, we can keep RootRegionTracker and MetaNodeTracker. We can open a separate jira if that is needed.
2. Took out "refresh" parameter in CatalogTracker.getMetaServerConnection. All the callers of this function call the function will "true". So at this point, I just took it out.
3. Per Jonathan suggestions, modified ZookeeperNodeTracker to add refresh flag to getData, blockUntilAvailable.
4. About the question from Stack, yes, it looks like the same as 3809. it could be the same as 4245.
5. I put a more detailed description in ServerShutdownHandler.java about why we need to resubmit another ServerShutdownHandler request back to thread pool if the server carries -ROOT- or .META.
6. Regarding Jonathan's suggestion about relying on notifyAll() from -ROOT- inside waitForMeta, I just fixed the timeout value issue instead, in case later we decide RootRegionTracker isn't that useful.
7. Regarding Stack's HLogSplitting question, if the shutdown server carries -ROOT- or .META., it will first do HLogSplitting, and then resubmit another ServerShutdownHandler request for the same server which doesn't do HLogSplitting.


Summary
-------

1. Add more logging.
2. Clean up CatalogTracker. waitForMeta waits for "timeout" value. When waitForMetaServerConnectionDefault is called by MetaNodeTracker, the timeout value is large. So it doesn't retry in case .ROOT. is updated; add the proper implementation for CatalogTracker.verifyMetaRegionLocation
4. Check for the latest -ROOT- and .META. region location during the handling of server shutdown.
5. Right after assigning the -ROOT- or .META. in ServerShutdownHandler, don't block and wait for .META. availability. Resubmit another ServerShutdownHandler for regular regions.


This addresses bug HBASE-4455.
    https://issues.apache.org/jira/browse/HBASE-4455


Diffs (updated)
-----

  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZooKeeperNodeTracker.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ClusterStatusTracker.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/MasterAddressTracker.java 1172205 
  http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1172205 

Diff: https://reviews.apache.org/r/2007/diff


Testing
-------

Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. The program can run for couple hours until it stops. -ROOT- and .META. are available during that time.


Thanks,

Ming



> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114081#comment-13114081 ] 

Ted Yu commented on HBASE-4455:
-------------------------------

I am running whole test suite based on patch v3 and TRUNK. Will update with test results.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113886#comment-13113886 ] 

jiraposter@reviews.apache.org commented on HBASE-4455:
------------------------------------------------------

bq.  On 2011-09-23 05:44:11, Michael Stack wrote:
bq.  > Really good stuff.  I just need a bit of it explained to me below because I'm being a little slow.  I think you've also nailed HBASE-3809 with this patch Ming (What you think?)  Good stuff.

That sounds right. HBASE-4245 could be the same as well.

bq.  On 2011-09-23 05:44:11, Michael Stack wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java, line 62
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45308#file45308line62>
bq.  >
bq.  >     Whats this?

Fixed.

bq.  On 2011-09-23 05:44:11, Michael Stack wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java, line 299
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45299#file45299line299>
bq.  >
bq.  >     Why this change? If we did not ask to refresh, why not return what we found?

No callers call the function with refresh==false, took it out for now.

bq.  On 2011-09-23 05:44:11, Michael Stack wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java, line 222
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45299#file45299line222>
bq.  >
bq.  >     Why make this public?

Fixed.

bq.  On 2011-09-23 05:44:11, Michael Stack wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java, line 535
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45299#file45299line535>
bq.  >
bq.  >     nit: No need of the parens

Fixed.

bq.  On 2011-09-23 05:44:11, Michael Stack wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java, line 215
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45304#file45304line215>
bq.  >
bq.  >     Oh, here I see the don't split logs flag.
bq.  >     
bq.  >     Funny.
bq.  >     
bq.  >     We used to do something like this in the old days.
bq.  >     
bq.  >     So, we come back in again, we don't split logs, but we are still a server that was carrying root or meta -- we reassign again?  I don't get how it works here.

If the server carries -ROOT- or .META, it will first do the log splitting and then submit another ServerShutDownHandler request without shouldSplitHLog==false.

- Ming

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2007/#review2035
-----------------------------------------------------------

On 2011-09-24 01:50:02, Ming Ma wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2007/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-09-24 01:50:02)
bq.  
bq.  
bq.  Review request for hbase.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  1. Add more logging.
bq.  2. Clean up CatalogTracker. waitForMeta waits for "timeout" value. When waitForMetaServerConnectionDefault is called by MetaNodeTracker, the timeout value is large. So it doesn't retry in case .ROOT. is updated; add the proper implementation for CatalogTracker.verifyMetaRegionLocation
bq.  4. Check for the latest -ROOT- and .META. region location during the handling of server shutdown.
bq.  5. Right after assigning the -ROOT- or .META. in ServerShutdownHandler, don't block and wait for .META. availability. Resubmit another ServerShutdownHandler for regular regions.
bq.  
bq.  
bq.  This addresses bug HBASE-4455.
bq.      https://issues.apache.org/jira/browse/HBASE-4455
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZooKeeperNodeTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ClusterStatusTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/MasterAddressTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1172205 
bq.  
bq.  Diff: https://reviews.apache.org/r/2007/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. The program can run for couple hours until it stops. -ROOT- and .META. are available during that time.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Ming
bq.  
bq.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4455:
-------------------------

    Fix Version/s: 0.92.0

Marking for 0.92.0.  This patch is almost done and includes some dirty fixing.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116059#comment-13116059 ] 

Hudson commented on HBASE-4455:
-------------------------------

Integrated in HBase-0.92 #24 (See [https://builds.apache.org/job/HBase-0.92/24/])
    HBASE-4455 Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager (Ming Ma)

tedyu : 
Files : 
* /hbase/branches/0.92/CHANGES.txt
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/MasterAddressTracker.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/zookeeper/ClusterStatusTracker.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZooKeeperNodeTracker.java

                
> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114101#comment-13114101 ] 

Ted Yu commented on HBASE-4455:
-------------------------------

Only got one test failure: TestLogRolling.testLogRollOnPipelineRestart which was due to lack of the latest addendum from Dhruba.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112357#comment-13112357 ] 

jiraposter@reviews.apache.org commented on HBASE-4455:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2007/#review2019
-----------------------------------------------------------

looks overall pretty good - been meaning to do a similar change to split up shutdown processing but didn't find the time. A few nits below.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4559>

    typo: unavailable

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4560>

    debug

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4561>

    why catch this if you just rethrow it?

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
<https://reviews.apache.org/r/2007/#comment4562>

    should rename "splitHLog" to "shouldSplitHLog" -- since "splitHLog" could be read as past tense "alreadySplitHLog"

- Todd

On 2011-09-22 00:38:16, Ming Ma wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2007/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-09-22 00:38:16)
bq.  
bq.  
bq.  Review request for hbase.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  1. Add more logging.
bq.  2. Clean up CatalogTracker. waitForMeta waits for "timeout" value. When waitForMetaServerConnectionDefault is called by MetaNodeTracker, the timeout value is large. So it doesn't retry in case .ROOT. is updated; add the proper implementation for CatalogTracker.verifyMetaRegionLocation
bq.  4. Check for the latest -ROOT- and .META. region location during the handling of server shutdown.
bq.  5. Right after assigning the -ROOT- or .META. in ServerShutdownHandler, don't block and wait for .META. availability. Resubmit another ServerShutdownHandler for regular regions.
bq.  
bq.  
bq.  This addresses bug HBASE-4455.
bq.      https://issues.apache.org/jira/browse/HBASE-4455
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1172205 
bq.  
bq.  Diff: https://reviews.apache.org/r/2007/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. The program can run for couple hours until it stops. -ROOT- and .META. are available during that time.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Ming
bq.  
bq.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114439#comment-13114439 ] 

jiraposter@reviews.apache.org commented on HBASE-4455:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2007/#review2066
-----------------------------------------------------------

Ship it!

Great work.

I really really wish there were some unit tests added here.  We should be able to recreate the scenarios you describe in the comments in TestRollingRestart and prove it was broken and is now fixed.  Have you tried?

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
<https://reviews.apache.org/r/2007/#comment4662>

    Might want to add a reference to the jira here... ie. See HBASE-4455 for more info

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java
<https://reviews.apache.org/r/2007/#comment4663>

    nice!

- Jonathan

On 2011-09-24 01:50:02, Ming Ma wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2007/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-09-24 01:50:02)
bq.  
bq.  
bq.  Review request for hbase.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  1. Add more logging.
bq.  2. Clean up CatalogTracker. waitForMeta waits for "timeout" value. When waitForMetaServerConnectionDefault is called by MetaNodeTracker, the timeout value is large. So it doesn't retry in case .ROOT. is updated; add the proper implementation for CatalogTracker.verifyMetaRegionLocation
bq.  4. Check for the latest -ROOT- and .META. region location during the handling of server shutdown.
bq.  5. Right after assigning the -ROOT- or .META. in ServerShutdownHandler, don't block and wait for .META. availability. Resubmit another ServerShutdownHandler for regular regions.
bq.  
bq.  
bq.  This addresses bug HBASE-4455.
bq.      https://issues.apache.org/jira/browse/HBASE-4455
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZooKeeperNodeTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ClusterStatusTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/MasterAddressTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1172205 
bq.  
bq.  Diff: https://reviews.apache.org/r/2007/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. The program can run for couple hours until it stops. -ROOT- and .META. are available during that time.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Ming
bq.  
bq.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13115917#comment-13115917 ] 

Ted Yu commented on HBASE-4455:
-------------------------------

Integrated to 0.92 and TRUNK.

Thanks for the nice work, Ming.

Thanks for the review Todd, Jonathan, Ramkrishna and Stack.
                
> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113168#comment-13113168 ] 

jiraposter@reviews.apache.org commented on HBASE-4455:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2007/#review2035
-----------------------------------------------------------

Ship it!

Really good stuff.  I just need a bit of it explained to me below because I'm being a little slow.  I think you've also nailed HBASE-3809 with this patch Ming (What you think?)  Good stuff.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4585>

    Why make this public?

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4586>

    Why this change? If we did not ask to refresh, why not return what we found?

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4587>

    oh.  this is a dumb bug!  Good one Ming.  Ugh this is bad.  I'm in here because of hbase-3446 and I did not see this.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4588>

    Good

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4590>

    Good

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4589>

    nit: No need of the parens

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
<https://reviews.apache.org/r/2007/#comment4591>

    When would this be false?  I don't see it in your patch.  (NVM -- I see it later)

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
<https://reviews.apache.org/r/2007/#comment4592>

    We will always have split logs before we got here? (NVM I see point below)

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
<https://reviews.apache.org/r/2007/#comment4593>

    This looks like a good change.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
<https://reviews.apache.org/r/2007/#comment4594>

    Oh, here I see the don't split logs flag.

    Funny.

    We used to do something like this in the old days.

    So, we come back in again, we don't split logs, but we are still a server that was carrying root or meta -- we reassign again?  I don't get how it works here.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java
<https://reviews.apache.org/r/2007/#comment4595>

    Whats this?

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java
<https://reviews.apache.org/r/2007/#comment4596>

    Good.

- Michael

On 2011-09-23 00:15:11, Ming Ma wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2007/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-09-23 00:15:11)
bq.  
bq.  
bq.  Review request for hbase.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  1. Add more logging.
bq.  2. Clean up CatalogTracker. waitForMeta waits for "timeout" value. When waitForMetaServerConnectionDefault is called by MetaNodeTracker, the timeout value is large. So it doesn't retry in case .ROOT. is updated; add the proper implementation for CatalogTracker.verifyMetaRegionLocation
bq.  4. Check for the latest -ROOT- and .META. region location during the handling of server shutdown.
bq.  5. Right after assigning the -ROOT- or .META. in ServerShutdownHandler, don't block and wait for .META. availability. Resubmit another ServerShutdownHandler for regular regions.
bq.  
bq.  
bq.  This addresses bug HBASE-4455.
bq.      https://issues.apache.org/jira/browse/HBASE-4455
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1172205 
bq.  
bq.  Diff: https://reviews.apache.org/r/2007/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. The program can run for couple hours until it stops. -ROOT- and .META. are available during that time.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Ming
bq.  
bq.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13114336#comment-13114336 ] 

Ted Yu commented on HBASE-4455:
-------------------------------

If there is no major revision based on version 3, allow me to integrate Tuesday.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "Ted Yu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112359#comment-13112359 ] 

Ted Yu commented on HBASE-4455:
-------------------------------

I got a failed test:
{code}
testBasicRollingRestart(org.apache.hadoop.hbase.master.TestRollingRestart)  Time elapsed: 26.875 sec  <<< FAILURE!
java.lang.AssertionError: expected:<22> but was:<14>
        at org.junit.Assert.fail(Assert.java:91)
        at org.junit.Assert.failNotEquals(Assert.java:645)
        at org.junit.Assert.assertEquals(Assert.java:126)
        at org.junit.Assert.assertEquals(Assert.java:470)
        at org.junit.Assert.assertEquals(Assert.java:454)
        at org.apache.hadoop.hbase.master.TestRollingRestart.assertRegionsAssigned(TestRollingRestart.java:363)
        at org.apache.hadoop.hbase.master.TestRollingRestart.testBasicRollingRestart(TestRollingRestart.java:171)
{code}
Please check.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116001#comment-13116001 ] 

Hudson commented on HBASE-4455:
-------------------------------

Integrated in HBase-TRUNK #2262 (See [https://builds.apache.org/job/HBase-TRUNK/2262/])
    HBASE-4455  Rolling restart RSs scenario, -ROOT-, .META. regions are lost in
               AssignmentManager (Ming Ma)

tedyu : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/MasterAddressTracker.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ClusterStatusTracker.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZooKeeperNodeTracker.java

                
> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113884#comment-13113884 ] 

jiraposter@reviews.apache.org commented on HBASE-4455:
------------------------------------------------------

bq.  On 2011-09-23 03:02:39, Ted Yu wrote:
bq.  > http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line 2555
bq.  > <https://reviews.apache.org/r/2007/diff/2/?file=45300#file45300line2555>
bq.  >
bq.  >     addressFromZK != null can be omitted - it is the condition of if block.

Fixed.

- Ming

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2007/#review2033
-----------------------------------------------------------

On 2011-09-24 01:50:02, Ming Ma wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2007/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-09-24 01:50:02)
bq.  
bq.  
bq.  Review request for hbase.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  1. Add more logging.
bq.  2. Clean up CatalogTracker. waitForMeta waits for "timeout" value. When waitForMetaServerConnectionDefault is called by MetaNodeTracker, the timeout value is large. So it doesn't retry in case .ROOT. is updated; add the proper implementation for CatalogTracker.verifyMetaRegionLocation
bq.  4. Check for the latest -ROOT- and .META. region location during the handling of server shutdown.
bq.  5. Right after assigning the -ROOT- or .META. in ServerShutdownHandler, don't block and wait for .META. availability. Resubmit another ServerShutdownHandler for regular regions.
bq.  
bq.  
bq.  This addresses bug HBASE-4455.
bq.      https://issues.apache.org/jira/browse/HBASE-4455
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZooKeeperNodeTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ClusterStatusTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/MasterAddressTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1172205 
bq.  
bq.  Diff: https://reviews.apache.org/r/2007/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. The program can run for couple hours until it stops. -ROOT- and .META. are available during that time.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Ming
bq.  
bq.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113123#comment-13113123 ] 

jiraposter@reviews.apache.org commented on HBASE-4455:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2007/#review2033
-----------------------------------------------------------

Ship it!

One minor comment. Can be fixed at time of commit.

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/2007/#comment4581>

    addressFromZK != null can be omitted - it is the condition of if block.

- Ted

On 2011-09-23 00:15:11, Ming Ma wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2007/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-09-23 00:15:11)
bq.  
bq.  
bq.  Review request for hbase.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  1. Add more logging.
bq.  2. Clean up CatalogTracker. waitForMeta waits for "timeout" value. When waitForMetaServerConnectionDefault is called by MetaNodeTracker, the timeout value is large. So it doesn't retry in case .ROOT. is updated; add the proper implementation for CatalogTracker.verifyMetaRegionLocation
bq.  4. Check for the latest -ROOT- and .META. region location during the handling of server shutdown.
bq.  5. Right after assigning the -ROOT- or .META. in ServerShutdownHandler, don't block and wait for .META. availability. Resubmit another ServerShutdownHandler for regular regions.
bq.  
bq.  
bq.  This addresses bug HBASE-4455.
bq.      https://issues.apache.org/jira/browse/HBASE-4455
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1172205 
bq.  
bq.  Diff: https://reviews.apache.org/r/2007/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. The program can run for couple hours until it stops. -ROOT- and .META. are available during that time.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Ming
bq.  
bq.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112374#comment-13112374 ] 

jiraposter@reviews.apache.org commented on HBASE-4455:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2007/#review2020
-----------------------------------------------------------

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/2007/#comment4563>

    If the location in zk is the recent data do we still need to check with the AM address? if this zk data is null then we can check for AM address? Just a thought .

- ramkrishna

On 2011-09-22 00:38:16, Ming Ma wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2007/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-09-22 00:38:16)
bq.  
bq.  
bq.  Review request for hbase.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  1. Add more logging.
bq.  2. Clean up CatalogTracker. waitForMeta waits for "timeout" value. When waitForMetaServerConnectionDefault is called by MetaNodeTracker, the timeout value is large. So it doesn't retry in case .ROOT. is updated; add the proper implementation for CatalogTracker.verifyMetaRegionLocation
bq.  4. Check for the latest -ROOT- and .META. region location during the handling of server shutdown.
bq.  5. Right after assigning the -ROOT- or .META. in ServerShutdownHandler, don't block and wait for .META. availability. Resubmit another ServerShutdownHandler for regular regions.
bq.  
bq.  
bq.  This addresses bug HBASE-4455.
bq.      https://issues.apache.org/jira/browse/HBASE-4455
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1172205 
bq.  
bq.  Diff: https://reviews.apache.org/r/2007/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. The program can run for couple hours until it stops. -ROOT- and .META. are available during that time.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Ming
bq.  
bq.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4455) Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13113248#comment-13113248 ] 

jiraposter@reviews.apache.org commented on HBASE-4455:
------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2007/#review2037
-----------------------------------------------------------

Great stuff!  I have some questions throughout but seems like this will make everything more resilient to root/meta servers failing.  Is the general approach to always verify / always check rather than relying on cached locations or values?

Have you thought about any ways that we could add some better unit tests around this stuff?  There's a TestRollingRestart that is obviously not good enough :)

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4598>

    so we always verify the connection now?

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4597>

    why log the cached META server here?  didn't we just verify that it was not valid?

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4599>

    why log the cached meta location here?  it might be confusing since it doesn't log that we just found this meta location was invalid

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4601>

    nice catch

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4602>

    why do we have two hard-coded timeouts in this area of code? :)

    this code seems to always sleep 500ms at a time unless you set timeout=0 and then it loops every 50ms?  that doesn't seem right... i could set timeout to 100ms and it would sleep 500ms.  sleeping 50ms every time would be better but there's probably a solution with less overhead (doing remote read queries every 50ms in a loop)

    could we just notifyAll() on metaAvailable whenever we relocate root?

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java
<https://reviews.apache.org/r/2007/#comment4603>

    good

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/2007/#comment4604>

    add another * here, so: /**

    that ensure this gets picked up as javadoc

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/2007/#comment4605>

    what about this method is specific to the shutdown server?  this seems specific about regions in transition.  if we only use it in the context of servers being shut down then maybe name it accordingly?  it does seem like a generally useful method though and just related to ZK (could put it in a ZK util class?)

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<https://reviews.apache.org/r/2007/#comment4606>

    this looks like a random debug statement, what does matchZK, sn: server mean?

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java
<https://reviews.apache.org/r/2007/#comment4607>

    i'm also a bit confused by this.  couldn't we just increase the thread pool size to 2? :)

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
<https://reviews.apache.org/r/2007/#comment4608>

    is this normal?  should it be a warn?  maybe a comment on why this would happen

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java
<https://reviews.apache.org/r/2007/#comment4609>

    why this change?  should this be rolled into the ZKNodeTracker rather than overriding the getData() behavior?

http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java
<https://reviews.apache.org/r/2007/#comment4610>

    it seems like you're covering up for bugs in the underlying ZKNodeTracker... can we fix that instead?  or if it's a matter of returning a cached value or not, can we just add a boolean flag for refresh/nocache?

- Jonathan

On 2011-09-23 00:15:11, Ming Ma wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/2007/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-09-23 00:15:11)
bq.  
bq.  
bq.  Review request for hbase.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  1. Add more logging.
bq.  2. Clean up CatalogTracker. waitForMeta waits for "timeout" value. When waitForMetaServerConnectionDefault is called by MetaNodeTracker, the timeout value is large. So it doesn't retry in case .ROOT. is updated; add the proper implementation for CatalogTracker.verifyMetaRegionLocation
bq.  4. Check for the latest -ROOT- and .META. region location during the handling of server shutdown.
bq.  5. Right after assigning the -ROOT- or .META. in ServerShutdownHandler, don't block and wait for .META. availability. Resubmit another ServerShutdownHandler for regular regions.
bq.  
bq.  
bq.  This addresses bug HBASE-4455.
bq.      https://issues.apache.org/jira/browse/HBASE-4455
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RootRegionTracker.java 1172205 
bq.    http://svn.apache.org/repos/asf/hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperNodeTracker.java 1172205 
bq.  
bq.  Diff: https://reviews.apache.org/r/2007/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. The program can run for couple hours until it stops. -ROOT- and .META. are available during that time.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Ming
bq.  
bq.

> Rolling restart RSs scenario, -ROOT-, .META. regions are lost in AssignmentManager
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-4455
>                 URL: https://issues.apache.org/jira/browse/HBASE-4455
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.92.0
>
>
> Keep Master up all the time, do rolling restart of RSs like this - stop RS1, wait for 2 seconds, stop RS2, start RS1, wait for 2 seconds, stop RS3, start RS2, wait for 2 seconds, etc. After a while, you will find the -ROOT-, .META. regions aren't in "regions in transtion" from AssignmentManager point of view, but they aren't assigned to any regions. Here are the issues.
> 1. .-ROOT- or .META. location is stale when MetaServerShutdownHandler is invoked to check if it contains -ROOT- region. That is due to long delay from ZK notification and async nature of the system. Here is an example, even though new root region server sea-lab-1,60020,1316380133656 is set at T2, at T3 the shutdown process for sea-lab-1,60020,1316380133656, the root location still points to old server sea-lab-3,60020,1316380037898.
> T1: 2011-09-18 14:08:52,470 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-regio
> n-server and set watcher; sea-lab-3,60020,1316380037898
> T2: 2011-09-18 14:08:57,173 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Setting ROOT region location in ZooKeeper as sea-lab-1,60020,1316380133656
> T3: 2011-09-18 14:10:26,393 DEBUG org.apache.hadoop.hbase.master.ServerManager: Adde
> d=sea-lab-1,60020,1316380133656 to dead servers, submitted shutdown handler to be executed, root=false, meta=true, current Root Location: sea-lab-3,60020,1316380037898
> T4: 2011-09-18 14:12:37,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6
> 0000-0x1327e43175e0000 Retrieved 29 byte(s) of data from znode /hbase/root-region-server and set watcher; sea-lab-1,60020,1316380133656
> 2. The MetaServerShutdownHandler worker thread that waits for -ROOT- or .META. availability could be blocked. If meanwhile, the new server that -ROOT- or .META. is being assigned restarted, another instance of MetaServerShutdownHandler is queued. Eventually, all MetaServerShutdownHandler worker threads are filled up. It looks like HBASE-4245.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira