Posted to dev@hbase.apache.org by Jonathan Gray <jg...@apache.org> on 2010/10/08 01:34:04 UTC

Review Request: HBASE-2700 Unit test of master failover while regions in transition

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/995/
-----------------------------------------------------------

Review request for hbase and stack.


Summary
-------

First go at a unit test of master failover with regions in transition.

Comment from the test method:

  /**
   * Complex test of master failover that tests as many permutations of the
   * different possible states that regions in transition could be in within ZK.
   * <p>
   * This tests the proper handling of these states by the failed-over master
   * and includes a thorough testing of the timeout code as well.
   * <p>
   * Starts with a single master and three regionservers.
   * <p>
   * Creates two tables, enabledTable and disabledTable, each containing 5
   * regions.  The disabledTable is then disabled.
   * <p>
   * After reaching steady-state, the master is killed.  We then mock several
   * states in ZK.
   * <p>
   * After mocking them, we will start up a new master which should become the
   * active master and also detect that it is a failover.  The primary test
   * passing condition will be that all regions of the enabled table are
   * assigned and all the regions of the disabled table are not assigned.
   * <p>
   * The different scenarios to be tested are below:
   * <p>
   * <b>ZK State:  OFFLINE</b>
   * <p>A node can get into OFFLINE state if</p>
   * <ul>
   * <li>An RS fails to open a region, so it reverts the state back to OFFLINE
   * <li>The Master is assigning the region to an RS before it sends the RPC
   * </ul>
   * <p>We will mock the scenarios</p>
   * <ul>
   * <li>Master has assigned an enabled region but RS failed so a region is
   *     not assigned anywhere and is sitting in ZK as OFFLINE</li>
   * <li>This seems to cover both cases?</li>
   * </ul>
   * <p>
   * <b>ZK State:  CLOSING</b>
   * <p>A node can get into CLOSING state if</p>
   * <ul>
   * <li>An RS has begun to close a region
   * </ul>
   * <p>We will mock the scenarios</p>
   * <ul>
   * <li>Region was being closed but the RS died before finishing the close
   * <li>Region of enabled table was being closed but did not complete
   * <li>Region of disabled table was being closed but did not complete
   * </ul>
   * <p>
   * <b>ZK State:  CLOSED</b>
   * <p>A node can get into CLOSED state if</p>
   * <ul>
   * <li>An RS has completed closing a region but the master has not acknowledged it yet
   * </ul>
   * <p>We will mock the scenarios</p>
   * <ul>
   * <li>Region of a table that should be enabled was closed on an RS
   * <li>Region of a table that should be disabled was closed on an RS
   * </ul>
   * <p>
   * <b>ZK State:  OPENING</b>
   * <p>A node can get into OPENING state if</p>
   * <ul>
   * <li>An RS has begun to open a region
   * </ul>
   * <p>We will mock the scenarios</p>
   * <ul>
   * <li>RS was opening a region of enabled table but never finished
   * </ul>
   * <p>
   * <b>ZK State:  OPENED</b>
   * <p>A node can get into OPENED state if</p>
   * <ul>
   * <li>An RS has finished opening a region but the master has not acknowledged it yet
   * </ul>
   * <p>We will mock the scenarios</p>
   * <ul>
   * <li>Region of a table that should be enabled was opened on an RS
   * <li>Region of a table that should be disabled was opened on an RS
   * <li>Region of a table that should be enabled was opened by a now-dead RS
   * <li>Region of a table that should be disabled was opened by a now-dead RS
   * </ul>
   * <p>
   * <b>ZK State:  NONE</b>
   * <p>A region can have no transition node if</p>
   * <ul>
   * <li>The server hosting the region died and no master processed it
   * </ul>
   * <p>We will mock the scenarios</p>
   * <ul>
   * <li>Region of enabled table was on a dead RS that was not yet processed
   * <li>Region of disabled table was on a dead RS that was not yet processed
   * </ul>
   * @throws Exception
   */
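
As a rough illustration of the scenario list in the comment above, here is a minimal, self-contained sketch of the state/table combinations and the primary passing condition. It is not code from the patch and does not use any HBase classes: FailoverScenarioModel, ZkState, Scenario and the method names are all made up for this example, and the live-RS versus dead-RS variants are deliberately collapsed into one entry per (state, table) pair.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the mocked scenarios; names are illustrative, not HBase APIs. */
final class FailoverScenarioModel {

  /** ZK states named in the Javadoc; NONE means the region has no transition node. */
  enum ZkState { OFFLINE, CLOSING, CLOSED, OPENING, OPENED, NONE }

  /** One mocked region: its ZK state and whether its table is enabled. */
  static final class Scenario {
    final ZkState state;
    final boolean tableEnabled;

    Scenario(ZkState state, boolean tableEnabled) {
      this.state = state;
      this.tableEnabled = tableEnabled;
    }

    @Override
    public String toString() {
      return state + "/" + (tableEnabled ? "enabledTable" : "disabledTable");
    }
  }

  /** Primary passing condition: a region ends up assigned iff its table is enabled. */
  static boolean shouldEndUpAssigned(Scenario s) {
    return s.tableEnabled;
  }

  /**
   * Builds one scenario per (state, table) pair listed in the Javadoc.
   * OFFLINE and OPENING are only listed for the enabled table.
   */
  static List<Scenario> mockedScenarios() {
    List<Scenario> scenarios = new ArrayList<>();
    for (ZkState state : ZkState.values()) {
      scenarios.add(new Scenario(state, true));
      boolean listedForDisabledTable =
          state != ZkState.OFFLINE && state != ZkState.OPENING;
      if (listedForDisabledTable) {
        scenarios.add(new Scenario(state, false));
      }
    }
    return scenarios;
  }

  public static void main(String[] args) {
    for (Scenario s : mockedScenarios()) {
      System.out.println(s + " -> assigned after failover: " + shouldEndUpAssigned(s));
    }
  }
}
```

The point of the sketch is only that the expected outcome depends on the table, not on which transition state the region was caught in.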


This addresses bug HBASE-2700.
    http://issues.apache.org/jira/browse/HBASE-2700


Diffs
-----

  trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1005264 
  trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1005264 
  trunk/src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java 1005264 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1005264 
  trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java 1005264 
  trunk/src/test/java/org/apache/hadoop/hbase/MiniHBaseCluster.java 1005264 
  trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java 1005264 

Diff: http://review.cloudera.org/r/995/diff


Testing
-------

running the unit test!


Thanks,

Jonathan


Re: Review Request: HBASE-2700 Unit test of master failover while regions in transition

Posted by Jonathan Gray <jg...@apache.org>.

> On 2010-10-18 12:00:11, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line 192
> > <http://review.cloudera.org/r/995/diff/5/?file=14788#file14788line192>
> >
> >     This looks like important change.

Yes, it is.  All of these methods have some more javadoc about what's going on.


> On 2010-10-18 12:00:11, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line 104
> > <http://review.cloudera.org/r/995/diff/5/?file=14788#file14788line104>
> >
> >     Why change access?

I'm accessing it in the unit tests (this is how I can control which RS a region goes to).


> On 2010-10-18 12:00:11, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line 862
> > <http://review.cloudera.org/r/995/diff/5/?file=14788#file14788line862>
> >
> >     What are you doing here?  Just noting that our state transitions are a little confused at this point?  Maybe add a 'FIX' to the end of the log message to make clear it's a transition that is unexpected.

This is not an unexpected transition.  It's just an interesting case, worthy of logging at DEBUG :)  This is for when a CLOSING times out and you are forcing the send of an additional close RPC.
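
To illustrate the case being described, here is a small sketch of a timeout check that picks out regions stuck in CLOSING so that another close RPC can be sent. It is an assumption-laden illustration, not the AssignmentManager code under review: ClosingTimeoutSketch, RegionInTransition, regionsNeedingCloseResend and the millisecond parameters are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a timeout check for regions stuck in CLOSING; not the patch's code. */
final class ClosingTimeoutSketch {

  /** Only CLOSING matters here; every other state is lumped into OTHER. */
  enum State { CLOSING, OTHER }

  /** A region in transition and the time (ms) its state was last updated. */
  static final class RegionInTransition {
    final String regionName;
    final State state;
    final long lastUpdateMillis;

    RegionInTransition(String regionName, State state, long lastUpdateMillis) {
      this.regionName = regionName;
      this.state = state;
      this.lastUpdateMillis = lastUpdateMillis;
    }
  }

  /** Returns the regions whose CLOSING state is older than timeoutMillis. */
  static List<String> regionsNeedingCloseResend(
      List<RegionInTransition> inTransition, long nowMillis, long timeoutMillis) {
    List<String> resend = new ArrayList<>();
    for (RegionInTransition rit : inTransition) {
      boolean timedOut = nowMillis - rit.lastUpdateMillis > timeoutMillis;
      if (rit.state == State.CLOSING && timedOut) {
        // An expected but interesting case: worth a DEBUG message, not an error.
        resend.add(rit.regionName);
      }
    }
    return resend;
  }

  public static void main(String[] args) {
    List<RegionInTransition> rit = new ArrayList<>();
    rit.add(new RegionInTransition("regionA", State.CLOSING, 0L));
    rit.add(new RegionInTransition("regionB", State.CLOSING, 900L));
    rit.add(new RegionInTransition("regionC", State.OTHER, 0L));
    System.out.println(regionsNeedingCloseResend(rit, 1000L, 500L));
  }
}
```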


- Jonathan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/995/#review1554
-----------------------------------------------------------


Re: Review Request: HBASE-2700 Unit test of master failover while regions in transition

Posted by st...@duboce.net.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/995/#review1554
-----------------------------------------------------------

Ship it!


+1 on commit.  A few things to address on commit are below.


trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<http://review.cloudera.org/r/995/#comment5250>

    Why change access?



trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<http://review.cloudera.org/r/995/#comment5251>

    This looks like important change.



trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<http://review.cloudera.org/r/995/#comment5252>

    What are you doing here?  Just noting that our state transitions are a little confused at this point?  Maybe add a 'FIX' to the end of the log message to make clear it's a transition that is unexpected.


- stack


Re: Review Request: HBASE-2700 Unit test of master failover while regions in transition

Posted by Jonathan Gray <jg...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/995/
-----------------------------------------------------------

(Updated 2010-10-18 11:05:52.404802)


Review request for hbase and stack.


Changes
-------

Small cleanup of comments and whitespace.


Summary
-------

First go at a unit test of master failover with regions in transition.


This addresses bug HBASE-2700.
    http://issues.apache.org/jira/browse/HBASE-2700


Diffs (updated)
-----

  trunk/src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java 1023927 
  trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1023927 
  trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1023927 
  trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1023927 
  trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ClosedRegionHandler.java 1023927 
  trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1023927 
  trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1023927 
  trunk/src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java 1023927 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1023927 
  trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java 1023927 
  trunk/src/test/java/org/apache/hadoop/hbase/MiniHBaseCluster.java 1023927 
  trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java 1023927 

Diff: http://review.cloudera.org/r/995/diff


Testing
-------

running the unit test!


Thanks,

Jonathan


Re: Review Request: HBASE-2700 Unit test of master failover while regions in transition

Posted by Jonathan Gray <jg...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/995/
-----------------------------------------------------------

(Updated 2010-10-18 11:01:23.149541)


Review request for hbase and stack.


Changes
-------

Finishes the third unit test, which has regions in transition (RIT) in addition to a failure of an RS that happens while no master is around.

All three unit tests are passing for me in Eclipse and on the command line (simple failover, RIT failover, RIT+RS failover).

I think this should be enough to resolve HBASE-2700 and maybe some others.


Summary
-------

First go at a unit test of master failover with regions in transition.


This addresses bug HBASE-2700.
    http://issues.apache.org/jira/browse/HBASE-2700


Diffs (updated)
-----

  trunk/src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java 1023927 
  trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1023927 
  trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1023927 
  trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1023927 
  trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ClosedRegionHandler.java 1023927 
  trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1023927 
  trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1023927 
  trunk/src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java 1023927 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1023927 
  trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java 1023927 
  trunk/src/test/java/org/apache/hadoop/hbase/MiniHBaseCluster.java 1023927 
  trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java 1023927 

Diff: http://review.cloudera.org/r/995/diff


Testing
-------

running the unit test!


Thanks,

Jonathan


Re: Review Request: HBASE-2700 Unit test of master failover while regions in transition

Posted by Jonathan Gray <jg...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/995/
-----------------------------------------------------------

(Updated 2010-10-10 17:11:47.360208)


Review request for hbase and stack.


Changes
-------

Fixed the remaining case.  We now check whether an opened region belongs to a disabled table and, if so, close it.
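
The check can be pictured roughly as below. This is only a sketch under stated assumptions: OpenedRegionDecisionSketch, Action and onOpened are hypothetical names, the set of disabled table names is passed in directly, and none of it is taken from the patch or from HBase's API.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical decision made when a region is found in OPENED; not the patch's code. */
final class OpenedRegionDecisionSketch {

  /** What to do with a region that a regionserver reports as opened. */
  enum Action { MARK_ONLINE, CLOSE_REGION }

  /** A region of a disabled table must not stay open, so it gets closed. */
  static Action onOpened(String tableName, Set<String> disabledTables) {
    return disabledTables.contains(tableName) ? Action.CLOSE_REGION : Action.MARK_ONLINE;
  }

  public static void main(String[] args) {
    Set<String> disabled = Collections.singleton("disabledTable");
    System.out.println(onOpened("enabledTable", disabled));
    System.out.println(onOpened("disabledTable", disabled));
  }
}
```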

The next step is doing a hard kill of an RS while no master is alive, but this is committable if anyone wants to review it and make sure it passes on Hudson.


Summary
-------

First go at a unit test of master failover with regions in transition.

Comment from the test method:

  /**
   * Complex test of master failover that tests as many permutations of the
   * different possible states that regions in transition could be in within ZK.
   * <p>
   * This tests the proper handling of these states by the failed-over master
   * and includes a thorough testing of the timeout code as well.
   * <p>
   * Starts with a single master and three regionservers.
   * <p>
   * Creates two tables, enabledTable and disabledTable, each containing 5
   * regions.  The disabledTable is then disabled.
   * <p>
   * After reaching steady-state, the master is killed.  We then mock several
   * states in ZK.
   * <p>
   * After mocking them, we will start up a new master which should become the
   * active master and also detect that it is a failover.  The primary test
   * passing condition will be that all regions of the enabled table are
   * assigned and all the regions of the disabled table are not assigned.
   * <p>
   * The different scenarios to be tested are below:
   * <p>
   * <b>ZK State:  OFFLINE</b>
   * <p>A node can get into OFFLINE state if</p>
   * <ul>
   * <li>An RS fails to open a region, so it reverts the state back to OFFLINE
   * <li>The Master is assigning the region to an RS before it sends the RPC
   * </ul>
   * <p>We will mock the scenarios</p>
   * <ul>
   * <li>Master has assigned an enabled region but RS failed so a region is
   *     not assigned anywhere and is sitting in ZK as OFFLINE</li>
   * <li>This seems to cover both cases?</li>
   * </ul>
   * <p>
   * <b>ZK State:  CLOSING</b>
   * <p>A node can get into CLOSING state if</p>
   * <ul>
   * <li>An RS has begun to close a region
   * </ul>
   * <p>We will mock the scenarios</p>
   * <ul>
   * <li>Region was being closed but the RS died before finishing the close
   * <li>Region of enabled table was being closed but did not complete
   * <li>Region of disabled table was being closed but did not complete
   * </ul>
   * <p>
   * <b>ZK State:  CLOSED</b>
   * <p>A node can get into CLOSED state if</p>
   * <ul>
   * <li>An RS has completed closing a region but not acknowledged by master yet
   * </ul>
   * <p>We will mock the scenarios</p>
   * <ul>
   * <li>Region of a table that should be enabled was closed on an RS
   * <li>Region of a table that should be disabled was closed on an RS
   * </ul>
   * <p>
   * <b>ZK State:  OPENING</b>
   * <p>A node can get into OPENING state if</p>
   * <ul>
   * <li>An RS has begun to open a region
   * </ul>
   * <p>We will mock the scenarios</p>
   * <ul>
   * <li>RS was opening a region of enabled table but never finishes
   * </ul>
   * <p>
   * <b>ZK State:  OPENED</b>
   * <p>A node can get into OPENED state if</p>
   * <ul>
   * <li>An RS has finished opening a region but not acknowledged by master yet
   * </ul>
   * <p>We will mock the scenarios</p>
   * <ul>
   * <li>Region of a table that should be enabled was opened on an RS
   * <li>Region of a table that should be disabled was opened on an RS
   * <li>Region of a table that should be enabled was opened by a now-dead RS
   * <li>Region of a table that should be disabled was opened by a now-dead RS
   * </ul>
   * <p>
   * <b>ZK State:  NONE</b>
   * <p>A region might not have a transition node if</p>
   * <ul>
   * <li>The server hosting the region died and no master processed it
   * </ul>
   * <p>We will mock the scenarios</p>
   * <ul>
   * <li>Region of enabled table was on a dead RS that was not yet processed
   * <li>Region of disabled table was on a dead RS that was not yet processed
   * </ul>
   * @throws Exception
   */
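
As a rough illustration of the primary passing condition in the comment above (this is not code from TestMasterFailover), the expected end state depends only on whether the region's table is enabled, whatever state was mocked in ZK. All names below are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the pass condition; not taken from the patch. */
class FailoverPassCondition {
  /** The situations listed in the comment; NONE means no transition node exists. */
  enum ZkState { OFFLINE, CLOSING, CLOSED, OPENING, OPENED, NONE }

  static final class MockedRegion {
    final String name;
    final ZkState mockedState;
    final boolean tableEnabled;

    MockedRegion(String name, ZkState mockedState, boolean tableEnabled) {
      this.name = name;
      this.mockedState = mockedState;
      this.tableEnabled = tableEnabled;
    }
  }

  /**
   * Returns the regions whose assignment after failover does not match the
   * expectation: enabled-table regions assigned, disabled-table regions not.
   */
  static List<String> violations(List<MockedRegion> regions, Set<String> assignedAfterFailover) {
    List<String> bad = new ArrayList<>();
    for (MockedRegion r : regions) {
      boolean assigned = assignedAfterFailover.contains(r.name);
      if (assigned != r.tableEnabled) {
        bad.add(r.name + " (" + r.mockedState + ")");
      }
    }
    return bad;
  }

  public static void main(String[] args) {
    List<MockedRegion> regions = Arrays.asList(
        new MockedRegion("enabled-1", ZkState.OFFLINE, true),
        new MockedRegion("enabled-2", ZkState.OPENING, true),
        new MockedRegion("disabled-1", ZkState.CLOSING, false),
        new MockedRegion("disabled-2", ZkState.OPENED, false));
    Set<String> assigned = new HashSet<>(Arrays.asList("enabled-1", "enabled-2"));
    System.out.println("violations: " + violations(regions, assigned));
  }
}
```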


This addresses bug HBASE-2700.
    http://issues.apache.org/jira/browse/HBASE-2700


Diffs (updated)
-----

  trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1006362 
  trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1006362 
  trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1006362 
  trunk/src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java 1006362 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1006362 
  trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java 1006362 
  trunk/src/test/java/org/apache/hadoop/hbase/MiniHBaseCluster.java 1006362 
  trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java 1006362 

Diff: http://review.cloudera.org/r/995/diff


Testing
-------

running the unit test!


Thanks,

Jonathan


Re: Review Request: HBASE-2700 Unit test of master failover while regions in transition

Posted by Jonathan Gray <jg...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/995/
-----------------------------------------------------------

(Updated 2010-10-08 14:05:53.966038)


Review request for hbase and stack.


Changes
-------

Fixes more of the test.  It passes!

The one test in here that is currently commented out covers a region of a disabled table that is in the OPENED state.  The new master comes up and finishes out the OPEN, but our open handling doesn't check whether the table is actually supposed to be disabled (at which point we'd trigger an unassign).

Should we handle that case?  Could it actually happen?  Just don't want to add lots of unnecessary checks everywhere.


Summary
-------

First go at a unit test of master failover with regions in transition.


This addresses bug HBASE-2700.
    http://issues.apache.org/jira/browse/HBASE-2700


Diffs (updated)
-----

  trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1005264 
  trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1005264 
  trunk/src/main/java/org/apache/hadoop/hbase/util/JVMClusterUtil.java 1005264 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1005264 
  trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java 1005264 
  trunk/src/test/java/org/apache/hadoop/hbase/MiniHBaseCluster.java 1005264 
  trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java 1005264 

Diff: http://review.cloudera.org/r/995/diff


Testing
-------

running the unit test!


Thanks,

Jonathan


Re: Review Request: HBASE-2700 Unit test of master failover while regions in transition

Posted by Jonathan Gray <jg...@apache.org>.

> On 2010-10-08 13:43:59, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java, line 267
> > <http://review.cloudera.org/r/995/diff/1/?file=14445#file14445line267>
> >
> >     When would this happen?

   * <b>ZK State:  OFFLINE</b>
   * <p>A node can get into OFFLINE state if</p>
   * <ul>
   * <li>An RS fails to open a region, so it reverts the state back to OFFLINE
   * <li>The Master is assigning the region to an RS before it sends RPC
   * </ul>
   * <p>We will mock the scenarios</p>
   * <ul>
   * <li>Master has assigned an enabled region but RS failed so a region is
   *     not assigned anywhere and is sitting in ZK as OFFLINE</li>
   * <li>This seems to cover both cases?</li>
   * </ul>


> On 2010-10-08 13:43:59, stack wrote:
> > trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java, line 675
> > <http://review.cloudera.org/r/995/diff/1/?file=14448#file14448line675>
> >
> >     Don't we have this in AssignmentManager already?
> >     isRegionsInTransition I believe it's called.
> >     
> >     There is white space added at end of the two @throws lines.

This tests ZK, not the RIT map on the master.  So for unit tests, you're testing two different things.  Since I'm mocking data up in ZK, I wanted to ensure nothing is left in ZK.
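
To make the distinction concrete, here is a small sketch in which the ZK transition nodes and the master's in-memory RIT map are modeled as two separate collections. Everything in it is a stand-in invented for illustration; it does not reflect the actual ZKAssign or AssignmentManager code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: two independent views of "regions in transition". */
class TransitionViews {
  // Stand-in for the transition nodes a test has created in ZK.
  final Set<String> zkTransitionNodes = new HashSet<>();
  // Stand-in for the master's in-memory regions-in-transition map.
  final Map<String, String> masterRit = new HashMap<>();

  /** True when no transition nodes remain in the ZK stand-in. */
  boolean zkIsClean() {
    return zkTransitionNodes.isEmpty();
  }

  /** True when the master's in-memory view has no regions in transition. */
  boolean masterRitIsEmpty() {
    return masterRit.isEmpty();
  }

  public static void main(String[] args) {
    TransitionViews views = new TransitionViews();
    // A node mocked directly in ZK is invisible to a master that has not
    // processed it yet, so the two checks can disagree.
    views.zkTransitionNodes.add("region-abc");
    System.out.println("master RIT empty: " + views.masterRitIsEmpty());
    System.out.println("ZK clean: " + views.zkIsClean());
  }
}
```

Checking only the in-memory map would let a leftover mocked node go unnoticed, which is why the test asserts on ZK directly.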


> On 2010-10-08 13:43:59, stack wrote:
> > trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java, line 462
> > <http://review.cloudera.org/r/995/diff/1/?file=14451#file14451line462>
> >
> >     What about the case where not all regions have been assigned -- say the master was killed mid-startup before all regions mentioned in .META. had been assigned by master?  There should be a fixup where we compare the difference?  Can we even handle this case?  We'd need to ask RSs what they are holding?

IMO we don't need to support this (for now).  I think it is acceptable to require that nothing fails during startup.  If the master dies or an RS dies during initial startup, you have to restart.  RS deaths may even work fine, but I think it's okay to have a SPOF during startup.


- Jonathan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/995/#review1496
-----------------------------------------------------------




Re: Review Request: HBASE-2700 Unit test of master failover while regions in transition

Posted by st...@duboce.net.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/995/#review1496
-----------------------------------------------------------


Test looks great.  There are a few comments below.


trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
<http://review.cloudera.org/r/995/#comment5132>

    When would this happen?



trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java
<http://review.cloudera.org/r/995/#comment5133>

    Don't we have this in AssignmentManager already?
    isRegionsInTransition I believe it's called.
    
    There is white space added at end of the two @throws lines.



trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java
<http://review.cloudera.org/r/995/#comment5134>

    What about the case where not all regions have been assigned -- say the master was killed mid-startup before all regions mentioned in .META. had been assigned by master?  There should be a fixup where we compare the difference?  Can we even handle this case?  We'd need to ask RSs what they are holding?


- stack

