You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "nkeywal (Created) (JIRA)" <ji...@apache.org> on 2012/03/13 15:27:38 UTC

[jira] [Created] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

KeeperException.SessionExpiredException management could be improved in Master
------------------------------------------------------------------------------

                 Key: HBASE-5572
                 URL: https://issues.apache.org/jira/browse/HBASE-5572
             Project: HBase
          Issue Type: Improvement
          Components: master
    Affects Versions: 0.96.0
            Reporter: nkeywal
            Assignee: nkeywal
            Priority: Minor


Synthesis:
 1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
 2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
 3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
 4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.


Changes are:
 2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
 1) Removing TestMasterZKSessionRecovery


Detailed justification:
testMasterZKSessionRecoveryFailure says:

{noformat}
  /**
   * Negative test of master recovery from zk session expiry.
   *
   * Starts with one master. Fakes the master zk session expired.
   * Ensures the master cannot recover the expired zk session since
   * the master zk node is still there.
   */
  public void testMasterZKSessionRecoveryFailure() throws Exception {
    MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
    HMaster m = cluster.getMaster();
    m.abort("Test recovery from zk session expired",
      new KeeperException.SessionExpiredException());
    assertTrue(m.isStopped());
  }
{noformat}

This tests works, i.e. the assertion is always verified.
But do we really want this behavior?

When looking at the code, we see that this what's happening is strange:

- HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
- HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
- HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
- HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
- ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.

In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "nkeywal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228689#comment-13228689 ] 

nkeywal commented on HBASE-5572:
--------------------------------

for an unknown reason the first two patches didn't make it to hadoop-qa. Rewriting once again.
                
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.96.0
>
>         Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "nkeywal (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5572:
---------------------------

    Status: Open  (was: Patch Available)
    
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.96.0
>
>         Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231693#comment-13231693 ] 

Hudson commented on HBASE-5572:
-------------------------------

Integrated in HBase-TRUNK #2686 (See [https://builds.apache.org/job/HBase-TRUNK/2686/])
    HBASE-5549 HBASE-5572 Master can fail if ZooKeeper session expires (N Keywal) (Revision 1301775)

     Result = SUCCESS
tedyu : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWatcher.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/TestZooKeeper.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransactionOnCluster.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/replication/TestReplication.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationPeer.java

                
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.96.0
>
>         Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228598#comment-13228598 ] 

stack commented on HBASE-5572:
------------------------------

+1 on patch.  Waiting on hadoopqa...
                
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.96.0
>
>         Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "nkeywal (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5572:
---------------------------

    Status: Patch Available  (was: Open)
    
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.96.0
>
>         Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228872#comment-13228872 ] 

Hadoop QA commented on HBASE-5572:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12518238/5572.v2.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 5 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 161 new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1179//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1179//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1179//console

This message is automatically generated.
                
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.96.0
>
>         Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "nkeywal (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5572:
---------------------------

    Fix Version/s: 0.96.0
           Status: Patch Available  (was: Open)
    
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.96.0
>
>         Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13231854#comment-13231854 ] 

Hudson commented on HBASE-5572:
-------------------------------

Integrated in HBase-TRUNK-security #140 (See [https://builds.apache.org/job/HBase-TRUNK-security/140/])
    HBASE-5549 HBASE-5572 Master can fail if ZooKeeper session expires (N Keywal) (Revision 1301775)

     Result = SUCCESS
tedyu : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/RecoverableZooKeeper.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWatcher.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestingUtility.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/TestZooKeeper.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestDistributedLogSplitting.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransactionOnCluster.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/replication/TestReplication.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/replication/TestReplicationPeer.java

                
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.96.0
>
>         Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "Zhihong Yu (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5572:
------------------------------

      Resolution: Fixed
    Hadoop Flags: Reviewed
          Status: Resolved  (was: Patch Available)

Resolved as part of HBASE-5549
                
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.96.0
>
>         Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228487#comment-13228487 ] 

stack commented on HBASE-5572:
------------------------------

Above sounds good.  Would suggest we retain the test that verifies that we expire znode if same host and port.  That behavior can be useful in the single-master case.
                
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 5572.v1.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "nkeywal (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5572:
---------------------------

    Status: Open  (was: Patch Available)
    
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 5572.v1.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "nkeywal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228476#comment-13228476 ] 

nkeywal commented on HBASE-5572:
--------------------------------

Yes. I've done 3 modifications in the code, two like for like (hopefully!) and one with a different behavior. I:
- removed the variable named cleanSetOfActiveMaster, replaced by "return true" or "return false". 
- replaced the recursive call by a while(true) loop. 
- implicitly (it's hidden because there is no recursive call anymore) changed the function behavior: we now return the final result. For this reason the function behaves differently (we return true instead of false), but it's more on line with the method contract. This change breaks the testMasterZKSessionRecoveryFailure, because it does not fail anymore. 

TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure was testing explicitly the behavior with both SessionExpired AND master with same host & port. I removed it, but I can move it to TestZooKeeper (to save a cluster start/stop) and reverse the assertion in the test (now it does not fail).




                
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 5572.v1.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "nkeywal (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5572:
---------------------------

    Status: Open  (was: Patch Available)
    
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13228460#comment-13228460 ] 

stack commented on HBASE-5572:
------------------------------

@N Thanks for digging in.  So, it looks like your patch retains the behavior where if the current master has same host and port, we'll expire it, and then try and register ourselves (because we go around to the top of your new while loop)?  Is that so?  I believe we have a test to ensure this behavior IIRC.  Patch looks good.  +1.
                
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 5572.v1.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "nkeywal (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5572:
---------------------------

    Status: Patch Available  (was: Open)
    
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 5572.v1.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "nkeywal (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5572:
---------------------------

    Attachment: 5572.v2.patch
    
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.96.0
>
>         Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "nkeywal (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5572:
---------------------------

    Attachment: 5572.v2.patch
    
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.96.0
>
>         Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "nkeywal (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5572:
---------------------------

    Status: Patch Available  (was: Open)
    
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 5572.v1.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "nkeywal (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5572:
---------------------------

    Attachment: 5572.v2.patch
    
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 5572.v1.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "nkeywal (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230035#comment-13230035 ] 

nkeywal commented on HBASE-5572:
--------------------------------

@stack: it seems the patch could be committed?
                
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>             Fix For: 0.96.0
>
>         Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch, 5572.v2.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

Posted by "nkeywal (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

nkeywal updated HBASE-5572:
---------------------------

    Attachment: 5572.v1.patch
    
> KeeperException.SessionExpiredException management could be improved in Master
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-5572
>                 URL: https://issues.apache.org/jira/browse/HBASE-5572
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 5572.v1.patch
>
>
> Synthesis:
>  1) TestMasterZKSessionRecovery distinguish two cases on SessionExpiredException. One is explicitly not managed. However, is seems that there is no reason for this.
>  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a quite complex function, with a useless recursive call.
>  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is equivalent to TestZooKeeper#testMasterSessionExpired
>  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be removed if we merge the two cases mentioned above.
> Changes are:
>  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a single case and remove recursion
>  1) Removing TestMasterZKSessionRecovery
> Detailed justification:
> testMasterZKSessionRecoveryFailure says:
> {noformat}
>   /**
>    * Negative test of master recovery from zk session expiry.
>    *
>    * Starts with one master. Fakes the master zk session expired.
>    * Ensures the master cannot recover the expired zk session since
>    * the master zk node is still there.
>    */
>   public void testMasterZKSessionRecoveryFailure() throws Exception {
>     MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
>     HMaster m = cluster.getMaster();
>     m.abort("Test recovery from zk session expired",
>       new KeeperException.SessionExpiredException());
>     assertTrue(m.isStopped());
>   }
> {noformat}
> This tests works, i.e. the assertion is always verified.
> But do we really want this behavior?
> When looking at the code, we see that this what's happening is strange:
> - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false HMaster#abort stops the master.
> - HMaster#abortNow checks the exception type. As it's a SessionExpiredException it will try to recover, calling HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false (and that will make HMaster#abort stopping the master)
> - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and then try to become the active master. If it cannot, it will return false (and that will make HMaster#abort stopping the master).
> - HMaster#becomeActiveMaster returns the result of ActiveMasterManager#blockUntilBecomingActiveMaster. blockUntilBecomingActiveMaster says it will return false if there is any error preventing it to become the active master.
> - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master address. If it's the same port & host, it deletes the nodes, that will start a recursive call to blockUntilBecomingActiveMaster. This second call succeeds (we became the active master) and return true. This result is ignored by the first blockUntilBecomingActiveMaster: it return false (even if we actually became the active master), hence the whole suite call returns false and HMaster#abort stops the master.
> In other words, the comment says "Ensures the master cannot recover the expired zk session since the master zk node is still there." but we're actually doing a check just for this and deleting the node. If we were not ignoring the result, we would return "true", so we would not stop the master, so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira