
Posted to issues@hbase.apache.org by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org> on 2012/05/29 19:10:22 UTC

[jira] [Created] (HBASE-6122) Backup master does not become Active master after ZK exception

ramkrishna.s.vasudevan created HBASE-6122:
---------------------------------------------

             Summary: Backup master does not become Active master after ZK exception
                 Key: HBASE-6122
                 URL: https://issues.apache.org/jira/browse/HBASE-6122
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.94.0
            Reporter: ramkrishna.s.vasudevan
             Fix For: 0.96.0, 0.94.1


-> The active master gets a ZK session expiry exception.
-> The backup master becomes active.
-> The previous active master retries and becomes the backup master.
Now, when the new active master goes down and the current backup master tries to take over, it aborts again with the ZK expiry exception it got in the first step.

{code}
if (abortNow(msg, t)) {
  if (t != null) LOG.fatal(msg, t);
  else LOG.fatal(msg);
  this.abort = true;
  stop("Aborting");
}
{code}
In ActiveMasterManager.blockUntilBecomingActiveMaster we wait until the backup master becomes active.
{code}
    synchronized (this.clusterHasActiveMaster) {
      while (this.clusterHasActiveMaster.get() && !this.master.isStopped()) {
        try {
          this.clusterHasActiveMaster.wait();
        } catch (InterruptedException e) {
          // We expect to be interrupted when a master dies, will fall out if so
          LOG.debug("Interrupted waiting for master to die", e);
        }
      }
      if (!clusterStatusTracker.isClusterUp()) {
        this.master.stop("Cluster went down before this master became active");
      }
      if (this.master.isStopped()) {
        return cleanSetOfActiveMaster;
      }
      // Try to become active master again now that there is no active master
      blockUntilBecomingActiveMaster(startupStatus,clusterStatusTracker);
    }
    return cleanSetOfActiveMaster;
{code}
When the backup master (it is in backup mode because it got the ZK exception) once again tries to become active, we ignore the return value of the recursive call:
{code}
// Try to become active master again now that there is no active master
      blockUntilBecomingActiveMaster(startupStatus,clusterStatusTracker);
{code}
Instead we return the local 'cleanSetOfActiveMaster', which was set to false earlier.
Because of this, instead of becoming active again, the backup master goes down in the abort() code.  Thanks to my colleague Gopi for reporting this issue.
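
To make the failure mode concrete, here is a minimal, self-contained sketch. It is not the HBase code: the class ActiveMasterRetrySketch and both method names are hypothetical, and the ZK wait is replaced by a flag that flips once. It only shows how dropping the result of a recursive retry leaves the caller with a stale 'false', and how propagating that result is one plausible way to avoid it (the attached patches are not reproduced here).

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified stand-in for ActiveMasterManager; not the HBase code.
class ActiveMasterRetrySketch {
  // True while some other master currently holds the active role.
  private final AtomicBoolean clusterHasActiveMaster = new AtomicBoolean(true);

  // Buggy shape: the result of the retry is discarded.
  public boolean blockUntilActiveBuggy() {
    boolean cleanSetOfActiveMaster = false;
    if (!clusterHasActiveMaster.get()) {
      return true; // no active master left, so this master takes over
    }
    // Simulate the active master dying while we were waiting.
    clusterHasActiveMaster.set(false);
    blockUntilActiveBuggy();       // the retry succeeds, but its result is dropped
    return cleanSetOfActiveMaster; // still false, so the caller thinks we failed
  }

  // Assumed fix: hand the retry's result back to the caller.
  public boolean blockUntilActiveFixed() {
    if (!clusterHasActiveMaster.get()) {
      return true;
    }
    clusterHasActiveMaster.set(false);
    return blockUntilActiveFixed();
  }

  public static void main(String[] args) {
    System.out.println("buggy: " + new ActiveMasterRetrySketch().blockUntilActiveBuggy());
    System.out.println("fixed: " + new ActiveMasterRetrySketch().blockUntilActiveFixed());
  }
}
```

In this sketch the buggy variant reports false even though the inner call took over, which corresponds to the backup master falling into the abort() path instead of becoming active.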

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-6122:
------------------------------------------

    Attachment: HBASE-6122_0.92.patch

I found some changes in the trunk code, so I am not sure whether this applies to trunk.  Attached patches for 0.94 and 0.92.
                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.96.0, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch
>
>


[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285959#comment-13285959 ] 

Hudson commented on HBASE-6122:
-------------------------------

Integrated in HBase-0.92 #433 (See [https://builds.apache.org/job/HBase-0.92/433/])
    HBASE-6122 Backup master does not become Active master after ZK exception (Ram) (Revision 1344350)

     Result = FAILURE
ramkrishna : 
Files : 
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java

                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch
>
>


[jira] [Resolved] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan resolved HBASE-6122.
-------------------------------------------

      Resolution: Fixed
    Hadoop Flags: Reviewed
    
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch
>
>


[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286350#comment-13286350 ] 

stack commented on HBASE-6122:
------------------------------

@Ram Which assert should be changed?  Do you want to include the assert change in your patch?  Or are you suggesting a previous test case is broken?  If so, which?  Thanks.
                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122.patch, HBASE-6122_0.92.patch, HBASE-6122_0.94.patch, HBASE-6122_0.94.patch
>
>


[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286353#comment-13286353 ] 

ramkrishna.s.vasudevan commented on HBASE-6122:
-----------------------------------------------

I have attached the patch, Stack.  It changes the assert.
                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122.patch, HBASE-6122_0.92.patch, HBASE-6122_0.94.patch, HBASE-6122_0.94.patch
>
>


[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286298#comment-13286298 ] 

ramkrishna.s.vasudevan commented on HBASE-6122:
-----------------------------------------------

Oh... Let me check out the reason for the failure.  Sorry for the mess.
                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch
>
>


[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287296#comment-13287296 ] 

ramkrishna.s.vasudevan commented on HBASE-6122:
-----------------------------------------------

@N
The trunk code is different.  Currently there is a while(true) loop there, and as far as I can see it should be OK in trunk.
I did not try to reproduce this in trunk.
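
For readers without the trunk source at hand, the following is a rough sketch of why a loop-based retry does not have the stale-return problem. It is an illustration only and not the trunk implementation: LoopElectionSketch, blockUntilActive and the two suppliers are hypothetical names, and the real code would block on ZooKeeper between attempts rather than spin.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a loop-based retry; not the HBase trunk code.
class LoopElectionSketch {
  // tryClaim: one attempt to take the active role (assumed stand-in for the ZK call).
  // stopped: whether this master has been asked to stop.
  public static boolean blockUntilActive(BooleanSupplier tryClaim, BooleanSupplier stopped) {
    while (true) {
      if (stopped.getAsBoolean()) {
        return false; // give up without becoming active
      }
      if (tryClaim.getAsBoolean()) {
        return true;  // success is returned directly from the attempt that won
      }
      // Real code would wait here until the current active master goes away.
    }
  }

  public static void main(String[] args) {
    // Scripted attempts: lose twice, then win.
    Iterator<Boolean> attempts = Arrays.asList(false, false, true).iterator();
    boolean active = blockUntilActive(attempts::next, () -> false);
    System.out.println("became active: " + active);
  }
}
```

Because there is a single stack frame, the value returned is always the outcome of the latest attempt, so there is no outer 'false' left over to be returned by mistake.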
                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122.patch, HBASE-6122_0.92.patch, HBASE-6122_0.94.patch, HBASE-6122_0.94.patch
>
>


[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286235#comment-13286235 ] 

Hudson commented on HBASE-6122:
-------------------------------

Integrated in HBase-0.92 #435 (See [https://builds.apache.org/job/HBase-0.92/435/])
    HBASE-6122 Backup master does not become Active master after ZK exception: REVERT (Revision 1344466)

     Result = SUCCESS
stack : 
Files : 
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java

                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch
>
>


[jira] [Updated] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-6122:
------------------------------------------

    Fix Version/s:     (was: 0.96.0)
    
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch
>
>


[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286359#comment-13286359 ] 

stack commented on HBASE-6122:
------------------------------

Looks good Ram.  +1
                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122.patch, HBASE-6122_0.92.patch, HBASE-6122_0.94.patch, HBASE-6122_0.94.patch
>
>


[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286334#comment-13286334 ] 

ramkrishna.s.vasudevan commented on HBASE-6122:
-----------------------------------------------

I checked the test case.
Ideally the flow makes the master become active, but the problem described in this JIRA still makes the master go down.

I added a log in ActiveMasterManager.blockUntilBecomingActiveMaster:
{code}
{code}
{code}
2012-05-31 10:52:29,050 INFO  [pool-29-thread-1] master.ActiveMasterManager(149): Master is now available Htipl-01388.china.huawei.com,3569,1338441734226
2012-05-31 10:52:29,050 INFO  [pool-29-thread-1] master.ActiveMasterManager(151): Master=Htipl-01388.china.huawei.com,3569,1338441734226
{code}
                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch
>
>


[jira] [Comment Edited] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286353#comment-13286353 ] 

ramkrishna.s.vasudevan edited comment on HBASE-6122 at 5/31/12 6:09 AM:
------------------------------------------------------------------------

I have attached the patch, Stack.  It changes the assert in the test case TestMasterZKSessionRecovery.testMasterZKSessionRecoveryFailure.
                
      was (Author: ram_krish):
    I have attached the patch, Stack.  It changes the assert.
                  
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122.patch, HBASE-6122_0.92.patch, HBASE-6122_0.94.patch, HBASE-6122_0.94.patch
>
>


[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285823#comment-13285823 ] 

ramkrishna.s.vasudevan commented on HBASE-6122:
-----------------------------------------------

Committed to 0.92 and 0.94.
Thanks for the review, Lars.
                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch
>
>


[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287278#comment-13287278 ] 

nkeywal commented on HBASE-6122:
--------------------------------

@ram
bq. I found some changes in the trunk code. So not sure if it is applicable in trunk. Attached patches for 0.94 and 0.92.

Do you mean that the problem is not reproducible on trunk?
                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122.patch, HBASE-6122_0.92.patch, HBASE-6122_0.94.patch, HBASE-6122_0.94.patch
>
>


[jira] [Comment Edited] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286334#comment-13286334 ] 

ramkrishna.s.vasudevan edited comment on HBASE-6122 at 5/31/12 5:26 AM:
------------------------------------------------------------------------

I checked the test case.
Ideally the flow makes the master become active, but the problem described in this JIRA still makes the master go down.

I added a log in ActiveMasterManager.blockUntilBecomingActiveMaster:
{code}
        LOG.info("Master is now available "+this.sn);
        this.clusterHasActiveMaster.set(true);
        LOG.info("Master=" + this.sn);
        return cleanSetOfActiveMaster;
{code}
See the log lines below.
{code}
2012-05-31 10:52:29,050 INFO  [pool-29-thread-1] master.ActiveMasterManager(149): Master is now available Htipl-01388.china.huawei.com,3569,1338441734226
2012-05-31 10:52:29,050 INFO  [pool-29-thread-1] master.ActiveMasterManager(151): Master=Htipl-01388.china.huawei.com,3569,1338441734226
{code}
This means the master should ideally come up as long as nothing prevents it from becoming active again.  Along with the patch, this test case should be modified to change the assertTrue to an assertFalse.

Please correct me if I am wrong.  The fix still remains valid.
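
To make the failure mode concrete, here is a minimal, self-contained sketch. It is not the HBase code and not the attached patch: ElectionSketch and both method names are hypothetical, the queue of outcomes stands in for the ZooKeeper attempts to register as active master, and the wait on clusterHasActiveMaster is elided. It only contrasts discarding the return value of the retry with propagating it, which is one plausible shape of the fix.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for the election loop; not the HBase class. */
final class ElectionSketch {
    // Outcome of each attempt to take the master role, in order:
    // true = this attempt won, false = another master was already active.
    private final Deque<Boolean> attempts = new ArrayDeque<>();

    ElectionSketch(boolean... outcomes) {
        for (boolean outcome : outcomes) {
            attempts.addLast(outcome);
        }
    }

    /** Shape of the reported bug: the result of the retry is thrown away. */
    boolean becomeActiveDiscardingRetry() {
        boolean cleanSetOfActiveMaster = true;
        if (attempts.isEmpty()) {
            return false; // models "master was stopped while waiting"
        }
        if (attempts.removeFirst()) {
            return cleanSetOfActiveMaster; // became active on this attempt
        }
        cleanSetOfActiveMaster = false;
        // (waiting for the other master to die is elided here)
        becomeActiveDiscardingRetry();     // retry may succeed...
        return cleanSetOfActiveMaster;     // ...but the stale false is returned
    }

    /** Assumed shape of a fix: hand the result of the retry to the caller. */
    boolean becomeActivePropagatingRetry() {
        if (attempts.isEmpty()) {
            return false;
        }
        if (attempts.removeFirst()) {
            return true;
        }
        // (waiting for the other master to die is elided here)
        return becomeActivePropagatingRetry();
    }
}
```

With the outcomes (false, true), that is, one failed attempt followed by a successful retry, the first variant returns false even though the retry succeeded, while the second returns true.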
                
      was (Author: ram_krish):
    I checked the test case.
Ideally the flow makes the master become active, but the problem described in this JIRA still makes the master go down.

I added a log in ActiveMasterManager.blockUntilBecomingActiveMaster:
{code}
{code}
{code}
2012-05-31 10:52:29,050 INFO  [pool-29-thread-1] master.ActiveMasterManager(149): Master is now available Htipl-01388.china.huawei.com,3569,1338441734226
2012-05-31 10:52:29,050 INFO  [pool-29-thread-1] master.ActiveMasterManager(151): Master=Htipl-01388.china.huawei.com,3569,1338441734226
{code}
                  
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch
>
>


[jira] [Updated] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-6122:
------------------------------------------

    Fix Version/s: 0.92.2
    
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.96.0, 0.94.1
>
>


[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286765#comment-13286765 ] 

Hudson commented on HBASE-6122:
-------------------------------

Integrated in HBase-0.94 #240 (See [https://builds.apache.org/job/HBase-0.94/240/])
    HBASE-6122 Backup master does not become Active master after ZK exception (Ram) (Revision 1344798)

     Result = SUCCESS
ramkrishna : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122.patch, HBASE-6122_0.92.patch, HBASE-6122_0.94.patch, HBASE-6122_0.94.patch
>
>


[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287164#comment-13287164 ] 

Hudson commented on HBASE-6122:
-------------------------------

Integrated in HBase-0.94-security #33 (See [https://builds.apache.org/job/HBase-0.94-security/33/])
    HBASE-6122 Backup master does not become Active master after ZK exception (Ram) (Revision 1344798)
HBASE-6122 Backup master does not become Active master after ZK exception: REVERT (Revision 1344467)
HBASE-6122 Backup master does not become Active master after ZK exception (Ram) (Revision 1344348)

     Result = FAILURE
ramkrishna : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

stack : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java

ramkrishna : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java

                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122.patch, HBASE-6122_0.92.patch, HBASE-6122_0.94.patch, HBASE-6122_0.94.patch
>
>

[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286072#comment-13286072 ] 

stack commented on HBASE-6122:
------------------------------

I reverted this from the 0.92 and 0.94 branches until we figure out the failures.
                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch
>
>

[jira] [Resolved] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan resolved HBASE-6122.
-------------------------------------------

    Resolution: Fixed

Committed again with the latest patch to 0.92 and 0.94.  Hopefully things are OK this time. Thanks, Stack, for your review.
                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122.patch, HBASE-6122_0.92.patch, HBASE-6122_0.94.patch, HBASE-6122_0.94.patch
>
>

[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285871#comment-13285871 ] 

Hudson commented on HBASE-6122:
-------------------------------

Integrated in HBase-0.94 #233 (See [https://builds.apache.org/job/HBase-0.94/233/])
    HBASE-6122 Backup master does not become Active master after ZK exception (Ram) (Revision 1344348)

     Result = FAILURE
ramkrishna : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java

                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch
>
>

[jira] [Closed] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lars Hofhansl closed HBASE-6122.
--------------------------------

    
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch, HBASE-6122_0.94.patch, HBASE-6122.patch
>
>

[jira] [Updated] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-6122:
------------------------------------------

    Attachment: HBASE-6122_0.94.patch

Please take a look at the patch.  I have not corrected the test case for the testcase to pass.  Ideally the previous test was covering up the bug.  Correct me if I am wrong.
                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122.patch, HBASE-6122_0.92.patch, HBASE-6122_0.94.patch, HBASE-6122_0.94.patch
>
>

[jira] [Updated] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-6122:
------------------------------------------

    Attachment: HBASE-6122.patch
    
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122.patch, HBASE-6122_0.92.patch, HBASE-6122_0.94.patch, HBASE-6122_0.94.patch
>
>

[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286166#comment-13286166 ] 

Hudson commented on HBASE-6122:
-------------------------------

Integrated in HBase-0.94 #236 (See [https://builds.apache.org/job/HBase-0.94/236/])
    HBASE-6122 Backup master does not become Active master after ZK exception: REVERT (Revision 1344467)

     Result = SUCCESS
stack : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java

                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch
>
>

[jira] [Reopened] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack reopened HBASE-6122:
--------------------------


Reopening.  Backing out these patches.  They seem responsible for these failures:
https://builds.apache.org/job/HBase-0.92/433/
                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch
>
>

[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "nkeywal (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287306#comment-13287306 ] 

nkeywal commented on HBASE-6122:
--------------------------------

Thanks, I will give it a try to be sure.
                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122.patch, HBASE-6122_0.92.patch, HBASE-6122_0.94.patch, HBASE-6122_0.94.patch
>
>
> -> Active master gets ZK expiry exception.
> -> Backup master becomes active.
> -> The previous active master retries and becomes the back up master.
> Now when the new active master goes down and the current back up master comes up, it goes down again with the zk expiry exception it got in the first step.
> {code}
> if (abortNow(msg, t)) {
>       if (t != null) LOG.fatal(msg, t);
>       else LOG.fatal(msg);
>       this.abort = true;
>       stop("Aborting");
>     }
> {code}
> In ActiveMasterManager.blockUntilBecomingActiveMaster we try to wait till the back up master becomes active. 
> {code}
>     synchronized (this.clusterHasActiveMaster) {
>       while (this.clusterHasActiveMaster.get() && !this.master.isStopped()) {
>         try {
>           this.clusterHasActiveMaster.wait();
>         } catch (InterruptedException e) {
>           // We expect to be interrupted when a master dies, will fall out if so
>           LOG.debug("Interrupted waiting for master to die", e);
>         }
>       }
>       if (!clusterStatusTracker.isClusterUp()) {
>         this.master.stop("Cluster went down before this master became active");
>       }
>       if (this.master.isStopped()) {
>         return cleanSetOfActiveMaster;
>       }
>       // Try to become active master again now that there is no active master
>       blockUntilBecomingActiveMaster(startupStatus,clusterStatusTracker);
>     }
>     return cleanSetOfActiveMaster;
> {code}
> When the backup master (it is in backup mode because it got the ZK exception) once again tries to become active, we do not use the return value of the recursive call:
> {code}
> // Try to become active master again now that there is no active master
>       blockUntilBecomingActiveMaster(startupStatus,clusterStatusTracker);
> {code}
> Instead we return the local 'cleanSetOfActiveMaster', which was previously false.
> Because of this, instead of becoming active again, the backup master goes down in the abort() code. Thanks to Gopi, my colleague, for reporting this issue.
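
The ignored return value described in the quoted report can be reproduced in a small stand-alone model. The sketch below is not the HBase source and not the attached patch: the class ActiveMasterRetrySketch, its two methods and the queue of attempt outcomes are hypothetical names made up for illustration, and the waiting on clusterHasActiveMaster is left out. It only shows why a recursive retry whose result is dropped still reports the stale 'false', and one plausible shape of a fix (returning the result of the retry).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Toy model of the retry logic in the report; hypothetical, not HBase code. */
public class ActiveMasterRetrySketch {

    // Outcome of each attempt to take the active-master role:
    // true = this master won the role, false = another master holds it.
    private final Deque<Boolean> attempts;

    public ActiveMasterRetrySketch(Boolean... outcomes) {
        this.attempts = new ArrayDeque<>(Arrays.asList(outcomes));
    }

    /** Buggy shape: the result of the recursive retry is discarded. */
    public boolean blockUntilActiveBuggy() {
        boolean cleanSetOfActiveMaster = attempts.pop();
        if (cleanSetOfActiveMaster || attempts.isEmpty()) {
            return cleanSetOfActiveMaster;
        }
        // (The real method waits here until the active master goes away.)
        blockUntilActiveBuggy();        // return value ignored
        return cleanSetOfActiveMaster;  // still the stale 'false'
    }

    /** Assumed fix: propagate whatever the retry reports. */
    public boolean blockUntilActiveFixed() {
        boolean cleanSetOfActiveMaster = attempts.pop();
        if (cleanSetOfActiveMaster || attempts.isEmpty()) {
            return cleanSetOfActiveMaster;
        }
        return blockUntilActiveFixed();
    }

    public static void main(String[] args) {
        // First attempt loses (another master is active), the retry wins.
        System.out.println(new ActiveMasterRetrySketch(false, true).blockUntilActiveBuggy()); // false
        System.out.println(new ActiveMasterRetrySketch(false, true).blockUntilActiveFixed()); // true
    }
}
```

In the buggy variant the caller sees 'false' even though the retry succeeded, which matches the report's observation that the backup master ends up in abort() instead of becoming active.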

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan reassigned HBASE-6122:
---------------------------------------------

    Assignee: ramkrishna.s.vasudevan
    
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.96.0, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch
>
>


[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13286844#comment-13286844 ] 

Hudson commented on HBASE-6122:
-------------------------------

Integrated in HBase-0.92 #439 (See [https://builds.apache.org/job/HBase-0.92/439/])
    HBASE-6122 Backup master does not become Active master after ZK exception (Ram) (Revision 1344799)

     Result = FAILURE
ramkrishna : 
Files : 
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122.patch, HBASE-6122_0.92.patch, HBASE-6122_0.94.patch, HBASE-6122_0.94.patch
>
>


[jira] [Updated] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-6122:
------------------------------------------

    Attachment: HBASE-6122_0.94.patch
    
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.96.0, 0.94.1
>
>         Attachments: HBASE-6122_0.94.patch
>
>


[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287235#comment-13287235 ] 

Hudson commented on HBASE-6122:
-------------------------------

Integrated in HBase-0.92-security #109 (See [https://builds.apache.org/job/HBase-0.92-security/109/])
    HBASE-6122 Backup master does not become Active master after ZK exception (Ram) (Revision 1344799)
HBASE-6122 Backup master does not become Active master after ZK exception: REVERT (Revision 1344466)
HBASE-6122 Backup master does not become Active master after ZK exception (Ram) (Revision 1344350)

     Result = SUCCESS
ramkrishna : 
Files : 
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/master/TestMasterZKSessionRecovery.java

stack : 
Files : 
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java

ramkrishna : 
Files : 
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/master/ActiveMasterManager.java

                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.94.1
>
>         Attachments: HBASE-6122.patch, HBASE-6122_0.92.patch, HBASE-6122_0.94.patch, HBASE-6122_0.94.patch
>
>


[jira] [Commented] (HBASE-6122) Backup master does not become Active master after ZK exception

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13284961#comment-13284961 ] 

Lars Hofhansl commented on HBASE-6122:
--------------------------------------

+1 patch looks good to me.
                
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.96.0, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch
>
>
