Posted to issues@hbase.apache.org by "ramkrishna.s.vasudevan (JIRA)" <ji...@apache.org> on 2012/05/30 06:44:23 UTC

[jira] [Assigned] (HBASE-6122) Backup master does not become Active master after ZK exception

     [ https://issues.apache.org/jira/browse/HBASE-6122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan reassigned HBASE-6122:
---------------------------------------------

    Assignee: ramkrishna.s.vasudevan
    
> Backup master does not become Active master after ZK exception
> --------------------------------------------------------------
>
>                 Key: HBASE-6122
>                 URL: https://issues.apache.org/jira/browse/HBASE-6122
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>             Fix For: 0.92.2, 0.96.0, 0.94.1
>
>         Attachments: HBASE-6122_0.92.patch, HBASE-6122_0.94.patch
>
>
> -> Active master gets a ZK session expiry exception.
> -> Backup master becomes active.
> -> The previous active master retries and becomes the backup master.
> Now, when the new active master goes down and the current backup master tries to take over, it goes down again because of the ZK expiry exception it got in the first step.
> {code}
> if (abortNow(msg, t)) {
>   if (t != null) LOG.fatal(msg, t);
>   else LOG.fatal(msg);
>   this.abort = true;
>   stop("Aborting");
> }
> {code}
> In ActiveMasterManager.blockUntilBecomingActiveMaster we wait until the backup master becomes active.
> {code}
>     synchronized (this.clusterHasActiveMaster) {
>       while (this.clusterHasActiveMaster.get() && !this.master.isStopped()) {
>         try {
>           this.clusterHasActiveMaster.wait();
>         } catch (InterruptedException e) {
>           // We expect to be interrupted when a master dies, will fall out if so
>           LOG.debug("Interrupted waiting for master to die", e);
>         }
>       }
>       if (!clusterStatusTracker.isClusterUp()) {
>         this.master.stop("Cluster went down before this master became active");
>       }
>       if (this.master.isStopped()) {
>         return cleanSetOfActiveMaster;
>       }
>       // Try to become active master again now that there is no active master
>       blockUntilBecomingActiveMaster(startupStatus,clusterStatusTracker);
>     }
>     return cleanSetOfActiveMaster;
> {code}
> When the backup master (which is in backup mode because it got the ZK exception) once again tries to become active, we discard the return value of the recursive call:
> {code}
> // Try to become active master again now that there is no active master
>       blockUntilBecomingActiveMaster(startupStatus,clusterStatusTracker);
> {code}
> Instead we return the caller's own 'cleanSetOfActiveMaster', which was previously false.
> Because of this, instead of becoming active again, the backup master goes down in the abort() code.  Thanks to my colleague Gopi for reporting this issue.
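
The discarded return value described above can be reproduced in isolation. The following is only a minimal sketch, not the HBase code and not the attached patch: the class ReturnValueSketch, the helper tryCreateMasterNode, and the two becomeActive* methods are hypothetical stand-ins, and an AtomicBoolean takes the place of the ZooKeeper master znode. Only the variable name cleanSetOfActiveMaster comes from the report. It assumes the flag is false on the path where the first attempt to become active fails, as the description states.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: models only the return-value handling, not ZooKeeper or HBase.
class ReturnValueSketch {
    // Stand-in for the master znode: true while some master is registered as active.
    private final AtomicBoolean clusterHasActiveMaster;

    ReturnValueSketch(boolean activeMasterExists) {
        this.clusterHasActiveMaster = new AtomicBoolean(activeMasterExists);
    }

    // Stand-in for creating the ephemeral master node; succeeds only if none exists.
    private boolean tryCreateMasterNode() {
        return clusterHasActiveMaster.compareAndSet(false, true);
    }

    // Mirrors the reported behavior: the recursive call's result is thrown away.
    boolean becomeActiveBuggy() {
        if (tryCreateMasterNode()) {
            return true;
        }
        boolean cleanSetOfActiveMaster = false; // first attempt failed
        clusterHasActiveMaster.set(false);      // simulate the active master dying
        becomeActiveBuggy();                    // result discarded
        return cleanSetOfActiveMaster;          // stale false is returned
    }

    // One possible correction: propagate whatever the retry reports.
    boolean becomeActiveFixed() {
        if (tryCreateMasterNode()) {
            return true;
        }
        clusterHasActiveMaster.set(false);      // simulate the active master dying
        return becomeActiveFixed();             // retry result reaches the caller
    }

    public static void main(String[] args) {
        System.out.println("buggy: " + new ReturnValueSketch(true).becomeActiveBuggy());
        System.out.println("fixed: " + new ReturnValueSketch(true).becomeActiveFixed());
    }
}
```

In the buggy variant the retry succeeds, yet the caller still sees false and would treat the master as not having become active. The fixed variant returns the retry's own result.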

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira