Posted to dev@hbase.apache.org by Ryan Rawson <ry...@gmail.com> on 2009/03/24 07:10:38 UTC

Re: [jira] Commented: (HBASE-1232) zookeeper client wont reconnect if there is a problem

The issue I originally complained about was from the _client's_ point of view:
the client doesn't actually create ephemeral nodes.

But the other problems stand.

On Mon, Mar 23, 2009 at 11:01 PM, Nitay Joffe (JIRA) <ji...@apache.org>wrote:

>
>    [
> https://issues.apache.org/jira/browse/HBASE-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688578#action_12688578]
>
> Nitay Joffe commented on HBASE-1232:
> ------------------------------------
>
> When a SessionExpired occurs we will lose our ephemeral nodes. This means
> everyone else in the cluster will think that node is down. To fix this we
> need to restart the node completely.
>
> For example, if the master's connection to ZooKeeper throws SessionExpired
> it loses its ephemeral address node in ZooKeeper and everyone will think the
> master has died. In fact, another master may come up now that we have the HA
> master lock.
>
> I'm writing the #restart() methods for HMaster and HRegionServer.
> Effectively it's just something like:
>
> {code}
>  shutdown();
>  run();
> {code}
>
> I notice that the shutdown/stop methods in those classes just set a flag
> which is later picked up and causes a shutdown. How do I make sure the
> server is actually shutdown between the shutdown() call and the run() call?
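Since stop/shutdown only sets a flag, one way to answer the question above is to have the main loop signal when it has really exited, and have restart() block on that signal. A minimal self-contained sketch (Server, mainLoop, and restart here are illustrative stand-ins, not the actual HMaster/HRegionServer API):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Flag-based server whose restart() waits until the run loop has actually
// drained before starting a fresh cycle.
class Server implements Runnable {
    private volatile boolean stopRequested = false;
    // Replaced on each restart so restart() can wait on the current cycle.
    private volatile CountDownLatch stopped = new CountDownLatch(1);

    public void run() {
        try {
            while (!stopRequested) {
                mainLoop();
            }
        } finally {
            stopped.countDown(); // signal: the loop has fully exited
        }
    }

    private void mainLoop() {
        try { Thread.sleep(10); } catch (InterruptedException ignored) {}
    }

    public void shutdown() { stopRequested = true; }

    /** Blocks until the previous run has finished, then starts a new one. */
    public void restart() throws InterruptedException {
        shutdown();
        stopped.await(5, TimeUnit.SECONDS); // wait for the loop to drain
        stopRequested = false;
        stopped = new CountDownLatch(1);
        new Thread(this).start();
    }
}
```

The latch (or an equivalent join on the worker threads) is what guarantees the server is actually down between the shutdown() call and the run() call, rather than merely flagged for shutdown.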
>
> > zookeeper client wont reconnect if there is a problem
> > -----------------------------------------------------
> >
> >                 Key: HBASE-1232
> >                 URL: https://issues.apache.org/jira/browse/HBASE-1232
> >             Project: Hadoop HBase
> >          Issue Type: Bug
> >         Environment: java 1.7, zookeeper 3.0.1
> >            Reporter: ryan rawson
> >            Assignee: Nitay Joffe
> >            Priority: Critical
> >             Fix For: 0.20.0
> >
> >
> > my regionserver got wedged:
> > 2009-03-02 15:43:30,938 WARN
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to create /hbase:
> > org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /hbase
> >         at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:87)
> >         at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:35)
> >         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:482)
> >         at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureExists(ZooKeeperWrapper.java:219)
> >         at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureParentExists(ZooKeeperWrapper.java:240)
> >         at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.checkOutOfSafeMode(ZooKeeperWrapper.java:328)
> >         at
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:783)
> >         at
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:468)
> >         at
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:443)
> >         at
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:518)
> >         at
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:477)
> >         at
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:450)
> >         at
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocation(HConnectionManager.java:295)
> >         at
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocationForRowWithRetries(HConnectionManager.java:919)
> >         at
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:950)
> >         at
> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1370)
> >         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1314)
> >         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1294)
> >         at
> org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:237)
> >         at
> org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:216)
> >         at
> org.apache.hadoop.hbase.RegionHistorian.addRegionSplit(RegionHistorian.java:174)
> >         at
> org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:607)
> >         at
> org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:174)
> >         at
> org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:107)
> > this message repeats over and over.
> > Looking at the code in question:
> >   private boolean ensureExists(final String znode) {
> >     try {
> >       zooKeeper.create(znode, new byte[0],
> >                        Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
> >       LOG.debug("Created ZNode " + znode);
> >       return true;
> >     } catch (KeeperException.NodeExistsException e) {
> >       return true;      // ok, move on.
> >     } catch (KeeperException.NoNodeException e) {
> >       return ensureParentExists(znode) && ensureExists(znode);
> >     } catch (KeeperException e) {
> >       LOG.warn("Failed to create " + znode + ":", e);
> >     } catch (InterruptedException e) {
> >       LOG.warn("Failed to create " + znode + ":", e);
> >     }
> >     return false;
> >   }
> > We need to catch this exception specifically and reopen the ZK
> connection.
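The proposed fix, in outline: a session that has thrown SessionExpired is permanently dead, so instead of retrying on the same handle forever, tear it down and open a fresh one. A self-contained sketch of that retry shape (Session, SessionFactory, connect(), and this SessionExpiredException are hypothetical stand-ins; the real code would catch org.apache.zookeeper.KeeperException.SessionExpiredException and construct a new ZooKeeper instance):

```java
// Reconnect-on-expiry pattern: rebuild the session handle when it expires.
class ZkReconnector {
    static class SessionExpiredException extends Exception {}

    interface Session { void create(String znode) throws SessionExpiredException; }
    interface SessionFactory { Session connect(); }

    private final SessionFactory factory;
    private Session session;
    int reconnects = 0; // exposed only for illustration

    ZkReconnector(SessionFactory factory) {
        this.factory = factory;
        this.session = factory.connect();
    }

    boolean ensureExists(String znode, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            try {
                session.create(znode);
                return true;
            } catch (SessionExpiredException e) {
                // The old handle is dead for good: open a fresh session
                // instead of looping on the same error forever.
                session = factory.connect();
                reconnects++;
            }
        }
        return false;
    }
}
```

The key difference from the quoted ensureExists is that the generic `catch (KeeperException e)` branch no longer swallows the expiry; the expired session is replaced before the operation is retried.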
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>

Re: [jira] Commented: (HBASE-1232) zookeeper client wont reconnect if there is a problem

Posted by Nitay <ni...@gmail.com>.
Yeah, I'm handling all three cases (Master, RegionServer, Client) in the
same code. We could just let the Master/RegionServer fail on the
SessionExpired and have the user clean it up, but that seems ugly since it
is something we can handle.

On Mon, Mar 23, 2009 at 11:10 PM, Ryan Rawson <ry...@gmail.com> wrote:

> The issue I originally complained about was from the _client's_ point of view:
> the client doesn't actually create ephemeral nodes.
>
> But the other problems stand.
>
> On Mon, Mar 23, 2009 at 11:01 PM, Nitay Joffe (JIRA) <jira@apache.org> wrote:
>
> [quoted JIRA comment trimmed; identical to the message above]