Posted to dev@hbase.apache.org by Ryan Rawson <ry...@gmail.com> on 2009/03/24 07:10:38 UTC
Re: [jira] Commented: (HBASE-1232) zookeeper client wont reconnect if there is a problem
My issue I originally complained about was from the _client's_ point of view,
and the client doesn't actually create ephemeral nodes.
But the other problems stand.
On Mon, Mar 23, 2009 at 11:01 PM, Nitay Joffe (JIRA) <ji...@apache.org> wrote:
>
> [ https://issues.apache.org/jira/browse/HBASE-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688578#action_12688578 ]
>
> Nitay Joffe commented on HBASE-1232:
> ------------------------------------
>
> When a SessionExpired occurs, we lose our ephemeral nodes. This means
> everyone else in the cluster will think that node is down. To fix this we
> need to restart the node completely.
>
> For example, if the master's connection to ZooKeeper throws SessionExpired
> it loses its ephemeral address node in ZooKeeper and everyone will think the
> master has died. In fact, another master may come up now that we have the HA
> master lock.
>
> I'm writing the #restart() methods for HMaster and HRegionServer.
> Effectively it's just something like:
>
> {code}
> shutdown();
> run();
> {code}
>
> I notice that the shutdown/stop methods in those classes just set a flag
> which is later picked up and causes a shutdown. How do I make sure the
> server is actually shutdown between the shutdown() call and the run() call?
>
> > zookeeper client wont reconnect if there is a problem
> > -----------------------------------------------------
> >
> > Key: HBASE-1232
> > URL: https://issues.apache.org/jira/browse/HBASE-1232
> > Project: Hadoop HBase
> > Issue Type: Bug
> > Environment: java 1.7, zookeeper 3.0.1
> > Reporter: ryan rawson
> > Assignee: Nitay Joffe
> > Priority: Critical
> > Fix For: 0.20.0
> >
> >
> > my regionserver got wedged:
> > 2009-03-02 15:43:30,938 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to create /hbase:
> > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase
> >     at org.apache.zookeeper.KeeperException.create(KeeperException.java:87)
> >     at org.apache.zookeeper.KeeperException.create(KeeperException.java:35)
> >     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:482)
> >     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureExists(ZooKeeperWrapper.java:219)
> >     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureParentExists(ZooKeeperWrapper.java:240)
> >     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.checkOutOfSafeMode(ZooKeeperWrapper.java:328)
> >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:783)
> >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:468)
> >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:443)
> >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:518)
> >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:477)
> >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:450)
> >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocation(HConnectionManager.java:295)
> >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocationForRowWithRetries(HConnectionManager.java:919)
> >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:950)
> >     at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1370)
> >     at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1314)
> >     at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1294)
> >     at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:237)
> >     at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:216)
> >     at org.apache.hadoop.hbase.RegionHistorian.addRegionSplit(RegionHistorian.java:174)
> >     at org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:607)
> >     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:174)
> >     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:107)
> > this message repeats over and over.
> > Looking at the code in question:
> > private boolean ensureExists(final String znode) {
> >   try {
> >     zooKeeper.create(znode, new byte[0],
> >         Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
> >     LOG.debug("Created ZNode " + znode);
> >     return true;
> >   } catch (KeeperException.NodeExistsException e) {
> >     return true; // ok, move on.
> >   } catch (KeeperException.NoNodeException e) {
> >     return ensureParentExists(znode) && ensureExists(znode);
> >   } catch (KeeperException e) {
> >     LOG.warn("Failed to create " + znode + ":", e);
> >   } catch (InterruptedException e) {
> >     LOG.warn("Failed to create " + znode + ":", e);
> >   }
> >   return false;
> > }
> > We need to catch this exception specifically and reopen the ZK connection.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
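The question in the comment above (how to be sure the server has really stopped between shutdown() and run(), when stop methods only set a flag) can be sketched with a latch. This is a hypothetical illustration, not the HBase implementation; RestartableServer and its methods are invented names:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a server whose shutdown() only sets a flag, like the
// flag-based stop described in the comment. A latch lets restart() block
// until the main loop has actually exited before calling run() again.
public class RestartableServer implements Runnable {
  private volatile boolean stopRequested = false;
  private volatile CountDownLatch stopped = new CountDownLatch(1);

  public void run() {
    try {
      while (!stopRequested) {
        // ... main serving loop would go here ...
        break; // placeholder so this sketch terminates
      }
    } finally {
      stopped.countDown(); // signal that the loop has fully exited
    }
  }

  public void shutdown() {
    stopRequested = true; // only sets the flag, as in the original classes
  }

  // Restart only after the previous run() has observed the flag and exited.
  public boolean restart(long timeout, TimeUnit unit) throws InterruptedException {
    shutdown();
    if (!stopped.await(timeout, unit)) {
      return false; // old instance never finished shutting down
    }
    stopRequested = false;
    stopped = new CountDownLatch(1);
    run();
    return true;
  }
}
```

The point of the sketch is that the run loop, not the caller of shutdown(), is the only code that knows when shutdown is complete, so it must publish that fact (here via the latch) for restart() to wait on.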
Re: [jira] Commented: (HBASE-1232) zookeeper client wont reconnect if there is a problem
Posted by Nitay <ni...@gmail.com>.
Yeah, I'm handling all three cases (Master, RegionServer, Client) in the
same code. We could just let the Master/RegionServer fail on the
SessionExpired and have the user clean it up, but that seems ugly since it
is something we can handle.
On Mon, Mar 23, 2009 at 11:10 PM, Ryan Rawson <ry...@gmail.com> wrote:
> My issue i originally complained about was from the _clients_ point of view
> who doesnt actually create ephemeral nodes.
>
> But the other problems stand.
>
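The fix the thread converges on (catch the session-expired case specifically and reopen the connection, instead of looping forever on a dead handle as ensureExists() does) can be sketched independently of the real ZooKeeper API. All types here (Handle, HandleFactory, SessionExpired) are invented stand-ins so the retry shape is visible; they are not ZooKeeper or HBase classes:

```java
// Hypothetical sketch of "catch SessionExpired, reopen the connection, retry".
public class ReconnectingClient {
  static class SessionExpired extends Exception {}

  interface Handle {                 // stands in for a ZooKeeper handle
    boolean create(String znode) throws SessionExpired;
  }

  interface HandleFactory {          // stands in for opening a new session
    Handle open();
  }

  private final HandleFactory factory;
  private Handle handle;

  ReconnectingClient(HandleFactory factory) {
    this.factory = factory;
    this.handle = factory.open();
  }

  // Retry the operation, reopening the session when it has expired. An
  // expired session never recovers, so retrying on the old handle (as the
  // original ensureExists() effectively did) can only fail again.
  boolean ensureExists(String znode, int maxAttempts) {
    for (int i = 0; i < maxAttempts; i++) {
      try {
        return handle.create(znode);
      } catch (SessionExpired e) {
        handle = factory.open();     // the key fix: get a fresh session
      }
    }
    return false;
  }
}
```

Note that, as the comments above point out, reopening the session is only the client-side half: a server that held ephemeral nodes additionally has to recreate them (or restart) after the new session is established.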