Posted to dev@hbase.apache.org by "Nitay Joffe (JIRA)" <ji...@apache.org> on 2009/03/24 10:13:55 UTC

[jira] Issue Comment Edited: (HBASE-1232) zookeeper client wont reconnect if there is a problem

    [ https://issues.apache.org/jira/browse/HBASE-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688578#action_12688578 ] 

Nitay Joffe edited comment on HBASE-1232 at 3/24/09 2:12 AM:
-------------------------------------------------------------

When a SessionExpired event occurs, we lose our ephemeral nodes. This means everyone else in the cluster will think that node is down. To fix this we need to restart the node completely.

For example, if the master's connection to ZooKeeper throws SessionExpired, it loses its ephemeral address node in ZooKeeper and everyone will think the master has died. In fact, another master may take over, now that we have the HA master lock.
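A minimal sketch of the restart-on-expiry idea, in plain Java. The Restartable interface and ExpiryWatcher class are hypothetical stand-ins for HMaster/HRegionServer and their ZooKeeper watcher; only the Expired/Disconnected distinction mirrors real ZooKeeper semantics:

{code}
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for HMaster/HRegionServer: both would expose
// shutdown() and run() so a watcher can restart them on session expiry.
interface Restartable {
    void shutdown();
    void run();
}

public class ExpiryWatcher {
    // Mirrors the ZooKeeper Watcher.Event.KeeperState values we care about.
    enum KeeperState { SyncConnected, Disconnected, Expired }

    private final Restartable server;

    ExpiryWatcher(Restartable server) { this.server = server; }

    // In the real system this would be called from the ZooKeeper event thread.
    public void process(KeeperState state) {
        if (state == KeeperState.Expired) {
            // Ephemeral nodes are gone; the rest of the cluster already
            // considers this node dead, so restart it from scratch.
            server.shutdown();
            server.run();
        }
        // Disconnected is recoverable: the client library reconnects on its
        // own and the session (and its ephemerals) survive, so do nothing.
    }

    public static void main(String[] args) {
        AtomicInteger restarts = new AtomicInteger();
        Restartable fake = new Restartable() {
            public void shutdown() { restarts.incrementAndGet(); }
            public void run() { }
        };
        ExpiryWatcher w = new ExpiryWatcher(fake);
        w.process(KeeperState.Disconnected); // no restart
        w.process(KeeperState.Expired);      // restart once
        System.out.println(restarts.get());  // prints 1
    }
}
{code}

The key point is that only Expired triggers a restart; reacting the same way to a transient Disconnected would restart healthy servers.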

      was (Author: nitay):
    When a SessionExpired occurs we will lose our ephemeral nodes. This means everyone else in the cluster will think that node is down. To fix this we need to restart the node completely.

For example, if the master's connection to ZooKeeper throws SessionExpired it loses its ephemeral address node in ZooKeeper and everyone will think the master has died. In fact, another master may come up now that we have the HA master lock.

I'm writing the #restart() methods for HMaster and HRegionServer. Effectively it's just something like:

{code}
  shutdown();
  run();
{code}

I notice that the shutdown/stop methods in those classes just set a flag which is later picked up and causes a shutdown. How do I make sure the server is actually shutdown between the shutdown() call and the run() call?
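One common answer to that question (a sketch under assumptions, not HBase's actual code: class and method names here are illustrative) is to have the main loop count down a latch when it really exits, so restart() can block on it between the shutdown() and run() calls:

{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Flag-based shutdown as described above, plus a latch the main loop
// counts down on exit so restart() can wait for the shutdown to finish.
public class RestartableServer {
    final AtomicBoolean stopRequested = new AtomicBoolean(false);
    volatile CountDownLatch stopped = new CountDownLatch(1);
    final AtomicLong ticks = new AtomicLong(); // work counter, for observation

    // Main loop: polls the stop flag; signals "really stopped" on exit.
    public void run() {
        while (!stopRequested.get()) {
            doWork();
        }
        stopped.countDown();
    }

    protected void doWork() {
        ticks.incrementAndGet();
        try { Thread.sleep(1); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public void shutdown() { stopRequested.set(true); }

    // shutdown(); run(); -- but with an explicit wait in between so the
    // new instance never overlaps the old one.
    public void restart() throws InterruptedException {
        CountDownLatch oldStopped = stopped;
        shutdown();
        oldStopped.await();               // block until run() has fully exited
        stopRequested.set(false);
        stopped = new CountDownLatch(1);  // fresh latch for the new run
        new Thread(() -> run()).start();
    }
}
{code}

Joining the server thread would work equally well; the latch just avoids having to hold a reference to the Thread object.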
  
> zookeeper client wont reconnect if there is a problem
> -----------------------------------------------------
>
>                 Key: HBASE-1232
>                 URL: https://issues.apache.org/jira/browse/HBASE-1232
>             Project: Hadoop HBase
>          Issue Type: Bug
>         Environment: java 1.7, zookeeper 3.0.1
>            Reporter: ryan rawson
>            Assignee: Nitay Joffe
>            Priority: Critical
>             Fix For: 0.20.0
>
>
> my regionserver got wedged:
> 2009-03-02 15:43:30,938 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to create /hbase:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:87)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:35)
>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:482)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureExists(ZooKeeperWrapper.java:219)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureParentExists(ZooKeeperWrapper.java:240)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.checkOutOfSafeMode(ZooKeeperWrapper.java:328)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:783)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:468)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:443)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:518)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:477)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:450)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocation(HConnectionManager.java:295)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocationForRowWithRetries(HConnectionManager.java:919)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:950)
>         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1370)
>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1314)
>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1294)
>         at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:237)
>         at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:216)
>         at org.apache.hadoop.hbase.RegionHistorian.addRegionSplit(RegionHistorian.java:174)
>         at org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:607)
>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:174)
>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:107)
> this message repeats over and over.  
> Looking at the code in question:
>   private boolean ensureExists(final String znode) {
>     try {
>       zooKeeper.create(znode, new byte[0],
>                        Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
>       LOG.debug("Created ZNode " + znode);
>       return true;
>     } catch (KeeperException.NodeExistsException e) {
>       return true;      // ok, move on.
>     } catch (KeeperException.NoNodeException e) {
>       return ensureParentExists(znode) && ensureExists(znode);
>     } catch (KeeperException e) {
>       LOG.warn("Failed to create " + znode + ":", e);
>     } catch (InterruptedException e) {
>       LOG.warn("Failed to create " + znode + ":", e);
>     }
>     return false;
>   }
> We need to catch this exception specifically and reopen the ZK connection.
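A sketch of what "catch this exception specifically" could look like. The exception classes and the reconnect() hook below are local stand-ins mirroring ZooKeeper's KeeperException hierarchy, so the retry logic is runnable on its own; the real fix would use org.apache.zookeeper.KeeperException.SessionExpiredException and rebuild the ZooKeeper handle:

{code}
// Distinguish a transient error (the client library reconnects on its own,
// so a plain retry is fine) from session expiry (the handle is dead and
// must be replaced before any retry can succeed).
class KeeperEx extends Exception {}
class SessionExpiredEx extends KeeperEx {}
class ConnectionLossEx extends KeeperEx {}

public class ZkRetry {
    interface ZkOp { void run() throws KeeperEx; }

    // Hypothetical hook that would open a new ZooKeeper connection and
    // re-register watchers; here it just records that it was called.
    int reconnects = 0;
    void reconnect() { reconnects++; }

    boolean ensureExists(ZkOp createZnode, int maxRetries) {
        for (int i = 0; i < maxRetries; i++) {
            try {
                createZnode.run();
                return true;
            } catch (SessionExpiredEx e) {
                // A retry on the same handle can never succeed:
                // open a fresh connection first, then retry.
                reconnect();
            } catch (KeeperEx e) {
                // Transient (e.g. connection loss): safe to just retry.
            }
        }
        return false;
    }
}
{code}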

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.