
Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2010/09/14 23:54:45 UTC

[jira] Created: (HBASE-2998) rolling-restart.sh shouldn't rely on zoo.cfg

rolling-restart.sh shouldn't rely on zoo.cfg
--------------------------------------------

                 Key: HBASE-2998
                 URL: https://issues.apache.org/jira/browse/HBASE-2998
             Project: HBase
          Issue Type: Bug
            Reporter: Jean-Daniel Cryans
             Fix For: 0.90.0


I tried the rolling-restart script on our dev environment, which is configured with zoo.cfg for zookeeper, and it worked pretty well. Then I tried it on our MR cluster, which doesn't have a zoo.cfg, and we suffered some downtime (no biggie tho, nothing critical was running). When the script calls this line:

{code}
bin/hbase zkcli stat $zmaster
{code}

It directly runs a ZooKeeperMain which isn't modified to read from the HBase configuration files. What happens next, if ZK isn't running on the master node, is that it receives a ConnectionRefused, ignores it, proceeds to restart the master (which waits on the znode), and then starts restarting the region servers. They can't shut down properly within 60 seconds, since they need a master, so they get killed. What follows is pretty ugly and pretty much requires a whole restart.
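The failure described above comes down to a script carrying on after a precondition check has failed. A minimal shell sketch of a guard follows. It assumes the check command exits non-zero when it cannot reach ZooKeeper, which this report does not confirm for ZooKeeperMain, and run_guarded is a hypothetical helper, not part of rolling-restart.sh:

```shell
# Hypothetical helper: run a precondition check and report failure
# instead of silently carrying on.
# Assumption: the check command exits non-zero when it fails.
run_guarded() {
  description="$1"
  shift
  if "$@"; then
    return 0
  fi
  echo "precondition failed: ${description}" >&2
  return 1
}

# Usage sketch (commented out; command and $zmaster as in the report):
#   run_guarded "master znode reachable" bin/hbase zkcli stat "$zmaster" || exit 1
```

With a guard like this the script would stop before touching the master or the region servers, rather than restarting them against an ensemble it never reached.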

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (HBASE-2998) rolling-restart.sh shouldn't rely on zoo.cfg

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-2998.
--------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]

Thanks for the review Jon.  I did as you suggested (and that test passes).  I just tried it too up on cluster w/ 5 node ensemble.  Committing.




[jira] Updated: (HBASE-2998) rolling-restart.sh shouldn't rely on zoo.cfg

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2998:
-------------------------

    Priority: Critical  (was: Major)

This has to work.  Making it critical.


[jira] Commented: (HBASE-2998) rolling-restart.sh shouldn't rely on zoo.cfg

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923991#action_12923991 ] 

HBase Review Board commented on HBASE-2998:
-------------------------------------------

Message from: "Jonathan Gray" <jg...@apache.org>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1057/#review1624
-----------------------------------------------------------

Ship it!


+1 after doing changes we discussed on IRC.  Namely to make sure that the shutdown methods are idempotent and will work for stopping a backup master and that TestMasterFailover passes.  Also some minor logging/comment changes around deleting root location.

- Jonathan






[jira] Updated: (HBASE-2998) rolling-restart.sh shouldn't rely on zoo.cfg

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-2998:
-------------------------

    Attachment: 2998.txt

Not finished yet....


[jira] Commented: (HBASE-2998) rolling-restart.sh shouldn't rely on zoo.cfg

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923269#action_12923269 ] 

HBase Review Board commented on HBASE-2998:
-------------------------------------------

Message from: "Jonathan Gray" <jg...@apache.org>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1057/#review1594
-----------------------------------------------------------


Looking good!


trunk/src/main/java/org/apache/hadoop/hbase/regionserver/ShutdownHook.java
<http://review.cloudera.org/r/1057/#comment5394>

    


- Jonathan






[jira] Commented: (HBASE-2998) rolling-restart.sh shouldn't rely on zoo.cfg

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923372#action_12923372 ] 

HBase Review Board commented on HBASE-2998:
-------------------------------------------

Message from: stack@duboce.net

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1057/
-----------------------------------------------------------

(Updated 2010-10-21 01:54:34.658192)


Review request for hbase, Jean-Daniel Cryans and Jonathan Gray.


Changes
-------

New patch includes faster assign of regions on startup (uses async create/exists-set-watcher).  Getting this working helps w/ rolling restart tests.  Assign and watcher set for 2k regions runs fast now... used to be 90 seconds for 2k regions over 10 servers ... now it's a matter of seconds for total bulk assign of all regions in just over a minute.

This patch is not yet ready.  I need to test more.


Summary
-------

Fix 'hbase zkcli' so it reads zk ensemble location from hbase config/zoo.cfg.  This fixes rolling restart.  Patch also includes fix so rolling restarts work on new master.

A src/test/java/org/apache/hadoop/hbase/zookeeper/TestZooKeeperMainServerArg.java
  Test for new TZMSA class.
M src/main/java/org/apache/hadoop/hbase/zookeeper/ZKServerTool.java
  Minor edit of javadoc.
A src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperMainServerArg.java
  Tool to emit what ZooKeeperMain wants for a server argument.
M src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
  (isAbort): Added.
M src/main/java/org/apache/hadoop/hbase/regionserver/ShutdownHook.java
  Shutdown hook now needs to start the region shutdowns since the new
  master changed how the shutdown sequence runs.
M src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
  Don't do opens if server is stopped.
M src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java
  Minor formatting.
M bin/hbase
  Run new ZKMSA tool to figure '-server host:port' to pass ZKM
M bin/hbase-daemon.sh
  Make default wait be longer.
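
As a rough illustration of what a tool like ZooKeeperMainServerArg has to produce, the shell sketch below turns a comma-separated quorum host list and a client port into a host:port connect string suitable for '-server'. The function name, its inputs, and the choice to emit every host rather than a single one are assumptions made for illustration; the Java class in the patch may well behave differently.

```shell
# Hypothetical helper, not the patch's Java implementation.
# $1: comma-separated ZooKeeper hosts (e.g. the value of hbase.zookeeper.quorum)
# $2: ZooKeeper client port
build_zk_server_arg() {
  quorum="$1"
  port="$2"
  result=""
  old_ifs="$IFS"
  IFS=','
  # Unquoted on purpose: split the host list on commas.
  for host in $quorum; do
    # Prefix a comma only when result already holds an entry.
    result="${result:+${result},}${host}:${port}"
  done
  IFS="$old_ifs"
  printf '%s\n' "$result"
}

# Usage sketch:
#   build_zk_server_arg "zk1,zk2,zk3" 2181
```

ZooKeeper clients accept a comma-separated list of host:port pairs as a connect string, which is why the sketch joins all hosts instead of picking one.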


This addresses bug hbase-2998.
    http://issues.apache.org/jira/browse/hbase-2998


Diffs (updated)
-----

  trunk/bin/hbase 1025815 
  trunk/bin/hbase-daemon.sh 1025815 
  trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1025815 
  trunk/src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java 1025815 
  trunk/src/main/java/org/apache/hadoop/hbase/executor/ExecutorService.java 1025815 
  trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1025815 
  trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1025815 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1025815 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1025815 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKServerTool.java 1025815 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKUtil.java 1025815 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWatcher.java 1025815 

Diff: http://review.cloudera.org/r/1057/diff


Testing
-------


Thanks,

stack





[jira] Commented: (HBASE-2998) rolling-restart.sh shouldn't rely on zoo.cfg

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923970#action_12923970 ] 

HBase Review Board commented on HBASE-2998:
-------------------------------------------

Message from: stack@duboce.net

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1057/
-----------------------------------------------------------

(Updated 2010-10-22 11:59:16.222128)


Review request for hbase, Jean-Daniel Cryans and Jonathan Gray.


Changes
-------

Here is updated patch. I've been testing it up on cluster.  It fixes a bunch of things that rolling restart unearths...but there is still work to do.  Meantime, this patch is growing beyond scope of the JIRA so would like to get it in in its current state.

Includes consideration of Jon's last review -- shutdown handler didn't need
amending after all.

Includes refactor of master run because failure during processing of failover
was having the master exit but not go down -- rpc server was left up.


Summary
-------

Fix 'hbase zkcli' so it reads zk ensemble location from hbase config/zoo.cfg.  This fixes rolling restart.  Patch also includes fix so rolling restarts work on new master.

A src/test/java/org/apache/hadoop/hbase/zookeeper/TestZooKeeperMainServerArg.java
  Test for new TZMSA class.
M src/main/java/org/apache/hadoop/hbase/zookeeper/ZKServerTool.java
  Minor edit of javadoc.
A src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperMainServerArg.java
  Tool to emit what ZooKeeperMain wants for a server argument.
M src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
  (isAbort): Added.
M src/main/java/org/apache/hadoop/hbase/regionserver/ShutdownHook.java
  Shutdown hook now needs to start the region shutdowns since the new
  master changed how the shutdown sequence runs.
M src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java
  Don't do opens if server is stopped.
M src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java
  Minor formatting.
M bin/hbase
  Run new ZKMSA tool to figure '-server host:port' to pass ZKM
M bin/hbase-daemon.sh
  Make default wait be longer.


This addresses bug hbase-2998.
    http://issues.apache.org/jira/browse/hbase-2998


Diffs (updated)
-----

  trunk/bin/hbase 1026448 
  trunk/bin/hbase-daemon.sh 1026448 
  trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1026448 
  trunk/src/main/java/org/apache/hadoop/hbase/executor/ExecutorService.java 1026448 
  trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1026448 
  trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1026448 
  trunk/src/main/java/org/apache/hadoop/hbase/master/handler/OpenedRegionHandler.java 1026448 
  trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java 1026448 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1026448 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKServerTool.java 1026448 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKUtil.java 1026448 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperMainServerArg.java PRE-CREATION 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZooKeeperWatcher.java 1026448 
  trunk/src/test/java/org/apache/hadoop/hbase/zookeeper/TestZooKeeperMainServerArg.java PRE-CREATION 

Diff: http://review.cloudera.org/r/1057/diff


Testing
-------


Thanks,

stack





[jira] Assigned: (HBASE-2998) rolling-restart.sh shouldn't rely on zoo.cfg

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack reassigned HBASE-2998:
----------------------------

    Assignee: stack


[jira] Commented: (HBASE-2998) rolling-restart.sh shouldn't rely on zoo.cfg

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12922854#action_12922854 ] 

stack commented on HBASE-2998:
------------------------------

So, ZKMain takes a -server argument which is host+port.  I was thinking of doing something like the following in bin/hbase so that, when passed zkcli, we actually pass in the -server argument with the host+port read from hbase Configuration (or from zoo.cfg if present):

{code}
Index: bin/hbase
===================================================================
--- bin/hbase   (revision 1024523)
+++ bin/hbase   (working copy)
@@ -262,7 +262,8 @@
     HBASE_OPTS="$HBASE_OPTS $HBASE_ZOOKEEPER_OPTS"
   fi
 elif [ "$COMMAND" = "zkcli" ] ; then
-  CLASS='org.apache.zookeeper.ZooKeeperMain'
+  SERVERPORT=`"$bin"/hbase org.apache.hadoop.hbase.zookeeper.ZKServerTool -hostport | grep '^ZK hostport:' | sed 's,^ZK hostport:,,'` 
+  CLASS="org.apache.zookeeper.ZooKeeperMain -server ${SERVERPORT}"
 elif [ "$COMMAND" = "classpath" ] ; then
   echo $CLASSPATH
   exit 0
{code}

I need to mangle the ZKServerTool some to output what I want.
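
The extraction step in the proposal can be tried in isolation. The sketch below assumes the tool prints a line of the form "ZK hostport:<host:port>", which is the format the grep/sed pipeline expects and not something ZKServerTool is confirmed to emit; extract_hostport and the sample output are hypothetical. The class-plus-arguments string is built in double quotes, since single quotes in shell suppress variable expansion.

```shell
# Hypothetical helper: read tool output on stdin and print the value
# following the assumed "ZK hostport:" prefix (first match only).
extract_hostport() {
  sed -n 's,^ZK hostport:,,p' | head -n 1
}

# Simulated tool output; the real command and its output format are assumptions.
sample_output='some unrelated log line
ZK hostport:zk1.example.com:2181'

SERVERPORT=$(printf '%s\n' "$sample_output" | extract_hostport)

# Double quotes so that ${SERVERPORT} is expanded at assignment time.
CLASS="org.apache.zookeeper.ZooKeeperMain -server ${SERVERPORT}"
echo "$CLASS"
```

Using sed -n with the p flag folds the grep and sed steps into one command, and head -n 1 keeps the result to a single line if the prefix shows up more than once.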


[jira] Commented: (HBASE-2998) rolling-restart.sh shouldn't rely on zoo.cfg

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923373#action_12923373 ] 

HBase Review Board commented on HBASE-2998:
-------------------------------------------

Message from: stack@duboce.net


bq.  On 2010-10-20 17:54:04, Jonathan Gray wrote:
bq.  > trunk/src/main/java/org/apache/hadoop/hbase/regionserver/ShutdownHook.java, line 103
bq.  > <http://review.cloudera.org/r/1057/diff/1/?file=15040#file15040line103>
bq.  >
bq.  >
bq.  
bq.  Jonathan Gray wrote:
bq.      Not sure where my comment went :)  Just wondering if in the rolling restart case, we aren't aborting, right?
bq.      
bq.      Is the stop() now a hard stop and that's why we need to close regions first?

Looking into this more --prompted by your comment -- this extra callout to shut down user regions in shutdown hook shouldn't be needed.  Something else is going on.  I'm digging in.

Right, we are not aborting when rolling restart.  We are intercepting the jvm kill signal.


- stack


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1057/#review1594
-----------------------------------------------------------




