Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2013/06/15 23:23:19 UTC

[jira] [Created] (HBASE-8748) Be able to accommodate zookeeper going away for a minute or two -- or more

stack created HBASE-8748:
----------------------------

             Summary: Be able to accommodate zookeeper going away for a minute or two -- or more
                 Key: HBASE-8748
                 URL: https://issues.apache.org/jira/browse/HBASE-8748
             Project: HBase
          Issue Type: Brainstorming
          Components: Zookeeper
            Reporter: stack


I was talking w/ Christophe Taton yesterday and he asked what happens if zookeeper goes away for a minute or two -- say, because of a network or ensemble hiccup of some type.

Unless the ensemble comes back inside the zk session timeout, the cluster will go down.

To my knowledge, zk has hiccuped a few times.  There was the bug where sequence numbers rolled around the top, causing the ensemble to blip (fixed in a newer zk).  There was another event where <speculation>some combination of a leader election and accumulated log files (>100k)</speculation> caused the ensemble to blip at SU.

At FB, apparently, the zk session timeout is way up -- > 5 minutes -- in case a top-of-the-rack switch reboots and partitions the network, separating nodes from the zk ensemble. Rather than rely on the presence of ephemeral nodes, they depend on heartbeats to determine whether a regionserver is present (w/ some smarts so that if all members of a rack disappear at the same time, it is treated as unlikely that they all crashed at once).
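
To make the heartbeat idea concrete, here is a minimal sketch of heartbeat-based liveness with a rack-aware check. It is not FB's implementation and not anything in HBase; the class name HeartbeatTracker, its methods, and the "whole rack silent means partition" rule are all assumptions made up for illustration.

```python
from collections import defaultdict


class HeartbeatTracker:
    """Hypothetical tracker: judges regionserver liveness from heartbeats
    instead of from ephemeral znodes."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # server name -> time of last heartbeat (seconds)
        self.rack_of = {}    # server name -> rack id

    def heartbeat(self, server, rack, now):
        """Record a heartbeat from `server`, located in `rack`, at time `now`."""
        self.last_seen[server] = now
        self.rack_of[server] = rack

    def classify(self, now):
        """Return (dead, suspected_partition) as two sets of server names."""
        by_rack = defaultdict(list)
        for server, rack in self.rack_of.items():
            by_rack[rack].append(server)

        dead, suspected_partition = set(), set()
        for servers in by_rack.values():
            silent = [s for s in servers
                      if now - self.last_seen[s] > self.timeout_s]
            # Assumed heuristic: if every server in a multi-server rack went
            # silent together, suspect the rack switch rather than the servers.
            if len(servers) > 1 and len(silent) == len(servers):
                suspected_partition.update(silent)
            else:
                dead.update(silent)
        return dead, suspected_partition
```

With a 30 second timeout, if rs1 and rs2 on rackA both stop heartbeating while only rs4 on rackB does, classify() would report rs4 as dead and hold rs1 and rs2 back as a suspected partition. What to do with the suspected set (wait longer, fence, page someone) is exactly the sort of thing that would need discussion.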

I know I am stating the obvious, but our base presumption that zk will just always be there is lazy on our part, and we should not be acting as though it will be.

Marking this a brainstorming issue because it will need a bit of discussion/design to undo our current presumption.
