Posted to dev@hbase.apache.org by "Michael Stack (Jira)" <ji...@apache.org> on 2019/12/20 22:28:01 UTC

[jira] [Resolved] (HBASE-8748) Be able to accommodate zookeeper going away for a minute or two -- or more

     [ https://issues.apache.org/jira/browse/HBASE-8748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Stack resolved HBASE-8748.
----------------------------------
    Resolution: Won't Fix

Stale. Context is different now.

> Be able to accommodate zookeeper going away for a minute or two -- or more
> --------------------------------------------------------------------------
>
>                 Key: HBASE-8748
>                 URL: https://issues.apache.org/jira/browse/HBASE-8748
>             Project: HBase
>          Issue Type: Brainstorming
>          Components: Zookeeper
>            Reporter: Michael Stack
>            Priority: Major
>
> I was talking w/ Christophe Taton yesterday and he asked what happens if zookeeper goes away for a minute or two -- say, because of a network or ensemble hiccup of some type.
> Unless the ensemble comes back inside the zk session timeout, the cluster will go down.
> To my knowledge, zk has hiccuped a few times. There was the bug where sequence numbers rolled over the top, causing the ensemble to blip (fixed in a newer zk). There was another event where <speculation>some combination of a leader election and accumulated log files (>100k)</speculation> caused the ensemble to blip at SU.
> At FB, apparently, the zk session timeout is set way up -- more than 5 minutes -- in case a top-of-the-rack switch reboots and partitions the network, separating nodes from the zk ensemble. Rather than rely on the presence of ephemeral nodes, they depend on heartbeats to determine whether a regionserver is present (w/ some smarts so that if all members of a rack disappear at the same time, it is treated as unlikely that they all crashed at the same time).
> I am stating the obvious, I know, but the base presumption that zk will just always be there is lazy on our part, and we should not be acting as though it will be.
> Marking this as a brainstorming issue because undoing our current presumption will need a bit of discussion/design.
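
For illustration only, the heartbeat-plus-rack-awareness idea described above could be sketched roughly as follows. This is not HBase or ZooKeeper code, and it is not known how FB actually implemented it. The class name HeartbeatLivenessTracker, its methods, and the "whole rack silent means suspect the switch" rule are all hypothetical, and the sketch simplifies "disappear at the same time" to "all silent when checked".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch of heartbeat-based liveness with a rack-aware guard.
 * Not an HBase or ZooKeeper API; all names here are invented for illustration.
 */
class HeartbeatLivenessTracker {
    private final long timeoutMillis;
    // server name -> rack it lives on
    private final Map<String, String> rackOf = new HashMap<>();
    // server name -> timestamp of its most recent heartbeat
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatLivenessTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Record a heartbeat from a server on the given rack. */
    void heartbeat(String server, String rack, long nowMillis) {
        rackOf.put(server, rack);
        lastSeen.put(server, nowMillis);
    }

    /**
     * Servers to presume dead at nowMillis. A server is silent if its last
     * heartbeat is older than the timeout. If every server on a multi-server
     * rack is silent, assume a switch or partition problem and do not declare
     * any of them dead (an assumption of this sketch).
     */
    List<String> presumedDead(long nowMillis) {
        Map<String, List<String>> silentByRack = new HashMap<>();
        Map<String, Integer> totalByRack = new HashMap<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            String server = e.getKey();
            String rack = rackOf.get(server);
            totalByRack.merge(rack, 1, Integer::sum);
            if (nowMillis - e.getValue() > timeoutMillis) {
                silentByRack.computeIfAbsent(rack, r -> new ArrayList<>()).add(server);
            }
        }
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : silentByRack.entrySet()) {
            int total = totalByRack.get(e.getKey());
            boolean wholeRackSilent = e.getValue().size() == total;
            if (wholeRackSilent && total > 1) {
                continue; // likely a rack-level outage, not simultaneous crashes
            }
            dead.addAll(e.getValue());
        }
        Collections.sort(dead); // stable, predictable ordering for callers
        return dead;
    }
}
```

With a tracker like this, a single regionserver that stops heartbeating is declared dead after the timeout, while a rack that goes quiet all at once is held back for a different kind of handling. What that handling should be is exactly the sort of design question the issue leaves open.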



--
This message was sent by Atlassian Jira
(v8.3.4#803005)