Posted to user@zookeeper.apache.org by J316 Services <j3...@icloud.com> on 2016/06/02 22:06:42 UTC

zookeeper deployment strategy for multi data centers

We have two data centers, with two servers at each.

In the event of a data center failure, with the quorum majority rule, the surviving data center seems to be of no use at all and we'll be out of luck.

Any thoughts on how best to deploy in this scenario?


Appreciate your thoughts on this.


Thanks,
Nomar




Re: zookeeper deployment strategy for multi data centers

Posted by Nomar Morado <j3...@icloud.com>.
We need to keep ZooKeeper up and running in the event of a catastrophic loss of one data center.

Re: zookeeper deployment strategy for multi data centers

Posted by sagar shukla <sa...@yahoo.com.INVALID>.
What combination of functional scenarios are you looking for?
Regards, Sagar

Re: zookeeper deployment strategy for multi data centers

Posted by Flavio Junqueira <fp...@apache.org>.
There is no black magic going on with weights or hierarchies. Whatever you do, you can't escape from the curse of the intersection, and I'm sorry for stating the obvious, but that's the key reason why masking a data center going down out of two is so hard.

In general, using majorities makes it simpler to reason about the problem, in particular because the failure scenarios are uniform. With weights, you can play some tricks, but they do not necessarily give you a great advantage. For example, say I have 6 servers, split them evenly between two data centers, and assign a weight of 2 to one of them. I can form a quorum in a single data center despite the fact that I have an equal number of servers in each data center. I could alternatively have 5 voting members and one observer and get similar behavior. One difference is that when the server with weight 2 is up, I can tolerate up to 3 nodes crashing. If the server with weight 2 is down, then I can tolerate only one additional crash. With the 5V+1O configuration, I can always tolerate two crashed servers.
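
A minimal zoo.cfg sketch of that weighted 6-server layout (host names are made up; per the admin guide, weight.x is used along with group.x, so a single group holding every server gives the flat weighted majority described above):

    server.1=dcA-zk1:2888:3888
    server.2=dcA-zk2:2888:3888
    server.3=dcA-zk3:2888:3888
    server.4=dcB-zk1:2888:3888
    server.5=dcB-zk2:2888:3888
    server.6=dcB-zk3:2888:3888
    # one group containing all six servers, so only the weights matter
    group.1=1:2:3:4:5:6
    # total weight is 7, so a quorum needs weight 4; servers 1-3 reach that alone
    weight.1=2
    weight.2=1
    weight.3=1
    weight.4=1
    weight.5=1
    weight.6=1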

The hierarchical stuff starts making sense for me with more complex scenarios. For example, say I have three data centers and I put three servers in each. Using a hierarchy, I can tolerate one data center going down plus one crash in each of the remaining data centers, so I can make progress with 4 nodes up, even though my total is 9. This works because we pick a majority of votes from a majority of groups. Note that it is not any 4 servers, though. If one data center is down, then I can't crash two nodes in one of the remaining data centers. Also, because we only require 4 votes to form a quorum, we require fewer votes to commit requests, and possibly fewer cross-DC votes. This last observation affects mostly failure scenarios because with majority, one also needs only two cross-dc votes to form a quorum in the absence of crashes.
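
A zoo.cfg sketch of the 3x3 hierarchical layout (host names are hypothetical):

    server.1=dc1-zk1:2888:3888
    server.2=dc1-zk2:2888:3888
    server.3=dc1-zk3:2888:3888
    server.4=dc2-zk1:2888:3888
    server.5=dc2-zk2:2888:3888
    server.6=dc2-zk3:2888:3888
    server.7=dc3-zk1:2888:3888
    server.8=dc3-zk2:2888:3888
    server.9=dc3-zk3:2888:3888
    # one group per DC: a quorum is a majority of votes (2 of 3)
    # in a majority of groups (2 of 3), so 4 well-placed servers
    # out of 9 are enough to make progress
    group.1=1:2:3
    group.2=4:5:6
    group.3=7:8:9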

The scenario that has traditionally been a problem, and that's really not new, is the active-passive one, which we can't really make work transparently. The workaround is to manually reconfigure servers, knowing that there could be some data loss. Even with reconfiguration in 3.5, we are still stuck in the case where the active DC goes down, because we need an old quorum to get the reconfiguration to succeed.

It is possible that the hierarchical approach makes some sense for some scenarios because of the following argument. It is not always desirable, but say that you want to replicate synchronously across data centers. If I use the 5V + 1O configuration with 3V in the active DC and 2V in the passive DC, I can't guarantee that committed requests will be persisted on at least one node in the passive DC. In fact, to make that guarantee, we need a whole DC's worth of votes + 1. With the hierarchical approach, we can have two groups, which forces every commit to be replicated in both groups, yet I can still tolerate crashes in both groups and guarantee that updates are synchronously replicated. If the active DC goes down, we can reconfigure manually from two groups to one group, and since the replication is synchronous, the passive DC will have all commits.
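
And a sketch of the two-group, two-DC variant (host names again hypothetical); with only two groups, a "majority of groups" means both of them, so every quorum, and therefore every commit, has to include a majority of each DC:

    server.1=active-zk1:2888:3888
    server.2=active-zk2:2888:3888
    server.3=active-zk3:2888:3888
    server.4=passive-zk1:2888:3888
    server.5=passive-zk2:2888:3888
    server.6=passive-zk3:2888:3888
    # a quorum needs 2 of 3 in BOTH groups: it tolerates one crash per DC,
    # but losing a whole DC requires the manual reconfiguration described above
    group.1=1:2:3
    group.2=4:5:6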

Hopefully this analysis is correct and makes some sense. Cross-dc replication is a fascinating topic!

-Flavio



Re: zookeeper deployment strategy for multi data centers

Posted by Camille Fournier <ca...@apache.org>.
You don't need weights to run this cluster successfully across 2
datacenters, unless you want to run with 4 live read/write nodes, which
isn't really a recommended setup (we advise odd numbers because running
with even numbers doesn't generally buy you anything).

I would probably run 3 voting members and 1 observer if you want to run 4
nodes. In that setup you can lose any one voting node, and of course the
observer, and be fine. If you lose 2 voting nodes, whether in the same DC
or x-DC, you will not be able to continue. But votes only need to be acked
by any 2 servers to be committed.
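
A zoo.cfg sketch of that 3-voter / 1-observer layout (host names invented for illustration):

    server.1=dcA-zk1:2888:3888
    server.2=dcA-zk2:2888:3888
    server.3=dcB-zk1:2888:3888
    # the fourth node joins as a non-voting observer
    server.4=dcB-zk2:2888:3888:observer
    # and server 4's own config also sets: peerType=observer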

In the case of weights and 4 servers, you will either need to ack both of
the servers in the weighted datacenter or the 2 in the unweighted DC and
one in the weighted DC.

I've actually yet to see the killer app for using hierarchy and weights,
although I'd be interested in hearing about it if someone has an example.
It's not clear that there's huge value here unless the observer is
significantly less effective than a full r/w quorum member, which would be
surprising.

C



Re: zookeeper deployment strategy for multi data centers

Posted by Dan Benediktson <db...@twitter.com.INVALID>.
Weights will at least let you do better: if you weight it, you can make it
so that datacenter A will survive even if datacenter B goes down, but not
the other way around. While not ideal, it's probably better than the
non-weighted alternative. (2, 2, 1, 1) weights might work fairly well: as
long as any three machines are up, or both machines in the preferred
datacenter, quorum can be achieved.
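
A zoo.cfg sketch of the (2, 2, 1, 1) idea (host names made up; weights are used along with groups, so a single group holds all four servers):

    server.1=dcA-zk1:2888:3888
    server.2=dcA-zk2:2888:3888
    server.3=dcB-zk1:2888:3888
    server.4=dcB-zk2:2888:3888
    group.1=1:2:3:4
    # total weight 6, so a quorum needs more than 3: both dcA servers (2+2),
    # or any three servers (at least 1+1+2). dcA survives losing dcB,
    # but not the other way around.
    weight.1=2
    weight.2=2
    weight.3=1
    weight.4=1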


Re: zookeeper deployment strategy for multi data centers

Posted by Camille Fournier <ca...@apache.org>.
You can't solve this with weights.

Re: zookeeper deployment strategy for multi data centers

Posted by Michael Han <ha...@cloudera.com>.
ZK supports more than just the majority quorum rule; there are also weight-
and group-hierarchy-based quorums [1]. So one could probably assign more
weight to one of the two data centers, which could then form a weight-based
quorum even if the other DC is failing?

Another idea is, instead of forming a single ZK ensemble across DCs, to
form multiple ZK ensembles with one ensemble per DC. This solution might be
applicable to heavy-read / light-write workloads while providing a certain
degree of fault tolerance. Some relevant discussion is at [2].

[1] https://zookeeper.apache.org/doc/r3.4.5/zookeeperAdmin.html (weight.x=nnnnn, group.x=nnnnn[:nnnnn]).
[2] https://www.quora.com/Has-anyone-deployed-a-ZooKeeper-ensemble-across-data-centers
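
For the one-ensemble-per-DC idea, the two ensembles are completely independent; each DC just runs its own ordinary config, roughly like this (host names hypothetical), and the application has to handle any cross-DC reconciliation itself:

    # zoo.cfg shared by the three DC-A servers
    server.1=dcA-zk1:2888:3888
    server.2=dcA-zk2:2888:3888
    server.3=dcA-zk3:2888:3888

    # a completely separate zoo.cfg shared by the three DC-B servers
    server.1=dcB-zk1:2888:3888
    server.2=dcB-zk2:2888:3888
    server.3=dcB-zk3:2888:3888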




-- 
Cheers
Michael.

Re: zookeeper deployment strategy for multi data centers

Posted by Alexander Shraer <sh...@gmail.com>.
> Are there any settings to override the quorum rule? Would you know the
> rationale behind it?

The rule comes from a theoretical impossibility saying that you must have
n > 2f replicas to tolerate f failures, for any algorithm trying to solve
consensus while being able to handle periods of asynchrony (unbounded
message delays, processing times, etc.). The earliest proof is probably this
paper: <https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p225-chandra.pdf>.
ZooKeeper assumes this model, so the bound applies to it.

The intuition is what's called a 'partition argument'. Essentially, if only
2f replicas were sufficient, you could arbitrarily divide them into 2 sets
of f replicas and create a situation where each set of f must go on
independently without coordinating with the other set (split brain) when
the links between the two sets are slow (i.e., a network partition), simply
because the other set could also be down (the algorithm tolerates f
failures) and it can't distinguish the two situations. When n > 2f this can
be avoided, since one of the sets will have a majority while the other set
won't.
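
As a worked instance of the bound (numbers chosen here purely for illustration):

    \[ n = 4,\ f = 2:\ n \le 2f, \text{ so losing a 2-server DC leaves only 2 of 4 voters, not a majority.} \]
    \[ n = 5,\ f = 2:\ n > 2f, \text{ so any 2 failures still leave 3 of 5 voters, a majority.} \]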

The key here is that the links between the two data centers can arbitrarily
delay messages, so an automatic 'fail-over' where one data center decides
that the other one is down is usually considered unsafe. If in your system
you have a reliable way to know that the other data center is really in
fact down (this is a synchrony assumption), you could do as Camille
suggested and reconfigure the system to only include the remaining data
center. This would still be very tricky to do, since this reconfiguration
would have to involve manually changing configuration files and rebooting
servers, while somehow making sure that you're not losing committed state.
So it is not recommended.




Re: zookeeper deployment strategy for multi data centers

Posted by Camille Fournier <ca...@apache.org>.
2 servers is the same as 1 server wrt fault tolerance, so yes, you are
correct. If they want fault tolerance, they have to run 3 (or more).


Re: zookeeper deployment strategy for multi data centers

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/3/2016 1:44 PM, Nomar Morado wrote:
> Are there any settings to override the quorum rule? Would you know the
> rationale behind it? Ideally, you will want to operate the application
> even if only one data center is up.

I do not know if the quorum rule can be overridden, or whether your
application can tell the difference between a loss of quorum and
zookeeper going down entirely.  I really don't know anything about
zookeeper client code or zookeeper internals.

From what I understand, majority quorum is the only way to be
*completely* sure that cluster software like SolrCloud or your
application can handle write operations with confidence that they are
applied correctly.  If you lose quorum, which will happen if only one DC
is operational, then your application should go read-only.  This is what
SolrCloud does.

I am a committer on the Apache Solr project, and Solr uses zookeeper
when it is running in SolrCloud mode.  The cloud code is handled by
other people -- I don't know much about it.

I joined this list because I wanted to have the ZK devs include a
clarification in zookeeper documentation -- oddly enough, related to the
very thing we are discussing.  I wanted to be sure that the
documentation explicitly mentioned that three servers are required for a
fault-tolerant setup.  Some SolrCloud users don't want to accept this as
a fact, and believe that two servers should be enough.

Thanks,
Shawn


Re: zookeeper deployment strategy for multi data centers

Posted by Camille Fournier <ca...@apache.org>.
I wish I had a better answer for you, but you can't safely and automatically
have a setup across 2 datacenters where you are guaranteed that the loss
of one data center won't cause the cluster to go down. So what you want to
do, you cannot do. I wrote a bit about designing x-dc ZK clusters a while
ago; I'm not sure if any of it is relevant to you, but you can read it here:
http://www.elidedbranches.com/2012/12/building-global-highly-available.html


Re: zookeeper deployment strategy for multi data centers

Posted by Nomar Morado <j3...@icloud.com>.
We'll be using ZK with Apache Kafka, and I don't know if I can get away with not using ZK.




Re: zookeeper deployment strategy for multi data centers

Posted by Camille Fournier <ca...@apache.org>.
You could put the remaining available node in read-only mode. You could
reconfigure the cluster to have the majority of nodes in the remaining data
center, but it would require reconfiguration and restart of the nodes in
the living data center. But there's no automatic fix for this, and if you
can safely override the quorum rule for your application, perhaps you don't
need to use ZK at all?
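
For the read-only option, a rough sketch of what is involved (the property and flag names below are from memory of the docs, so treat them as something to verify): read-only mode is off by default and is switched on per server with a Java system property, and clients have to opt in.

    # added to each ZooKeeper server's JVM options
    -Dreadonlymode.enabled=true

On the client side, the Java ZooKeeper constructor takes a canBeReadOnly flag; only clients that set it will be served (reads only) by a server that has lost contact with the quorum.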


Re: zookeeper deployment strategy for multi data centers

Posted by Nomar Morado <j3...@icloud.com>.
Thanks, Shawn.

Are there any settings to override the quorum rule? Would you know the rationale behind it?

Ideally, you will want to operate the application even if only one data center is up.

Thanks.



Re: zookeeper deployment strategy for multi data centers

Posted by Shawn Heisey <ap...@elyograg.org>.
On 6/2/2016 4:06 PM, J316 Services wrote:
> We have two data centers, with two servers at each. In the event of a
> data center failure, with the quorum majority rule, the surviving data
> center seems to be of no use at all and we'll be out of luck.

You are correct -- the scenario you've described is not fault tolerant.

When setting up a geographically diverse zookeeper ensemble, there must
be at least three locations, so if there's a complete power or network
failure at one location, the other two can maintain quorum.  One
solution I saw discussed was a fifth tie-breaker server in a cloud
service like Amazon EC2, or you could go full-scale with two more
servers at a third datacenter.
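
A zoo.cfg sketch of the 2 + 2 + 1 variant (host names are placeholders):

    # five voters across three locations; a quorum is any 3 of them,
    # so losing any single location still leaves a majority
    server.1=dcA-zk1:2888:3888
    server.2=dcA-zk2:2888:3888
    server.3=dcB-zk1:2888:3888
    server.4=dcB-zk2:2888:3888
    server.5=ec2-tiebreaker:2888:3888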

Thanks,
Shawn