Posted to dev@mesos.apache.org by Neil Conway <ne...@gmail.com> on 2017/04/17 16:44:54 UTC

[Design doc] RPC: Fault domains in Mesos

Folks,

I'd like to enhance Mesos to support a first-class notion of "fault
domains" -- i.e., identifying the "rack" and "region" (DC) where a
Mesos agent or master is located. The goal is to enable two main
features:

(1) To make it easier to write "rack-aware" Mesos frameworks that are
portable to different Mesos clusters.

(2) To improve the experience of configuring Mesos with a set of
masters and agents in one DC, and another pool of "remote" agents in a
different DC.

For more information, please see the design doc:

https://docs.google.com/document/d/1gEugdkLRbBsqsiFv3urRPRNrHwUC-i1HwfFfHR_MvC8

I'd love any feedback, either directly on the Google doc or via email.

Thanks,
Neil

Re: [Design doc] RPC: Fault domains in Mesos

Posted by Neil Conway <ne...@gmail.com>.
Hi Maxime,

Thanks for the feedback!

The proposed approach is definitely simplistic. The "Discussion"
section of the design doc describes some of the rationale for starting
with a very simple scheme: basically, because

(a) we want to assign clear semantics to the levels of the hierarchy
(regions are far away from each other and inter-region network links
have high latency; racks are close together and inter-rack network
links have low latency).

(b) we don't want to make life too difficult for framework authors.

(c) most server software (e.g., HDFS, Kafka, Cassandra, etc.) only
understands a simple hierarchy -- in many cases, just a single level
("racks"), or occasionally two levels ("racks" and "DCs").

Can you elaborate on the use-cases that you see for a more complex
hierarchy of fault domains? I'd be happy to chat off-list if you'd
prefer.

Thanks!

Neil

On Tue, Apr 18, 2017 at 1:33 AM, Maxime Brugidou
<ma...@gmail.com> wrote:
> Hi Neil,
>
> I really like the idea of incorporating the concept of fault domains in
> Mesos, however I feel like the implementation proposed is a bit narrow to be
> actually useful for most users.
>
> I feel like we could make the fault domains definition more generic. As an
> example in our setup we would like to have something like
> Region > Building > Cage > Pod > Rack. Failure domains would be
> hierarchically arranged
> (meaning one domain in a lower level can only be included in one domain
> above).
>
> As a concrete example, we could have the mesos masters be aware of the fault
> domain hierarchy (with a config map for example), and slaves would just need
> to declare their lowest-level domain (for example their rack id). Then
> frameworks could use this domain hierarchy at will. If they need to "spread"
> their tasks for a very highly available setup, they could first spread using
> the highest fault domain (like the region), then if they have enough tasks
> to launch they could spread within each sub-domain recursively until they
> run out of tasks to spread. We do not need to artificially limit the number
> of levels of fault domains and the name of the fault domains. Schedulers do
> not need to know the names either, just the hierarchy.
>
> Then, to provide the other feature of "remote" slaves that you describe, we
> could configure the mesos master to only send offers from a "default" local
> fault domain, and frameworks would need to advertise a certain capability to
> receive offers for other remote fault domains.
>
> I feel we could implement this by identifying a fault domain with a simple
> list of ids like ["US-WEST-1", "Building 2", "Cage 3", "POD 12", "Rack 3"]
> or ["US-EAST-2", "Building 1"]. Slaves would advertise their lowest-level
> fault domains and schedulers could use this arbitrarily as a hierarchical
> list.
>
> Thanks,
> Maxime
>
> On Mon, Apr 17, 2017 at 6:45 PM Neil Conway <ne...@gmail.com> wrote:
>>
>> Folks,
>>
>> I'd like to enhance Mesos to support a first-class notion of "fault
>> domains" -- i.e., identifying the "rack" and "region" (DC) where a
>> Mesos agent or master is located. The goal is to enable two main
>> features:
>>
>> (1) To make it easier to write "rack-aware" Mesos frameworks that are
>> portable to different Mesos clusters.
>>
>> (2) To improve the experience of configuring Mesos with a set of
>> masters and agents in one DC, and another pool of "remote" agents in a
>> different DC.
>>
>> For more information, please see the design doc:
>>
>>
>> https://docs.google.com/document/d/1gEugdkLRbBsqsiFv3urRPRNrHwUC-i1HwfFfHR_MvC8
>>
>> I'd love any feedback, either directly on the Google doc or via email.
>>
>> Thanks,
>> Neil

Re: [Design doc] RPC: Fault domains in Mesos

Posted by Maxime Brugidou <ma...@gmail.com>.
Hi Neil,

I really like the idea of incorporating the concept of fault domains in
Mesos; however, I feel the proposed implementation is a bit too narrow to
be useful for most users.

I feel we could make the fault domain definition more generic. As an
example, in our setup we would like to have something like
Region > Building > Cage > Pod > Rack. Fault domains would be arranged
hierarchically, meaning a domain at a lower level can only be included in
one domain at the level above.

As a concrete example, we could have the Mesos masters be aware of the fault
domain hierarchy (with a config map, for example), and slaves would just need
to declare their lowest-level domain (for example, their rack id). Frameworks
could then use this domain hierarchy at will. If they need to "spread" their
tasks for a very highly available setup, they could first spread across the
highest fault domain (like the region); then, if they still have tasks to
launch, they could spread within each sub-domain recursively until they run
out of tasks. We do not need to artificially limit the number of levels in
the hierarchy or the names of the fault domains. Schedulers do not need to
know the names either, just the hierarchy.
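
To make this concrete, here is a rough sketch of such a recursive spread
(pure illustration -- the agent ids, domain paths, and function names are
made up, not a proposed API):

from collections import defaultdict
from itertools import cycle

def spread(agents, num_tasks, level=0):
    # `agents` maps an agent id to its domain path, e.g.
    # {"a1": ["US-WEST-1", "Building 2", "Rack 3"], ...}.
    # Returns {agent_id: task_count}, balancing across the domains at
    # `level` first, then recursing inside each sub-domain.
    if num_tasks == 0 or not agents:
        return {}

    # Base case: no domain level left to split on; round-robin over agents.
    if all(level >= len(path) for path in agents.values()):
        counts = {agent: 0 for agent in agents}
        for agent, _ in zip(cycle(sorted(agents)), range(num_tasks)):
            counts[agent] += 1
        return counts

    # Group agents by their domain id at this level (e.g. by region first).
    groups = defaultdict(dict)
    for agent, path in agents.items():
        groups[path[level] if level < len(path) else ""][agent] = path

    # Split tasks as evenly as possible across the groups, then recurse.
    counts = {}
    names = sorted(groups)
    base, extra = divmod(num_tasks, len(names))
    for i, name in enumerate(names):
        counts.update(spread(groups[name], base + (i < extra), level + 1))
    return counts

agents = {
    "a1": ["US-WEST-1", "Building 2", "Rack 3"],
    "a2": ["US-WEST-1", "Building 2", "Rack 7"],
    "a3": ["US-EAST-2", "Building 1", "Rack 1"],
}
print(spread(agents, 5))
# {'a3': 3, 'a1': 1, 'a2': 1} -- the lone US-EAST-2 agent absorbs its
# region's share because we spread at the region level first.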

Then, to provide the other feature of "remote" slaves that you describe, we
could configure the Mesos master to only send offers from a "default" local
fault domain; frameworks would need to advertise a certain capability to
receive offers from other, remote fault domains.
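
As a sketch of the opt-in part, a framework using the v1 HTTP scheduler API
could advertise such a capability when it subscribes. The capability name
below is purely hypothetical, and the host/port are placeholders:

import requests

framework_info = {
    "user": "root",
    "name": "spread-example",
    # Hypothetical capability name -- whatever the design doc settles on.
    "capabilities": [{"type": "REMOTE_DOMAIN_AWARE"}],
}

# SUBSCRIBE call against the leading master's v1 scheduler endpoint.
# The response streams scheduler events (offers, updates, ...) back.
resp = requests.post(
    "http://master.example.com:5050/api/v1/scheduler",
    json={"type": "SUBSCRIBE", "subscribe": {"framework_info": framework_info}},
    headers={"Accept": "application/json"},
    stream=True,
)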

I feel we could implement this by identifying a fault domain with a simple
list of ids like ["US-WEST-1", "Building 2", "Cage 3", "POD 12", "Rack 3"]
or ["US-EAST-2", "Building 1"]. Slaves would advertise their lowest-level
fault domains, and schedulers could interpret this hierarchical list however
they see fit.

Thanks,
Maxime

On Mon, Apr 17, 2017 at 6:45 PM Neil Conway <ne...@gmail.com> wrote:

> Folks,
>
> I'd like to enhance Mesos to support a first-class notion of "fault
> domains" -- i.e., identifying the "rack" and "region" (DC) where a
> Mesos agent or master is located. The goal is to enable two main
> features:
>
> (1) To make it easier to write "rack-aware" Mesos frameworks that are
> portable to different Mesos clusters.
>
> (2) To improve the experience of configuring Mesos with a set of
> masters and agents in one DC, and another pool of "remote" agents in a
> different DC.
>
> For more information, please see the design doc:
>
>
> https://docs.google.com/document/d/1gEugdkLRbBsqsiFv3urRPRNrHwUC-i1HwfFfHR_MvC8
>
> I'd love any feedback, either directly on the Google doc or via email.
>
> Thanks,
> Neil
>

Re: [Design doc] RPC: Fault domains in Mesos

Posted by Neil Conway <ne...@gmail.com>.
Folks,

Thanks to everyone for their feedback! Based on discussions with
members of the Mesos community, we've made a few changes to this
proposal. To summarize:

(1) Renamed "rack" to "zone", both to be a bit more abstract and to
match the terminology used by most public cloud providers. That is, a
fault domain now consists of a zone and a region.

(2) To accommodate future kinds of domains, the DomainInfo message now
has a nested "FaultDomain" field. New types of domains (e.g., latency
domains, power domains) might be represented in the future via
additional fields in DomainInfo, but such extensions are outside the
scope of the current proposal. (A rough sketch of the resulting shape
is below.)

(3) Clarified that allowing an agent to transition from "no configured
domain" to "configured domain" will require an agent drain in the MVP,
and added some discussion of the implementation/framework API
challenges around supporting domain opt-in without a drain.
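
For illustration, here is roughly what an agent's domain could look like to
a framework under the revised proposal, rendered as a Python dict in JSON
form. The field names follow points (1) and (2) above and are a sketch, not
the final protobuf definition:

offer_domain = {
    "fault_domain": {
        "region": {"name": "us-east-1"},   # regions: far apart, high latency
        "zone": {"name": "us-east-1a"},    # zones: close together, low latency
    }
}

# A zone/region-aware scheduler would key its placement decisions off:
region = offer_domain["fault_domain"]["region"]["name"]
zone = offer_domain["fault_domain"]["zone"]["name"]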

The review chain for the MVP of this feature is up now (MESOS-7607).

Neil


On Mon, Apr 17, 2017 at 9:44 AM, Neil Conway <ne...@gmail.com> wrote:
> Folks,
>
> I'd like to enhance Mesos to support a first-class notion of "fault
> domains" -- i.e., identifying the "rack" and "region" (DC) where a
> Mesos agent or master is located. The goal is to enable two main
> features:
>
> (1) To make it easier to write "rack-aware" Mesos frameworks that are
> portable to different Mesos clusters.
>
> (2) To improve the experience of configuring Mesos with a set of
> masters and agents in one DC, and another pool of "remote" agents in a
> different DC.
>
> For more information, please see the design doc:
>
> https://docs.google.com/document/d/1gEugdkLRbBsqsiFv3urRPRNrHwUC-i1HwfFfHR_MvC8
>
> I'd love any feedback, either directly on the Google doc or via email.
>
> Thanks,
> Neil
