Posted to user@mesos.apache.org by lwq Adolph <ke...@gmail.com> on 2016/01/10 08:27:13 UTC

deploy mesos cluster on aws

Hi everyone:
 My future Mesos cluster will be at least 100 nodes, so optimizing
Mesos is important. Could you share your experience running Mesos in a
production environment? For example:
1. monitoring tools for a Mesos cluster
2. tuning of Mesos parameters

Thanks very much

-- 
Thanks & Best Regards
卢文泉 | Adolph Lu
TEL:+86 15651006559
Linker Networks(http://www.linkernetworks.com/)

Re: deploy mesos cluster on aws

Posted by Sharma Podila <sp...@netflix.com>.
We have been running Mesos on AWS since around Mesos 0.15. Running 100s
of agent nodes isn't an issue at all. We currently autoscale the agent
cluster (from a few to several hundred nodes) based on usage, using a custom
framework built on the Fenzo library. We run both batch and service-style
workloads. I am glad to provide additional info if you have specific questions.

We use 3 Mesos masters (spread across 3 zones of an AWS region). Existing
infrastructure provides a 5-node ZooKeeper cluster to use for leader
election.
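For illustration, a minimal sketch of building the ZooKeeper connection string such masters would be pointed at; the hostnames and the `/mesos` znode path are assumptions, not the actual deployment:

```python
# Build the --zk flag value a Mesos master takes, from a ZooKeeper ensemble.
# Hostnames and the /mesos znode path below are illustrative assumptions.
def zk_url(hosts, port=2181, znode="mesos"):
    """Return a zk:// connection string in the form Mesos masters expect."""
    endpoints = ",".join(f"{h}:{port}" for h in hosts)
    return f"zk://{endpoints}/{znode}"

ensemble = ["zk1.example.com", "zk2.example.com", "zk3.example.com",
            "zk4.example.com", "zk5.example.com"]
print(zk_url(ensemble))
# With a 5-node ensemble, a quorum of 3 tolerates 2 node failures.
```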

We leverage existing monitoring tools at Netflix, mostly based on Atlas. We
have a few alerts, such as no ZK leader for a while or no resource offers for
too long, that tie into PagerDuty. Other alerts are at a higher
level, based on the expected behavior of our framework scheduler.
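As a rough illustration, a "no resource offers for too long" check could be reduced to the sketch below; the sampling scheme and threshold are assumptions, not the actual Atlas/PagerDuty configuration:

```python
# Hedged sketch of a "no resource offers for too long" alert, in the spirit
# of the checks described above. The sampling format and threshold are
# assumptions, not the poster's actual configuration.

def offers_stalled(samples, max_stall_secs=600):
    """samples: oldest-first list of (unix_timestamp, cumulative_offer_count)
    scraped periodically from a master metrics endpoint. Returns True when
    the counter has not advanced within the last max_stall_secs."""
    if not samples:
        return False
    now, latest = samples[-1]
    baseline = None
    for ts, count in samples:
        if now - ts >= max_stall_secs:
            baseline = count  # newest sample old enough to compare against
    if baseline is None:
        return False  # not enough history to judge
    return latest == baseline

# Counter flat for 700s -> alert; counter advanced recently -> no alert.
print(offers_stalled([(0, 10), (300, 10), (700, 10)]))  # True
print(offers_stalled([(0, 10), (650, 12), (700, 12)]))  # False
```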

Since we deploy immutable AMIs, our Mesos master upgrades involve deploying a
new ASG with upgraded Mesos masters and then destroying the old ASG. Agent
upgrades also involve bringing up new ASGs, with a coordinated drain-off or
job migration. This strategy mostly works with ease, except when there is a
breaking change across versions (e.g., the new master can't talk to old
agents, or vice versa; this has happened once so far, when the ZK node content
changed from protobuf to JSON). Additional thought will be needed after Mesos
goes 1.0 and defines long-term version compatibility/stability more
formally. I understand this strategy may not appeal to environments with
strict caps on the total number of instances.

Our Mesos agent command line contains several custom attributes that
provide parameters such as the EC2 instance's zone, instance ID, instance
type, etc., which are useful for any constraints that jobs put on
task placement.
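A minimal sketch of how such attributes might be assembled into the agent's `--attributes` flag (Mesos expects `name:value` pairs joined by `;`); the metadata values below are hypothetical:

```python
# Hedged sketch: building a Mesos agent --attributes flag from EC2-style
# instance metadata. The attribute names and values are illustrative.
def attributes_flag(meta):
    """Mesos expects attributes as 'name:value' pairs joined by ';'."""
    pairs = ";".join(f"{k}:{v}" for k, v in meta.items())
    return f"--attributes={pairs}"

meta = {
    "zone": "us-east-1a",          # EC2 availability zone
    "instance_id": "i-0abc123",    # hypothetical instance ID
    "instance_type": "m4.4xlarge",
}
print(attributes_flag(meta))
# --attributes=zone:us-east-1a;instance_id:i-0abc123;instance_type:m4.4xlarge
```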

Our framework runs multiple instances with leader election. We register the
Mesos framework with a long (1 week) failover timeout for re-registration, to
account for any delays in re-registering.
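For illustration, the registration settings described above might look like this; the framework name is hypothetical, and the dict merely mirrors the relevant FrameworkInfo fields rather than being a real scheduler:

```python
# Hedged sketch of the registration settings described above. A real
# scheduler would set these on its FrameworkInfo protobuf; this dict
# just illustrates the values.
ONE_WEEK_SECS = 7 * 24 * 60 * 60  # 604800

framework_info = {
    "name": "example-framework",        # hypothetical name
    "failover_timeout": ONE_WEEK_SECS,  # master keeps tasks running this
                                        # long while the scheduler is away
    "checkpoint": True,                 # agents persist task state
}
print(framework_info["failover_timeout"])  # 604800
```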


On Sat, Jan 9, 2016 at 11:27 PM, lwq Adolph <ke...@gmail.com> wrote:

> Hi everyone:
>  My future Mesos cluster will be at least 100 nodes, so optimizing
> Mesos is important. Could you share your experience running Mesos in a
> production environment? For example:
> 1. monitoring tools for a Mesos cluster
> 2. tuning of Mesos parameters
>
> Thanks very much
>
> --
> Thanks & Best Regards
> 卢文泉 | Adolph Lu
> TEL:+86 15651006559
> Linker Networks(http://www.linkernetworks.com/)
>

Re: deploy mesos cluster on aws

Posted by lwq Adolph <ke...@gmail.com>.
Thanks very much, that's helpful.
I will test Mesos performance on AWS with 100 nodes.
Your monitoring tools and usage experience really help me!

On Sun, Jan 10, 2016 at 11:43 PM, Rodrick Brown <ro...@orchard-app.com>
wrote:

> We run 100% on AWS and have been running Mesos in production since version
> 0.19
> Our cluster consists of 3 dedicated zookeeper nodes (M3.2lx), 3 dedicated
> masters (M3.2lx), 8 dedicated slaves (M4.4xl) and 2 haproxy (M4.Medium)
> instances used in conjunction with marathon-lb for routing requests into
> backend services running on Mesos.
>
> We use Terraform, a HashiCorp tool, for building the physical cluster nodes,
> and Ansible for configuring Mesos, Chronos, Marathon, and Mesos-DNS. For
> monitoring we leverage Datadog, which has built-in integrations for
> tracking various stats in the cluster like CPU, disk, memory, roles, etc.
>
> As for optimization, we currently run two different workloads: ETL
> (Spark/MR/Hadoop) and Scala-based microservices. I've started using
> different attributes to prevent my batch-oriented jobs from consuming too
> many resources and at times blocking my realtime microservices. So
> instead of running all services across all nodes, I use constraints on both
> Marathon and Chronos, which basically partitions my servers into
> two groups.
>
> The only real issue we ran into while running in AWS was sizing
> our masters. Since I knew from the start that I would use my masters
> as dedicated nodes, I started with m3.medium, which ended up being way too
> small, and we would see noisy-neighbor issues: CPU steal was always
> high (~50%), which would cause huge latency and timeouts between my masters,
> slaves, and ZooKeeper. After replacing the m3.mediums with m4.2xl, this
> issue has since gone away.
>
> Let me know if you have any specifics.
>
> --RB
>
> On Jan 10 2016, at 2:27 am, lwq Adolph <ke...@gmail.com> wrote:
>> Hi everyone:
>>  My future Mesos cluster will be at least 100 nodes, so optimizing
>> Mesos is important. Could you share your experience running Mesos in a
>> production environment? For example:
>> 1. monitoring tools for a Mesos cluster
>> 2. tuning of Mesos parameters
>>
>> Thanks very much
>>
>> --
>> Thanks & Best Regards
>> 卢文泉 | Adolph Lu
>> TEL:+86 15651006559
>> Linker Networks(http://www.linkernetworks.com/)
>>
>
> *NOTICE TO RECIPIENTS*: This communication is confidential and intended
> for the use of the addressee only. If you are not an intended recipient of
> this communication, please delete it immediately and notify the sender by
> return email. Unauthorized reading, dissemination, distribution or copying
> of this communication is prohibited. This communication does not constitute
> an offer to sell or a solicitation of an indication of interest to purchase
> any loan, security or any other financial product or instrument, nor is it
> an offer to sell or a solicitation of an indication of interest to purchase
> any products or services to any persons who are prohibited from receiving
> such information under applicable law. The contents of this communication
> may not be accurate or complete and are subject to change without notice.
> As such, Orchard App, Inc. (including its subsidiaries and affiliates,
> "Orchard") makes no representation regarding the accuracy or completeness
> of the information contained herein. The intended recipient is advised to
> consult its own professional advisors, including those specializing in
> legal, tax and accounting matters. Orchard does not provide legal, tax or
> accounting advice.
>



-- 
Thanks & Best Regards
卢文泉 | Adolph Lu
TEL:+86 15651006559
Linker Networks(http://www.linkernetworks.com/)

Re: deploy mesos cluster on aws

Posted by Rodrick Brown <ro...@orchard-app.com>.
Yeah, we cheat using an ELB instance.

I use fixed ports in my Marathon configs and define listeners in the ELB that
point to 2 haproxy instances, each running marathon-lb.

When microservice X is started by Mesos, it's always reachable on its known
port, i.e. 31100, which is defined on the ELB that routes to the microservice
via the marathon-lb generated configs.

  

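For illustration, a fixed-port Marathon app along these lines might look like the sketch below; the app id, image, and `HAPROXY_GROUP` label are assumptions, with 31100 taken from the example above:

```python
import json

# Hedged sketch of a Marathon app definition with a fixed service port, as
# described above. The app id and image are hypothetical; 31100 is the known
# port mentioned in the example; HAPROXY_GROUP is what marathon-lb keys on.
app = {
    "id": "/microservice-x",
    "container": {
        "type": "DOCKER",
        "docker": {"image": "example/microservice-x:latest"},
    },
    "instances": 2,
    "ports": [31100],  # fixed service port the ELB listener targets
    "labels": {"HAPROXY_GROUP": "external"},  # picked up by marathon-lb
}
print(json.dumps(app, indent=2))
```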
> On Jan 10 2016, at 10:46 pm, Jeff Schroeder <jeffschroeder@computer.org> wrote:
>
> On Sunday, January 10, 2016, Rodrick Brown <rodrick@orchard-app.com> wrote:
>
>> We run 100% on AWS and have been running Mesos in production since version
>> 0.19. Our cluster consists of 3 dedicated zookeeper nodes (M3.2lx), 3
>> dedicated masters (M3.2lx), 8 dedicated slaves (M4.4xl) and 2 haproxy
>> (M4.Medium) instances used in conjunction with marathon-lb for routing
>> requests into backend services running on Mesos.
>
> Question: do you have 2 haproxy nodes also running marathon-lb, or 2 haproxy
> nodes with 2 more for marathon-lb? Do you use something like keepalived to
> provide some sort of a VIP, or just cheat and use ELB to balance to your
> haproxy nodes?
>
> Thanks!
>
> --
> Text by Jeff, typos by iPhone


Re: deploy mesos cluster on aws

Posted by Jeff Schroeder <je...@computer.org>.
On Sunday, January 10, 2016, Rodrick Brown <ro...@orchard-app.com> wrote:

> We run 100% on AWS and have been running Mesos in production since version
> 0.19
> Our cluster consists of 3 dedicated zookeeper nodes (M3.2lx), 3 dedicated
> masters (M3.2lx), 8 dedicated slaves (M4.4xl) and 2 haproxy (M4.Medium)
> instances used in conjunction with marathon-lb for routing requests into
> backend services running on Mesos.
>

Question: do you have 2 haproxy nodes also running marathon-lb, or 2 haproxy
nodes with 2 more for marathon-lb? Do you use something like keepalived to
provide some sort of a VIP, or just cheat and use ELB to balance to your
haproxy nodes?

Thanks!


-- 
Text by Jeff, typos by iPhone

Re: deploy mesos cluster on aws

Posted by Rodrick Brown <ro...@orchard-app.com>.
We run 100% on AWS and have been running Mesos in production since version
0.19

Our cluster consists of 3 dedicated zookeeper nodes (M3.2lx), 3 dedicated
masters (M3.2lx), 8 dedicated slaves (M4.4xl) and 2 haproxy (M4.Medium)
instances used in conjunction with marathon-lb for routing requests into
backend services running on Mesos.  
  

We use Terraform, a HashiCorp tool, for building the physical cluster nodes,
and Ansible for configuring Mesos, Chronos, Marathon, and Mesos-DNS. For
monitoring we leverage Datadog, which has built-in integrations for
tracking various stats in the cluster like CPU, disk, memory, roles, etc.

  

As for optimization, we currently run two different workloads: ETL
(Spark/MR/Hadoop) and Scala-based microservices. I've started using
different attributes to prevent my batch-oriented jobs from consuming too many
resources and at times blocking my realtime microservices. So instead of
running all services across all nodes, I use constraints on both Marathon and
Chronos, which basically partitions my servers into two groups.
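A hedged sketch of that kind of partitioning; the attribute name `workload`, its values, and the app ids are assumptions, not the actual setup described above:

```python
import json

# Hedged sketch: partitioning agents into two groups with a custom
# attribute and Marathon-style constraints. The attribute name
# ("workload"), its values, and the app ids are illustrative assumptions.

def constrain(app_id, group):
    """Return a minimal Marathon-style app stub pinned to one agent group."""
    return {
        "id": app_id,
        # Only offers from agents launched with --attributes=workload:<group>
        # will satisfy this constraint.
        "constraints": [["workload", "CLUSTER", group]],
    }

service = constrain("/realtime-api", "services")  # hypothetical app ids
batch = constrain("/nightly-etl", "batch")
print(json.dumps([service, batch], indent=2))
```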

  

The only real issue we ran into while running in AWS was sizing
our masters. Since I knew from the start that I would use my masters as
dedicated nodes, I started with m3.medium, which ended up being way too small,
and we would see noisy-neighbor issues: CPU steal was always high (~50%),
which would cause huge latency and timeouts between my masters, slaves, and
ZooKeeper. After replacing the m3.mediums with m4.2xl, this issue has since
gone away.

  

Let me know if you have any specifics.

  

--RB

  

> On Jan 10 2016, at 2:27 am, lwq Adolph <kenan3015@gmail.com> wrote:
>
> Hi everyone:
>  My future Mesos cluster will be at least 100 nodes, so optimizing
> Mesos is important. Could you share your experience running Mesos in a
> production environment? For example:
> 1. monitoring tools for a Mesos cluster
> 2. tuning of Mesos parameters
>
> Thanks very much
>
> --
> Thanks & Best Regards
> 卢文泉 | Adolph Lu
> TEL:+86 15651006559
> Linker Networks (http://www.linkernetworks.com/)

