Posted to user@mesos.apache.org by Kiril Menshikov <km...@gmail.com> on 2016/12/16 14:05:20 UTC

Mesos on AWS

​Hi,

Has anybody tried running Mesos on AWS instances? Can you give me any
recommendations?

I am developing an elastic Mesos cluster (scaling AWS instances on demand).
Currently I have 3 master instances. I run about 1000 tasks simultaneously.
I see delays and health check problems.

~400 tasks fit on one m4.10xlarge instance (160 GB RAM, 40 CPUs).

At the moment I have increased the timeouts in the ZooKeeper cluster. What
can I do to decrease the timeouts?

Also, how can I increase performance? The main bottleneck is that I have a
large number of tasks running simultaneously for about an hour, after which
I shut them down or restart them (depending on how well they perform).

-Kiril​

Re: Mesos on AWS

Posted by Kiril Menshikov <km...@gmail.com>.
Alex, unfortunately I lost my previous logs, but I plan to run a heavy performance test of the system. If I find something, I'll come back.

Actually, I still have the old setup and can run a separate test over the holidays.

Thanks,
-Kiril



Re: Mesos on AWS

Posted by Alex Rukletsov <al...@mesosphere.com>.
Kiril—

from what you described it does not sound like the problem is the Linux
distribution. It may be your AWS configuration. However, if a combination
of health checks and a heavily loaded agent leads to agent termination, I
would like to investigate this issue. Please come back (with logs!) if you
see the issue again.


Re: Mesos on AWS

Posted by Kiril Menshikov <km...@gmail.com>.
​Hey,

Sorry for the delayed response. I reinstalled my AWS infrastructure. Now I
have everything installed on Red Hat Linux; before, I used Amazon Linux.

I tested with a single master (m4.large). Everything works perfectly. I am
not sure whether the problem was Amazon Linux or my old configuration.

Thanks,
​-Kirils




-- 
Thanks,
-Kiril
Phone +37126409291
Riga, Latvia
Skype perimetr122

Re: Mesos on AWS

Posted by Guillermo Rodriguez <gu...@spritekin.com>.
Hi,
I run my Mesos cluster in AWS, with between 40 and 100 m4.2xlarge instances and between 200 and 1500 jobs at any time. Slaves run as spot instances.

So, the only time I get a TASK_LOST is when I lose a spot instance due to being outbid.

I guess you may also lose instances due to an AWS autoscaler scale-in procedure. For example, if it decides the cluster is underutilised, it can kill any instance in your cluster, not necessarily the least-used one. That's the reason we decided to develop our own customised autoscaler that detects and kills specific instances based on our own rules.
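
The rules used by that customised autoscaler are not described in this thread. As a purely hypothetical sketch of the idea, a scale-in step could rank agents by how many tasks they are running and pick idle or least-loaded agents first. The function name, the input mapping and the "least-loaded first" rule are all assumptions made for illustration.

```python
from typing import Dict, List


def pick_scale_in_victims(tasks_per_agent: Dict[str, int], count: int) -> List[str]:
    """Return up to `count` agent ids, least-loaded first.

    `tasks_per_agent` maps a (hypothetical) agent id to the number of tasks
    currently running on it. Ties are broken by agent id so the result is
    deterministic.
    """
    ranked = sorted(tasks_per_agent.items(), key=lambda item: (item[1], item[0]))
    return [agent_id for agent_id, _ in ranked[:count]]


# Hypothetical cluster state: agent-b is idle, agent-c is lightly loaded.
example = {"agent-a": 12, "agent-b": 0, "agent-c": 3}
victims = pick_scale_in_victims(example, 2)
print(victims)
```

A real autoscaler would then terminate exactly those instances instead of letting the cloud provider choose which instance to remove.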
  
So, are you using spot fleets or spot instances? Have you set up your scale-in procedures correctly?

Also, if you are running fine-grained tiny jobs (400 jobs in a 10xlarge means 0.1 CPUs and 400 MB RAM each), I recommend you avoid m4.10xlarge instances and run xlarge instances instead. Same price, and if you lose one you just lose 1/10th of your jobs.
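
The sizing argument above can be checked with a little arithmetic. The sketch below uses the m4.10xlarge figures quoted in this thread (40 CPUs, 160 GB RAM, about 400 tasks) and assumes that ten m4.xlarge instances replace one m4.10xlarge. The helper names are made up for illustration.

```python
def per_task_share(cpus: float, ram_mb: float, tasks: int) -> tuple:
    """Average CPU and RAM (in MB) available to each task on one instance."""
    return cpus / tasks, ram_mb / tasks


def tasks_lost_per_failure(total_tasks: int, instance_count: int) -> float:
    """Tasks lost when one instance dies, assuming tasks are spread evenly."""
    return total_tasks / instance_count


# One m4.10xlarge running ~400 tasks (160 GB taken as 160,000 MB).
cpu_share, ram_share = per_task_share(cpus=40, ram_mb=160_000, tasks=400)
print(cpu_share, ram_share)

# Same 400 tasks on one big instance versus ten smaller ones (assumed split).
lost_on_big = tasks_lost_per_failure(400, 1)
lost_on_small = tasks_lost_per_failure(400, 10)
print(lost_on_big, lost_on_small)
```

With ten smaller instances, a single instance failure takes out roughly a tenth of the tasks instead of all of them.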
  
 Luck!
  
  
  
  
  




Re: Mesos on AWS

Posted by haosdent <ha...@gmail.com>.
>  sometimes Mesos agent is launched but master doesn’t show them.
It sounds like the Mesos master could not connect to your agents. Would you
mind pasting your Mesos master log? Does it contain any messages showing
that the Mesos agents are disconnected?
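
Besides reading the master log, one way to spot agents that started but never show up on the master is to compare the hosts you launched with what the master reports. The sketch below is only an illustration: it assumes a JSON document shaped like the response of the master's `/slaves` HTTP endpoint, with a `slaves` list whose entries carry `hostname` and `active` fields. Please verify those field names against your Mesos version. The function name and sample addresses are hypothetical.

```python
from typing import Any, Dict, Iterable, Set


def missing_agents(slaves_response: Dict[str, Any],
                   expected_hostnames: Iterable[str]) -> Set[str]:
    """Return expected hostnames that the master does not list as active."""
    registered = {
        entry.get("hostname")
        for entry in slaves_response.get("slaves", [])
        if entry.get("active", False)
    }
    return set(expected_hostnames) - registered


# Hypothetical response: one active agent, one registered but inactive.
sample = {
    "slaves": [
        {"hostname": "10.0.0.11", "active": True},
        {"hostname": "10.0.0.12", "active": False},
    ]
}
missing = missing_agents(sample, ["10.0.0.11", "10.0.0.12", "10.0.0.13"])
print(sorted(missing))
```

Any hostname returned here is a candidate for a closer look at the agent and master logs.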



-- 
Best Regards,
Haosdent Huang

Re: Mesos on AWS

Posted by Kiril Menshikov <km...@gmail.com>.
I have my own framework. Sometimes I get a TASK_LOST status with the message slave lost during health check.

Also, I found that sometimes a Mesos agent is launched but the master doesn't show it. On the agent side I see that it found the master and connected. After an agent restart it starts working.

-Kiril




Re: Mesos on AWS

Posted by Zameer Manji <zm...@apache.org>.
Hey,

Could you elaborate on what you mean by "delays and health check problems"?
Are you using your own framework or an existing one? How are you launching
the tasks?

Could you share logs from Mesos that show timeouts to ZK?

For reference, I operate a large Mesos cluster and I have never encountered
problems when running 1k tasks concurrently so I think sharing data would
help everyone debug this problem.
