Posted to user@mesos.apache.org by Mike Michel <mi...@mmbash.de> on 2015/12/30 12:43:23 UTC

make slaves not getting tasks anymore

Hi,

 

I need to update slaves from time to time and am looking for a way to take them
out of the cluster without killing the running tasks. I need to wait until all
tasks are done, and during this time no new tasks should be started on this
slave. My first idea was to set a constraint "status:online" for every task I
start and then change the attribute of the slave to "offline", restarting the
slave process while the executors keep running their tasks. But it seems that
if you change the attributes of a slave, it cannot rejoin the cluster without
an rm -rf /tmp first, which kills all tasks.
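
For illustration, roughly what I had in mind (just a sketch: the app id, the
Marathon address, and the command are made up, and I am assuming Marathon's
/v2/apps REST endpoint with a LIKE constraint plus slaves started with
--attributes=status:online):

# Sketch: launch a Marathon app that is only placed on slaves whose "status"
# attribute matches "online".
import json
import requests  # third-party; pip install requests

MARATHON = "http://marathon.example.com:8080"   # hypothetical address

app = {
    "id": "/my-service",                        # made-up app id
    "cmd": "python -m SimpleHTTPServer 8000",   # made-up workload
    "cpus": 0.1,
    "mem": 64,
    "instances": 2,
    "constraints": [["status", "LIKE", "online"]],
}

resp = requests.post(MARATHON + "/v2/apps", data=json.dumps(app),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()
print(resp.json())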

 

Also the maintenance mode seems not to be an option:

 

"When maintenance is triggered by the operator, all agents on the machine
are told to shutdown. These agents are subsequently removed from the master
which causes tasks to be updated as TASK_LOST. Any agents from machines in
maintenance are also prevented from registering with the master."



 

Is there another way?

 

 

Cheers

 

Mike


Re: make slaves not getting tasks anymore

Posted by Shuai Lin <li...@gmail.com>.
>
> I need to wait until all tasks are done and during this time no new tasks
> should be started on this slave


This is exactly what maintenance mode is designed for, but it requires the
cooperation of the framework. When the operator adds a maintenance schedule
for a slave, the Mesos master first sends "inverse offers" to all frameworks
that have tasks running on that slave, and the frameworks are expected to
move those tasks to other slaves.

A framework is free to ignore inverse offers, though; for example, I can't
find any code that handles them in the Marathon codebase.




> Also the maintenance mode seems not to be an option:

> When maintenance is triggered by the operator, all agents on the machine
> are told to shutdown


Be aware that maintenance is a two-phase process:

- The first phase is adding the maintenance schedule: the operator tells the
master "I will take slaveX down for maintenance in 1 hour, please ask the
frameworks to move their tasks to other slaves", as described above.
- The second phase is starting the maintenance: the operator tells the
master "I'm taking this slave down RIGHT NOW". The master then kills all
tasks on that slave and asks the mesos-slave process to exit, as described
in the paragraph you quoted in your original message.

In short, it mostly depends on the frameworks you use.
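
For reference, here is a rough sketch of the two operator calls against the
master's maintenance endpoints (the hostname/IP are made up, and the endpoint
paths and JSON layout follow the maintenance documentation, so double-check
them against your Mesos version):

# Sketch: schedule maintenance for one agent (phase 1), then later bring
# the machine down (phase 2). Uses the master's /master/maintenance/schedule
# and /master/machine/down operator endpoints (Mesos 0.25+ maintenance
# primitives).
import json
import time
import requests  # third-party; pip install requests

MASTER = "http://master.example.com:5050"        # hypothetical master
MACHINE = {"hostname": "slave1.example.com",     # made-up agent
           "ip": "10.0.0.11"}
HEADERS = {"Content-Type": "application/json"}

# Phase 1: announce a one-hour maintenance window starting in an hour.
start_ns = int((time.time() + 3600) * 1e9)
schedule = {
    "windows": [{
        "machine_ids": [MACHINE],
        "unavailability": {
            "start": {"nanoseconds": start_ns},
            "duration": {"nanoseconds": int(3600 * 1e9)},
        },
    }],
}
requests.post(MASTER + "/master/maintenance/schedule",
              data=json.dumps(schedule), headers=HEADERS).raise_for_status()

# Phase 2 (only when you really take it down): kills the tasks on the machine.
requests.post(MASTER + "/master/machine/down",
              data=json.dumps([MACHINE]), headers=HEADERS).raise_for_status()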



Re: make slaves not getting tasks anymore

Posted by Mike Michel <mi...@mmbash.de>.
Whitelist seems to be the best option right now. I will try that.

 

thanks

 

From: Jeremy Olexa [mailto:jolexa@spscommerce.com] 
Sent: Wednesday, 30 December 2015 17:22
To: user@mesos.apache.org
Subject: Re: make slaves not getting tasks anymore

 

Hi Mike,

 

Yes, there is another way besides the maintenance primitives, which aren't fully complete yet (IMO). If you don't want any more jobs scheduled on a host, you can remove that host from the whitelist on the masters. You might have to engineer this a bit for your setup, but this is what we do:

 

1) All slaves are discovered and explicitly added to the whitelist

2) On demand (by the operator), a node is REMOVED from the whitelist for some time; currently we add the node back after a timeout of 1 hour

3) Wait for jobs to finish on that node, or send SIGUSR1 to the mesos-slave process to force job termination

 

Of course, there is also satellite, which does all this for you :) https://github.com/twosigma/satellite/

 

Hope that helps,
-Jeremy





Re: make slaves not getting tasks anymore

Posted by Jeremy Olexa <jo...@spscommerce.com>.
Hi Mike,


Yes, there is another way besides the maintenance primitives, which aren't fully complete yet (IMO). If you don't want any more jobs scheduled on a host, you can remove that host from the whitelist on the masters. You might have to engineer this a bit for your setup, but this is what we do:


1) All slaves are discovered and explicitly added to the whitelist

2) On demand (by the operator), a node is REMOVED from the whitelist for some time; currently we add the node back after a timeout of 1 hour

3) Wait for jobs to finish on that node, or send SIGUSR1 to the mesos-slave process to force job termination
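
To make step 2 concrete, here is a rough sketch of the kind of script we run (paths and hostnames are made up; it assumes each master was started with --whitelist=file:///etc/mesos/whitelist, that the master picks up changes to that file on its own, and that you run this on every master host):

# Sketch: drain a node by removing it from the masters' whitelist file,
# then add it back after a timeout (1 hour here).
import sys
import time

WHITELIST = "/etc/mesos/whitelist"   # hypothetical path
TIMEOUT = 3600                       # seconds before the node is re-added

def set_whitelist(hosts):
    with open(WHITELIST, "w") as f:
        f.write("\n".join(sorted(hosts)) + "\n")

def drain(node):
    with open(WHITELIST) as f:
        hosts = {line.strip() for line in f if line.strip()}
    set_whitelist(hosts - {node})    # stop offering this node's resources
    time.sleep(TIMEOUT)              # wait for running jobs to finish
    set_whitelist(hosts | {node})    # put the node back in rotation

if __name__ == "__main__":
    drain(sys.argv[1])               # e.g. python drain.py slave1.example.com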


Of course, there is also satellite, which does all this for you :) https://github.com/twosigma/satellite/

Hope that helps,
-Jeremy



Re: make slaves not getting tasks anymore

Posted by Mike Michel <mi...@mmbash.de>.
I am using Marathon, and from Shuai Lin's answer it still seems that maintenance mode is not the right option for me. I don't want Marathon to move the tasks to another node (phase 1) without user action (restarting the task), and it should also not just kill the tasks (phase 2).

 

To be concrete: I need to update Docker and want to tell users that they need to restart their tasks so that they are moved to a node with the latest Docker version.

 

With MESOS-1739 my „first idea“ would work.

 

From: Klaus Ma [mailto:klaus1982.cn@gmail.com] 
Sent: Wednesday, 30 December 2015 13:24
To: user@mesos.apache.org
Subject: Re: make slaves not getting tasks anymore

 

Hi Mike,

 

Which framework are you using? How about the maintenance scheduling feature? My understanding is that the framework should not dispatch tasks to an agent scheduled for maintenance, so the operator can wait for all tasks to finish before taking any action.

 

For "When maintenance is triggered by the operator", it's used when there're some tasks took too long time to finish; so Operator can task action to shut them down.

 

As for restarting the agent with new attributes, there's a JIRA ticket (MESOS-1739) about that.




----

Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer 
Platform Symphony/DCOS Development & Support, STG, IBM GCG 
+86-10-8245 4084 | klaus1982.cn@gmail.com | http://k82.me

 



Re: make slaves not getting tasks anymore

Posted by Klaus Ma <kl...@gmail.com>.
Hi Mike,

Which framework are you using? How about the maintenance scheduling feature?
My understanding is that the framework should not dispatch tasks to an agent
scheduled for maintenance, so the operator can wait for all tasks to finish
before taking any action.

For "When maintenance is triggered by the operator", it's used when
there're some tasks took too long time to finish; so Operator can task
action to shut them down.

As for restarting the agent with new attributes, there's a JIRA ticket
(MESOS-1739) about that.

----
Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | klaus1982.cn@gmail.com | http://k82.me


Re: make slaves not getting tasks anymore

Posted by Dick Davies <di...@hellooperator.net>.
It sounds like you want to use checkpointing; that should keep the tasks
alive while you update the mesos-slave process itself.
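
Roughly, the framework has to opt in via FrameworkInfo.checkpoint, and the
slave has to run with checkpointing enabled and be restarted with
--recover=reconnect. A minimal sketch with the old Python bindings (the
framework name and master URL are made up; adapt it to whatever scheduler
library you actually use):

# Sketch: opt the framework in to slave checkpointing so its executors and
# tasks survive a mesos-slave restart.
from mesos.interface import mesos_pb2   # 0.2x-era Python bindings

framework = mesos_pb2.FrameworkInfo()
framework.user = ""                           # let Mesos pick the current user
framework.name = "my-checkpointed-framework"  # made-up name
framework.checkpoint = True                   # <-- the important bit
framework.failover_timeout = 3600.0           # time allowed to reconnect

# Then hand it to your scheduler driver as usual, e.g. (master URL made up):
# from mesos.native import MesosSchedulerDriver
# driver = MesosSchedulerDriver(MyScheduler(), framework, "zk://zk1:2181/mesos")
# driver.run()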
