Posted to user@mesos.apache.org by Neil Conway <ne...@gmail.com> on 2016/06/21 17:58:43 UTC

Improving support for partitioned tasks

Currently, Mesos implements a hardcoded policy for handling
partitioned agents and tasks:

* agents are deemed to be partitioned when they keep failing health
checks for long enough (~75 seconds by default)
* partitioned agents are removed from the cluster. Frameworks receive
TASK_LOST for all tasks running on the removed agent.
* when the agent reconnects, the master instructs it to shut down and
terminate all of its tasks.
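
To make the current behavior concrete, here is a rough Python sketch of
the three steps above. It is a toy model, not Mesos code: the names
Master, Agent, ping_timed_out and reconnect are made up for
illustration, and splitting the ~75 seconds into 5 missed pings of 15
seconds each is an assumption about how the default is derived.

```python
from dataclasses import dataclass

# Assumed values: 5 consecutive missed pings x 15 seconds = ~75 seconds.
PING_TIMEOUT_SECS = 15
MAX_PING_TIMEOUTS = 5


@dataclass
class Agent:
    agent_id: str
    tasks: list            # IDs of the tasks running on this agent
    missed_pings: int = 0  # consecutive failed health checks


class Master:
    """Hypothetical model of the hardcoded partition policy."""

    def __init__(self):
        self.agents = {}          # registered agents, by ID
        self.removed = set()      # IDs of agents removed as partitioned
        self.status_updates = []  # (task_id, state) sent to frameworks

    def register(self, agent):
        self.agents[agent.agent_id] = agent

    def ping_ok(self, agent_id):
        # A successful health check resets the failure counter.
        if agent_id in self.agents:
            self.agents[agent_id].missed_pings = 0

    def ping_timed_out(self, agent_id):
        agent = self.agents.get(agent_id)
        if agent is None:
            return
        agent.missed_pings += 1
        if agent.missed_pings >= MAX_PING_TIMEOUTS:
            self._remove(agent)

    def _remove(self, agent):
        # The agent is removed and every task on it is reported lost.
        del self.agents[agent.agent_id]
        self.removed.add(agent.agent_id)
        for task_id in agent.tasks:
            self.status_updates.append((task_id, "TASK_LOST"))

    def reconnect(self, agent_id):
        # A removed agent that comes back is told to shut down,
        # which also terminates all of its tasks.
        if agent_id in self.removed:
            return "SHUTDOWN"
        return "OK"
```

Note that frameworks have no hook anywhere in this flow: both the
timeout handling and the shutdown on reconnect are decided entirely by
the master.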

This is problematic because frameworks cannot override the policy:
framework authors would like to implement their own partition-handling
logic. To improve this situation, this design doc proposes changing how
the Mesos master handles partitions:

https://issues.apache.org/jira/browse/MESOS-5659

Feedback is very welcome! In particular, if you're working on a
framework that would like to implement custom partition-handling
logic, I'd be curious to hear a bit more about the framework behavior
you'd like to provide, and whether you can implement that behavior
using the functionality proposed in the design doc.

Thanks,
Neil