Posted to user@mesos.apache.org by Bernerd Schaefer <be...@soundcloud.com> on 2013/09/18 11:16:07 UTC

Reliable Task Launching

This is somewhat related to the existing thread about messaging
reliability, though specific to launching tasks.

I noticed that both Chronos[1] and Marathon[2] assume that after
"launchTasks" is called, the task will be launched. However, as best as I
can tell, Mesos makes no such guarantee -- or, rather, libprocess makes no
guarantee that a sent message will be received.

Am I correct in thinking that to reliably launch tasks, one should do
something like the scheduler's doReliableRegistration[3], i.e., ask the
driver to launch a task until a task update message is received for it?
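
To make the question concrete, here's roughly what I have in mind, as a
Scala sketch against the Java bindings (the retry interval and the
pending-launch bookkeeping are my own scaffolding, not anything the driver
provides):

    import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}
    import org.apache.mesos.SchedulerDriver
    import org.apache.mesos.Protos._
    import scala.collection.JavaConverters._

    // Remember every launch and repeat it until *some* status update for
    // that task arrives. This is exactly the part that runs into the
    // TASK_LOST behaviour described below.
    class RetryingLauncher(driver: SchedulerDriver) {
      private val pending = new ConcurrentHashMap[String, (OfferID, TaskInfo)]()
      private val timer = Executors.newSingleThreadScheduledExecutor()

      def launch(offerId: OfferID, task: TaskInfo): Unit = {
        pending.put(task.getTaskId.getValue, (offerId, task))
        driver.launchTasks(offerId, Seq(task).asJava)
      }

      // Call this from Scheduler.statusUpdate: any update acknowledges the launch.
      def acknowledge(status: TaskStatus): Unit = {
        pending.remove(status.getTaskId.getValue)
      }

      // Periodically re-send launches we never heard back about.
      timer.scheduleAtFixedRate(new Runnable {
        def run(): Unit = pending.asScala.foreach { case (_, (offerId, task)) =>
          driver.launchTasks(offerId, Seq(task).asJava)
        }
      }, 10, 10, TimeUnit.SECONDS)
    }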

One problem I see with this is that the master replies to repeated launch
task messages[4] with "TASK_LOST". There are some cases where this would
work effectively:

- Framework sends task launch message. Master is down and never receives
it. Master comes back up/fails over. Framework resends message, and
receives TASK_LOST.

At that point, the framework could re-queue the task to be launched against
a new offer. There are, however, other cases where this creates a race
condition:

1. Framework sends task launch message. Master begins to launch task.
Framework resends message. The framework will receive both TASK_LOST and
TASK_STARTING/RUNNING, in any order.

2. Framework sends task launch message. Master launches task, but dies
before notifying framework. Master recovers/fails over. Framework resends
message. It will receive both TASK_LOST and TASK_STARTING (or RUNNING), in
any order.

For case 1, I *think* this could be addressed if the master -- instead of
only checking the validity of the offer -- also checked if the task already
exists in the framework. If so, it could either ignore the request or send
a status update message with the current state.

I'm unsure about addressing case 2, since, as far as I understand, the task
list is lazily rebuilt from slave messages after a crash/fail-over.

Perhaps someone here with more experience with Mesos and its frameworks can
lend some insight. Is this even a problem in practice?

Thanks,

Bernerd
Engineer @ Soundcloud

[1]:
https://github.com/airbnb/chronos/blob/master/src/main/scala/com/airbnb/scheduler/mesos/MesosJobFramework.scala#L198-L203
[2]:
https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/marathon/MarathonScheduler.scala#L57-L59
[3]: https://github.com/apache/mesos/blob/master/src/sched/sched.cpp#L285
[4]: https://github.com/apache/mesos/blob/master/src/master/master.cpp#L902

Re: Reliable Task Launching

Posted by Bernerd <be...@soundcloud.com>.
> Aurora imposes a cap on how long it will tolerate a task in a 'transient' state. If Aurora doesn't hear back about a transient task after this delay, it internally marks it as LOST and uses its state machine to handle side-effects and follow-up actions.

Good to know! That's what I thought a framework would need to do.

> Hopefully you'll get to see this soon, as we're preparing to release our code very soon as part of our proposal for Apache incubation [1]!

I can't wait!

Bernerd

Re: Reliable Task Launching

Posted by Bill Farner <bi...@twitter.com>.
Aurora imposes a cap on how long it will tolerate a task in a 'transient'
state. If Aurora doesn't hear back about a transient task after this
delay, it internally marks it as LOST and uses its state machine to handle
side-effects and follow-up actions.
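
In rough terms the idea is something like the sketch below (an illustration
of the approach only, not Aurora's actual code; the deadline value and the
names are made up):

    import scala.collection.mutable

    object TransientTimeout {
      // Made-up cap on how long a task may stay 'transient'.
      val transientDeadlineMs = 5 * 60 * 1000L

      // taskId -> time the task entered a transient state (e.g. launch sent).
      val transientSince = mutable.Map.empty[String, Long]

      def onLaunchSent(taskId: String): Unit =
        transientSince(taskId) = System.currentTimeMillis()

      def onStatusUpdate(taskId: String): Unit = {
        transientSince.remove(taskId)  // heard back, no longer transient
      }

      // Run periodically; returns tasks to mark LOST internally so the
      // state machine can handle side-effects and follow-up actions.
      def expireTransients(): Seq[String] = {
        val now = System.currentTimeMillis()
        val expired = transientSince.collect {
          case (id, since) if now - since > transientDeadlineMs => id
        }.toSeq
        expired.foreach(transientSince.remove)
        expired
      }
    }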

Hopefully you'll get to see this soon, as we're preparing to release our
code very soon as part of our proposal for Apache incubation [1]!


-=Bill

[1]
http://apache-incubator-general.996316.n3.nabble.com/PROPOSAL-Aurora-for-Incubation-td36288.html


Re: Reliable Task Launching

Posted by Vinod Kone <vi...@gmail.com>.
Hey Bernerd,

All great questions and observations. Mesos currently doesn't guarantee
reliable delivery of messages. So the onus is on the frameworks to do the
retries. That said, as you correctly observed, there are race conditions
where a framework might get conflicting status updates when it retries.
I'll let someone from the Twitter framework (Aurora) team answer how they deal
with these issues.

From the Mesos side, we plan to implement a stateful scheduler driver which
would help answer some of these questions more easily. We are in the
process of roadmapping features for the next quarter, so if this is
something that is important to you please let us know and we will
prioritize accordingly. Maybe you can send in some patches too :)

Cheers,




Re: Reliable Task Launching

Posted by Benjamin Hindman <be...@gmail.com>.
Hey Bernerd,

There isn't great documentation on the state abstraction, but the code is
here: https://github.com/apache/mesos/tree/master/src/java/src/org/apache/mesos/state
(and it's included in the mesos.jar). The best documentation is really its
use in Chronos and Marathon.

Also, we're working on the logical equivalent of a StatefulSchedulerDriver
that does state management of your tasks for you (including reconciliation
with the master), so you'll still just call launchTasks. The goal is to
make writing framework schedulers really easy.

Ben.

P.S. If you're interested, I'm happy to elaborate on why we pushed state to
the framework schedulers.


Re: Reliable Task Launching

Posted by Bernerd Schaefer <be...@soundcloud.com>.
Thanks, Tobi

In Marathon's case there are a couple of things in place to ensure reliable
> launching.
> Marathon tracks tasks via the Mesos state abstraction (backed by ZK) so it
> can recover after getting disconnected.
> It only considers a task running after it received a TASK_RUNNING from
> Mesos:
>
> https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/marathon/MarathonScheduler.scala#L107-L109
>

Is there more information about the state abstraction? Also, unless I'm
missing something, the flow is: mark task as starting, call launchTasks,
wait for confirmation of running. So if the launchTasks message is not
delivered, the task will stay in the "starting" state, even though Mesos
doesn't know about it. Is there another component that re-queues tasks that
seem to be stuck in starting?

Bernerd

Re: Reliable Task Launching

Posted by Tobias Knaup <to...@knaup.me>.
Hi Bernerd,

In Marathon's case there are a couple of things in place to ensure reliable
launching.
Marathon tracks tasks via the Mesos state abstraction (backed by ZK) so it
can recover after getting disconnected.
It only considers a task running after it receives a TASK_RUNNING from
Mesos:
https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/marathon/MarathonScheduler.scala#L107-L109

To avoid the TASK_LOST + TASK_RUNNING problem you described, Marathon adds
a timestamp to all task IDs:
https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/marathon/tasks/TaskIDUtil.scala#L16
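
In spirit it's nothing more than the following (a simplified sketch, not the
actual TaskIDUtil code):

    object TaskIds {
      // Every launch attempt gets a unique ID, so a TASK_LOST for a stale
      // attempt can't be confused with a TASK_RUNNING for the current one.
      def taskId(appName: String): String =
        appName + "_" + System.currentTimeMillis()

      def appName(taskId: String): String =
        taskId.substring(0, taskId.lastIndexOf('_'))
    }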

Hope this helps!

Tobi





Re: Reliable Task Launching

Posted by Bernerd Schaefer <be...@soundcloud.com>.
> 1) When my Framework has need to launch a job it calls
> Scheduler.launchTask. But at this point it does not actually run any jobs.
> If the launchTask returns successfully my Framework assumes at that point
> that it has "control" over the resources and that the executor is ready to
> run jobs.
>
> 2) I then pass the mesos slave/executor details to my internal job
> framework and when ready I use Scheduler.sendFrameworkMessage() to actually
> run my job after my internal scheduler does the setup work.
>
> 3) If step #2 fails for any reason, my Framework assumes the
> slave/executor is no longer available and submits another request and waits
> for Framework to find another match.
>

Given that neither launchTask nor sendFrameworkMessage offers reliable
delivery, how do you detect failure? Timeouts? Something else?

Bernerd
Engineer @ Soundcloud

Re: Reliable Task Launching

Posted by Sam Taha <ta...@gmail.com>.
For JobServer integration with Mesos, I take a two-phase approach to
launching jobs on the slaves:

1) When my Framework needs to launch a job, it calls Scheduler.launchTask.
But at this point it does not actually run any jobs. If launchTask returns
successfully, my Framework assumes at that point that it has "control" over
the resources and that the executor is ready to run jobs.

2) I then pass the Mesos slave/executor details to my internal job framework,
and when ready I use Scheduler.sendFrameworkMessage() to actually run my job
after my internal scheduler does the setup work (sketched below).

3) If step #2 fails for any reason, my Framework assumes the slave/executor
is no longer available, submits another request, and waits for the Framework
to find another match.

After the job completes, my Framework detects this and can potentially
recycle the executor and use it for another job if resources are a match
for the next job.
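
Step #2 boils down to something like the sketch below (the names and the
message payload format are simplified placeholders, not the real JobServer
code):

    import org.apache.mesos.SchedulerDriver
    import org.apache.mesos.Protos.{ExecutorID, SlaveID}

    object JobLauncher {
      // Once the internal scheduler has finished its setup work, tell the
      // executor claimed in step #1 to actually run the job.
      def runJob(driver: SchedulerDriver,
                 executorId: ExecutorID,
                 slaveId: SlaveID,
                 jobSpec: String): Unit = {
        val payload = ("RUN " + jobSpec).getBytes("UTF-8")
        driver.sendFrameworkMessage(executorId, slaveId, payload)
      }
    }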

Thanks,
Sam Taha

http://www.grandlogic.com

