Posted to dev@myriad.apache.org by Swapnil Daingade <sw...@gmail.com> on 2015/07/31 10:07:48 UTC

Help with Task failure and executor failure

Hi All,

I am looking to verify if my understanding of Task failures and executor
failures in Mesos is correct.

I am assuming the following

* Mesos trusts a custom executor to report task status.
  If a task completes or fails but the executor does not call
  ExecutorDriver.sendStatusUpdate() with TASK_FINISHED/TASK_FAILED, Mesos
  will assume that the task is still running.

* Mesos does not use the task status sent via
  ExecutorDriver.sendStatusUpdate() as a heartbeat.
  For example, in MyriadExecutor we report the NMTask status as TASK_RUNNING
  after launching the NM. We report TASK_FINISHED/TASK_FAILED only after the
  process has terminated. There is no call to
  ExecutorDriver.sendStatusUpdate() in between. I am assuming that this does
  not cause Mesos to think that the task has been lost after some timeout
  interval.

* If an executor dies, Mesos considers all tasks launched by that executor
  lost. The scheduler will receive one call to executorLost() and one
  statusUpdate() call with state TASK_LOST for every task launched by that
  executor.
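
A minimal sketch of the reporting pattern described in the second
assumption, for illustration only. StatusSink, SketchTaskState, and
NmTaskReporter are hypothetical stand-ins so the example compiles without
the Mesos jars; a real executor would call
ExecutorDriver.sendStatusUpdate() instead. Note that the Mesos state for
successful completion is named TASK_FINISHED.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical stand-ins, NOT the Mesos API. The real call is
// ExecutorDriver.sendStatusUpdate() with a Mesos task state.
enum SketchTaskState { TASK_RUNNING, TASK_FINISHED, TASK_FAILED }

interface StatusSink {
    void sendStatusUpdate(String taskId, SketchTaskState state);
}

final class NmTaskReporter {
    // Reports TASK_RUNNING once, blocks until the process exits, then
    // reports exactly one terminal state. Nothing is sent in between.
    static void runAndReport(StatusSink driver, String taskId,
                             Callable<Integer> process) {
        driver.sendStatusUpdate(taskId, SketchTaskState.TASK_RUNNING);
        SketchTaskState terminal;
        try {
            int exitCode = process.call();
            terminal = (exitCode == 0)
                    ? SketchTaskState.TASK_FINISHED
                    : SketchTaskState.TASK_FAILED;
        } catch (Exception e) {
            // Assumption: any exception while running counts as a failure.
            terminal = SketchTaskState.TASK_FAILED;
        }
        driver.sendStatusUpdate(taskId, terminal);
    }

    public static void main(String[] args) {
        List<SketchTaskState> seen = new ArrayList<>();
        // The lambda stands in for a process that exits with code 0.
        runAndReport((id, state) -> seen.add(state), "nm-1", () -> 0);
        System.out.println(seen); // prints [TASK_RUNNING, TASK_FINISHED]
    }
}
```

If the terminal update is never sent (for example, because the executor
skips the second call), the only state Mesos has seen is TASK_RUNNING,
which is the situation the first assumption describes.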

Please let me know if any of my assumptions are incorrect.

Regards
Swapnil

Re: Help with Task failure and executor failure

Posted by Swapnil Daingade <sw...@gmail.com>.
Thank you, Adam.
This is really helpful in validating some of the choices made in the Myriad
HA design, which I'll submit for review shortly.

Regards
Swapnil



On Fri, Jul 31, 2015 at 3:07 AM, Adam Bordelon <ad...@mesosphere.io> wrote:

> 1) Mesos trusts custom executor to report task status.
> Correct, as long as the executor is still running.
>
> 2) Mesos does not use task status as a heartbeat.
> Correct. A task could start RUNNING, then provide no other status updates
> for months and Mesos will assume it is still running as long as there were
> no terminal status updates sent, the executor is still running, and the
> slave is still connected.
> However, you can optionally add health checks (HTTP or command) to your
> tasks, and Mesos will report the health back in periodic status updates.
> But it's up to your framework to determine how to interpret an "unhealthy"
> state.
>
> 3) If an executor dies, Mesos thinks all tasks launched by that executor
> are lost.
> Correct. However, there is a long-standing issue (MESOS-313):
> executorLost is never actually passed on to the scheduler. You will get a
> TASK_LOST for each task, though.

Re: Help with Task failure and executor failure

Posted by Adam Bordelon <ad...@mesosphere.io>.
1) Mesos trusts custom executor to report task status.
Correct, as long as the executor is still running.

2) Mesos does not use task status as a heartbeat.
Correct. A task could start RUNNING, then provide no other status updates
for months and Mesos will assume it is still running as long as there were
no terminal status updates sent, the executor is still running, and the
slave is still connected.
However, you can optionally add health checks (HTTP or command) to your
tasks, and Mesos will report the health back in periodic status updates.
But it's up to your framework to determine how to interpret an "unhealthy"
state.

3) If an executor dies, Mesos thinks all tasks launched by that executor
are lost.
Correct. However, there is a long-standing issue (MESOS-313):
executorLost is never actually passed on to the scheduler. You will get a
TASK_LOST for each task, though.
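
Given MESOS-313, one option is to drive recovery from the per-task
TASK_LOST updates rather than from executorLost(). The following is a rough
sketch of that bookkeeping, not Myriad's actual implementation.
LostTaskTracker and ReportedState are hypothetical names, and queuing only
TASK_LOST tasks for relaunch is an assumed policy.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for the task states a scheduler sees; NOT the
// Mesos API type.
enum ReportedState { TASK_RUNNING, TASK_FINISHED, TASK_FAILED, TASK_LOST }

final class LostTaskTracker {
    private final Set<String> live = new LinkedHashSet<>();
    private final Set<String> toRelaunch = new LinkedHashSet<>();

    // Plays the role of Scheduler.statusUpdate(): called once per update.
    void statusUpdate(String taskId, ReportedState state) {
        switch (state) {
            case TASK_RUNNING:
                live.add(taskId);
                toRelaunch.remove(taskId);
                break;
            case TASK_LOST:
                // Assumed policy: a lost task is queued for relaunch.
                live.remove(taskId);
                toRelaunch.add(taskId);
                break;
            default:
                // TASK_FINISHED / TASK_FAILED: terminal, no longer live.
                live.remove(taskId);
                break;
        }
    }

    // Deliberately a no-op: per MESOS-313 this callback may never arrive,
    // so no recovery logic depends on it in this sketch.
    void executorLost(String executorId) { }

    Set<String> liveTasks() { return Collections.unmodifiableSet(live); }

    Set<String> tasksToRelaunch() {
        return Collections.unmodifiableSet(toRelaunch);
    }
}
```

With this shape, a dead executor that had launched N tasks shows up as N
separate statusUpdate() calls, each moving one task into the relaunch set.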

On Fri, Jul 31, 2015 at 1:07 AM, Swapnil Daingade <
swapnil.daingade@gmail.com> wrote:

> Hi All,
>
> I am looking to verify if my understanding of Task failures and executor
> failures in Mesos is correct.
>
> I am assuming the following
>
> * Mesos trusts a custom executor to report task status.
>   If a task completes or fails but the executor does not call
>   ExecutorDriver.sendStatusUpdate() with TASK_FINISHED/TASK_FAILED, Mesos
>   will assume that the task is still running.
>
> * Mesos does not use the task status sent via
>   ExecutorDriver.sendStatusUpdate() as a heartbeat.
>   For example, in MyriadExecutor we report the NMTask status as
>   TASK_RUNNING after launching the NM. We report TASK_FINISHED/TASK_FAILED
>   only after the process has terminated. There is no call to
>   ExecutorDriver.sendStatusUpdate() in between. I am assuming that this
>   does not cause Mesos to think that the task has been lost after some
>   timeout interval.
>
> * If an executor dies, Mesos considers all tasks launched by that executor
>   lost. The scheduler will receive one call to executorLost() and one
>   statusUpdate() call with state TASK_LOST for every task launched by that
>   executor.
>
> Please let me know if any of my assumptions are incorrect.
>
> Regards
> Swapnil
>