Posted to user@mesos.apache.org by Itamar Ostricher <it...@yowza3d.com> on 2015/01/21 10:22:05 UTC

Trying to debug an issue in mesos task tracking

I'm using a custom internal framework, loosely based on MesosSubmit.
The phenomenon I'm seeing is something like this:
1. Task X is assigned to slave S.
2. I know this task should run for ~10 minutes.
3. On the master dashboard, I see that task X is in the "Running" state for
several *hours*.
4. I SSH into slave S, and see that task X is *not* running. According to
the local logs on that slave, task X finished a long time ago, and seemed
to finish OK.
5. According to the scheduler logs, it never got any update from task X
after the Staging->Running update.

The phenomenon occurs pretty often, but it's not consistent or
deterministic.

I'd appreciate your input on how to go about debugging it, and/or implement
a workaround to avoid wasted resources.

I'm pretty sure the executor on the slave sends the TASK_FINISHED status
update (how can I verify that beyond my own logging?).
I'm pretty sure the scheduler never receives that update (again, how can I
verify that beyond my own logging?).
I have no idea if the master got the update and passed it through (how can
I check that?).
My scheduler and executor are written in Python.

As for a workaround - setting a timeout on a task should do the trick. I
did not see any timeout field in the TaskInfo message. Does Mesos support
the concept of per-task timeouts? Or should I implement my own task
tracking and timeout mechanism in the scheduler?
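
If it comes to that, the scheduler-side tracking I have in mind would look
roughly like this minimal sketch (assuming the mesos.interface Python
bindings; the deadline value and helper names are only illustrative):

    import threading
    import time

    from mesos.interface import Scheduler, mesos_pb2


    class TimeoutScheduler(Scheduler):
        """Kill tasks that stay non-terminal past a deadline (sketch only)."""

        TASK_DEADLINE_SECS = 15 * 60  # illustrative value

        def __init__(self):
            self._lock = threading.Lock()
            self._launch_times = {}  # task id (str) -> launch timestamp

        def record_launch(self, task_id):
            # Call this right after driver.launchTasks() for every launched task.
            with self._lock:
                self._launch_times[task_id] = time.time()

        def statusUpdate(self, driver, update):
            # A terminal update means we no longer need to watch the task.
            if update.state in (mesos_pb2.TASK_FINISHED, mesos_pb2.TASK_FAILED,
                                mesos_pb2.TASK_KILLED, mesos_pb2.TASK_LOST):
                with self._lock:
                    self._launch_times.pop(update.task_id.value, None)

        def reap_overdue_tasks(self, driver):
            # Call periodically (e.g. from a timer thread in the scheduler).
            now = time.time()
            with self._lock:
                overdue = [tid for tid, started in self._launch_times.items()
                           if now - started > self.TASK_DEADLINE_SECS]
            for tid in overdue:
                driver.killTask(mesos_pb2.TaskID(value=tid))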

Re: Trying to debug an issue in mesos task tracking

Posted by Sharma Podila <sp...@netflix.com>.
I deal with Java programs running in my executor that spawn various
"service/daemon threads". So, when I know the task is complete, I tend to
explicitly send the TASK_FINISHED status update and then call System.exit()
(with a short sleep in between so Mesos has time to deliver the update),
instead of waiting for all threads to exit naturally.

Of course, this may not apply to your situation, but, just in case...
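
In a Python executor the same pattern would look roughly like this sketch
(the helper name, the one-second sleep, and the exit call are illustrative
only):

    import os
    import time

    from mesos.interface import mesos_pb2


    def finish_and_exit(driver, task):
        # Report completion explicitly rather than waiting for worker threads.
        status = mesos_pb2.TaskStatus()
        status.task_id.value = task.task_id.value
        status.state = mesos_pb2.TASK_FINISHED
        driver.sendStatusUpdate(status)

        # Give the slave a moment to pick up the update, then terminate the
        # executor process without waiting for lingering threads.
        time.sleep(1)
        os._exit(0)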


On Mon, Jan 26, 2015 at 4:43 AM, Itamar Ostricher <it...@yowza3d.com>
wrote:


Re: Trying to debug an issue in mesos task tracking

Posted by Itamar Ostricher <it...@yowza3d.com>.
Thanks Alex.
I agree that it looks like it's not Mesos-related. It's probably some
deadlock.

On Mon, Jan 26, 2015 at 1:31 PM, Alex Rukletsov <al...@mesosphere.io> wrote:


Re: Trying to debug an issue in mesos task tracking

Posted by Alex Rukletsov <al...@mesosphere.io>.
Itamar,

you are right, the Mesos executor and containerizer cannot distinguish
between "busy" and "stuck" processes. However, since you use your own
custom executor, you may want to implement some sort of health check; what
it should look like depends on what your task processes are doing.

There are hundreds of reasons why an OS process may "get stuck"; it
doesn't look like it's Mesos-related in this case.
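
As a rough illustration only (it assumes the executor starts each task as a
subprocess.Popen that writes a log file, and the names and thresholds are
invented), such a check could be a small watchdog thread inside the
executor:

    import os
    import threading
    import time


    def watch_task_process(proc, log_path, stall_secs=600, poll_secs=30):
        # Kill the worker process if its log file stops growing for too long;
        # the executor then observes the non-zero exit and reports TASK_FAILED.
        def _watch():
            last_size = -1
            last_change = time.time()
            while proc.poll() is None:  # worker still running
                try:
                    size = os.path.getsize(log_path)
                except OSError:
                    size = -1
                if size != last_size:
                    last_size = size
                    last_change = time.time()
                elif time.time() - last_change > stall_secs:
                    proc.kill()  # treat the task as stuck
                    return
                time.sleep(poll_secs)

        watcher = threading.Thread(target=_watch)
        watcher.daemon = True
        watcher.start()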

On Sat, Jan 24, 2015 at 9:17 PM, Itamar Ostricher <it...@yowza3d.com> wrote:

Re: Trying to debug an issue in mesos task tracking

Posted by Itamar Ostricher <it...@yowza3d.com>.
Alex, Sharma, thanks for your input!

Trying to recreate the issue on a small cluster over the last few days, I
was not able to observe a scenario in which I could be sure that my executor
sent the TASK_FINISHED update but the scheduler did not receive it.
I did observe, multiple times, a scenario in which a task seemed to be
"stuck" in the TASK_RUNNING state, but when I SSH'ed into the slave running
the task, I always saw that the process related to that task was still
running (by grepping `ps aux`). Most of the time the process seemed to have
done the work (judging by the logs produced by that PID) but was "stuck"
without exiting cleanly. Sometimes the process seemed to have done no work
at all (an empty log file for that PID). Every time, as soon as I killed the
PID, a TASK_FAILED update was sent and received successfully.

So it seems that the problem is in the processes spawned by my executor, but
I don't fully understand why this happens.
Any ideas why a process would do some of the work (either 1%, just creating
a log file, or 99%, doing everything except exiting) and then "get stuck"?
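
For what it's worth, when a worker is itself a Python process, something
like the sketch below (installed once at worker startup) makes it possible
to see where a stuck process is blocked by sending it SIGUSR1; the signal
choice is arbitrary:

    import signal
    import sys
    import threading
    import traceback


    def install_stack_dumper():
        # On SIGUSR1, write every thread's current stack to stderr so a hung
        # process can be inspected with `kill -USR1 <pid>`.
        def _dump(signum, frame):
            names = dict((t.ident, t.name) for t in threading.enumerate())
            for ident, stack in sys._current_frames().items():
                sys.stderr.write("--- thread %s (%s) ---\n"
                                 % (ident, names.get(ident, "?")))
                traceback.print_stack(stack, file=sys.stderr)
            sys.stderr.flush()

        signal.signal(signal.SIGUSR1, _dump)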

On Fri, Jan 23, 2015 at 1:01 PM, Alex Rukletsov <al...@mesosphere.io> wrote:


Re: Trying to debug an issue in mesos task tracking

Posted by Alex Rukletsov <al...@mesosphere.io>.
Itamar,

beyond checking the master and slave logs, could you please verify that your
executor does send the TASK_FINISHED update? You may want to add some
logging and then check the executor log. Mesos guarantees the delivery of
status updates, so I suspect the problem is on the executor's side.
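
For example, something along these lines (a sketch against the
mesos.interface Python API; the logger and messages are just one way to do
it) would show in the executor log exactly when each update is handed to
the driver:

    import logging
    import threading

    from mesos.interface import Executor, mesos_pb2

    log = logging.getLogger("my-executor")


    class LoggingExecutor(Executor):

        def _send_update(self, driver, task, state):
            status = mesos_pb2.TaskStatus()
            status.task_id.value = task.task_id.value
            status.state = state
            log.info("sending %s for task %s",
                     mesos_pb2.TaskState.Name(state), task.task_id.value)
            driver.sendStatusUpdate(status)
            log.info("sendStatusUpdate returned for task %s",
                     task.task_id.value)

        def launchTask(self, driver, task):
            def run():
                self._send_update(driver, task, mesos_pb2.TASK_RUNNING)
                # ... run the actual task work here ...
                self._send_update(driver, task, mesos_pb2.TASK_FINISHED)

            worker = threading.Thread(target=run)
            worker.start()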

On Wed, Jan 21, 2015 at 6:58 PM, Sharma Podila <sp...@netflix.com> wrote:

Re: Trying to debug an issue in mesos task tracking

Posted by Sharma Podila <sp...@netflix.com>.
Have you checked the mesos-slave and mesos-master logs for that task id?
There should be log lines in there for task state updates, including
FINISHED.
There are specific cases where the task status is not reliably delivered to
your scheduler (mesos-master restarts, leader election changes, etc.).
There is task reconciliation support in Mesos, and a periodic call to
reconcile tasks from the scheduler can be helpful; there are also newer
enhancements coming to task reconciliation. In the meantime, there are
other strategies such as what I use, which is periodic heartbeats from my
custom executor to my scheduler (out of band). Timeouts on task runtimes
are similar to heartbeats, except that you need a priori knowledge of every
task's runtime.

Task runtime limits are not supported natively, as far as I know. Your
executor can implement them, and that may be one simple way to do it. It
could also be a good way to implement something like the shell's rlimits in
general.
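
A periodic explicit reconcile from the scheduler could look roughly like
this sketch (assuming the Python bindings; the interval and the
get_active_task_ids callable are hypothetical):

    import threading

    from mesos.interface import mesos_pb2


    def start_periodic_reconciliation(driver, get_active_task_ids,
                                      interval_secs=300):
        # Every few minutes, ask the master for the latest state of the tasks
        # the scheduler still believes are active; the answers arrive through
        # the normal statusUpdate() callback.
        def reconcile():
            statuses = []
            for tid in get_active_task_ids():
                status = mesos_pb2.TaskStatus()
                status.task_id.value = tid
                # state is a required protobuf field; the master ignores it
                # for explicit reconciliation.
                status.state = mesos_pb2.TASK_RUNNING
                statuses.append(status)
            driver.reconcileTasks(statuses)

            timer = threading.Timer(interval_secs, reconcile)
            timer.daemon = True
            timer.start()

        reconcile()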



On Wed, Jan 21, 2015 at 1:22 AM, Itamar Ostricher <it...@yowza3d.com>
wrote:
