Posted to dev@mesos.apache.org by Olivier Sallou <ol...@irisa.fr> on 2014/10/17 17:24:07 UTC

how to debug task lost in custom scheduler?

Hi,
I have installed Mesos in a single-host master/slave configuration (for
development/testing).

Mesos works fine with the frameworks I tested (Aurora...).

I am trying to create my own scheduler/executor in Python, based on the
example shipped with the sources, but I cannot get my task executed.

The executor is never run (I added debug logging to a file to check,
and no file is created), but I see no error in the master logs (console)
or the slave logs.

In master I can see:

I1017 16:50:30.601210 25794 master.cpp:3559] Sending 1 offers to framework 20141017-141022-16777343-5050-25774-0047
I1017 16:50:30.608912 25789 master.cpp:2169] Processing reply for offers: [ 20141017-141022-16777343-5050-25774-97 ] on slave 20141017-141022-16777343-5050-25774-0 at slave(1)@127.0.0.1:5051 (localhost) for framework 20141017-141022-16777343-5050-25774-0047
I1017 16:50:30.609207 25789 hierarchical_allocator_process.hpp:563] Recovered cpus(*):8; mem(*):6900; disk(*):215925; ports(*):[31000-32000] (total allocatable: cpus(*):8; mem(*):6900; disk(*):215925; ports(*):[31000-32000]) on slave 20141017-141022-16777343-5050-25774-0 from framework 20141017-141022-16777343-5050-25774-0047

My reply to the offer is received, but my scheduler then receives a
TASK_LOST status update.

I do not see how to debug this. I see no information about why my task is
lost (there is enough cpu/mem: I ask for 2 cpus and 2024 mem), and it
seems that it is rejected at the master level.
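
The "enough cpu/mem" claim can be checked mechanically. Below is a minimal
Python sketch, assuming the resource string format shown in the master log
above. The helper names parse_scalars and fits are hypothetical and are not
part of any Mesos API; range resources such as ports are deliberately
ignored.

```python
import re

def parse_scalars(resources):
    """Extract scalar resources like 'cpus(*):8' from a log-style string.

    Range resources such as 'ports(*):[31000-32000]' are skipped because
    their value does not start with a digit.
    """
    pairs = re.findall(r"(\w+)\(\*\):(\d+(?:\.\d+)?)", resources)
    return {name: float(value) for name, value in pairs}

def fits(requested, offered):
    """Return True if every requested scalar is covered by the offer."""
    return all(offered.get(name, 0.0) >= amount
               for name, amount in requested.items())

# Resource string copied from the master log in this message.
offer = parse_scalars(
    "cpus(*):8; mem(*):6900; disk(*):215925; ports(*):[31000-32000]")

# The request described in this message: 2 cpus and 2024 mem.
request = {"cpus": 2, "mem": 2024}

print(fits(request, offer))
```

With the figures from this message the request fits inside the offer, which
suggests the TASK_LOST has another cause than a shortage of resources.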

Any hint on how to analyse this?

Thanks

-- 
gpg key id: 4096R/326D8438  (keyring.debian.org)
Key fingerprint = 5FB4 6F83 D3B9 5204 6335  D26D 78DC 68DB 326D 8438



Re: how to debug task lost in custom scheduler?

Posted by Olivier Sallou <ol...@irisa.fr>.
On 10/17/2014 07:31 PM, Vinod Kone wrote:
> Can you grep for TASK_LOST in master and slave logs and paste the output
> here?
I do not see any TASK_LOST in any master/slave log, which is one of the
reasons I do not understand what is happening.

I only found the console log; I do not see any log file.

For information, Mesos is not installed system-wide but built locally from
source, and I run it from the build directory.
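
When only console output is available, one option is to capture it to a
file and filter it. The sketch below is a rough Python illustration,
assuming the glog-style line prefix visible in the master output in this
thread. The pattern, the helper name interesting, and the second sample
line are assumptions made up for this example, not real Mesos output. (As
far as I know, the master and slave also accept a --log_dir flag that makes
them write log files; please check the flags of your build.)

```python
import re

# glog-style prefix: severity letter, mmdd, time, thread id, file:line]
GLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>[\d:.]+)\s+(?P<tid>\d+) "
    r"(?P<src>[\w.]+:\d+)\] (?P<msg>.*)$"
)

def interesting(lines, needle="TASK_LOST"):
    """Yield (severity, source, message) for warning/error/fatal lines
    and for any line whose message mentions `needle`."""
    for line in lines:
        m = GLOG_LINE.match(line)
        if m and (m.group("sev") in "WEF" or needle in m.group("msg")):
            yield m.group("sev"), m.group("src"), m.group("msg")

sample = [
    # Copied from the master output quoted in this thread.
    "I1017 16:50:30.601210 25794 master.cpp:3559] Sending 1 offers to "
    "framework 20141017-141022-16777343-5050-25774-0047",
    # Invented line for this sketch only; not taken from a real log.
    "W1017 16:50:30.609000 25789 example.cpp:0] made-up warning for "
    "illustration",
]

for sev, src, msg in interesting(sample):
    print(sev, src, msg)
```

Filtering on severity as well as on the TASK_LOST string matters here,
because a rejection may be logged with a message that never contains the
state name.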

-- 
Olivier Sallou
IRISA / University of Rennes 1
Campus de Beaulieu, 35000 RENNES - FRANCE
Tel: 02.99.84.71.95

gpg key id: 4096R/326D8438  (keyring.debian.org)
Key fingerprint = 5FB4 6F83 D3B9 5204 6335  D26D 78DC 68DB 326D 8438


Re: how to debug task lost in custom scheduler?

Posted by Olivier Sallou <ol...@irisa.fr>.
On 10/20/2014 05:20 PM, Alex Rukletsov wrote:
> It looks like you try to set both command and executor. This is not
> allowed, since setting a command implies using the CommandExecutor aka
> mesos-executor. If your task is a command, do not specify the executor in
> your TaskInfo: mesos will do it for you. See
> https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto line
> 579.
>
> Btw, you should observe something like "Task <id> should have either
> CommandInfo or ExecutorInfo set but not both" in your logs.
OK, thanks, I got it working (at least I can see my job now).

The per-language API documentation is lacking.  :-(

Thanks for your help

Olivier



Re: how to debug task lost in custom scheduler?

Posted by Alex Rukletsov <al...@mesosphere.io>.
It looks like you are trying to set both command and executor. This is not
allowed, since setting a command implies using the CommandExecutor, aka
mesos-executor. If your task is a command, do not specify the executor in
your TaskInfo: Mesos will do it for you. See
https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto line
579.

Btw, you should observe something like "Task <id> should have either
CommandInfo or ExecutorInfo set but not both" in your logs.
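
To make the rule concrete, here is a minimal Python sketch. It uses plain
dicts as stand-ins for the TaskInfo message so that it runs without the
Mesos bindings; check_task is a hypothetical helper, not a Mesos function,
and it only mirrors the "both set" case described above. The field values
are the ones from the TaskInfo dump posted in this thread.

```python
def check_task(task):
    """Return an error string if the task sets both command and executor,
    else None. `task` is a plain dict standing in for a TaskInfo."""
    if "command" in task and "executor" in task:
        return ("Task %s should have either CommandInfo or ExecutorInfo "
                "set but not both" % task["task_id"])
    return None

# Stand-in for the custom executor from the posted TaskInfo
# (the executor path is elided in the original and kept as-is).
executor = {
    "executor_id": "default",
    "command": {"value": "....../test-executor"},
    "name": "Test Executor (Python)",
    "source": "python_test",
}

# What was submitted: both fields set, so it is rejected.
rejected = {"task_id": "t1", "executor": executor,
            "command": {"value": "ls -l"}}

# Option 1: a plain command task; no executor is specified.
# With the real bindings this corresponds to: task.command.value = "ls -l"
command_task = {"task_id": "t1", "command": {"value": "ls -l"}}

# Option 2: a custom executor task; no task-level command.
executor_task = {"task_id": "t1", "executor": executor}

for candidate in (rejected, command_task, executor_task):
    print(check_task(candidate))
```

Note that the executor's own command (how to start the executor) is fine;
it is only the task-level command that conflicts with a task-level executor.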

On Mon, Oct 20, 2014 at 5:13 PM, Olivier Sallou <ol...@irisa.fr>
wrote:

> So I wonder:
>
> 1) why the error is silent on master side
>
> 2) how do I set the command to execute in the TaskInfo object ?

Re: how to debug task lost in custom scheduler?

Posted by Olivier Sallou <ol...@irisa.fr>.
On 10/20/2014 08:11 AM, Olivier Sallou wrote:
> On 10/18/2014 12:55 PM, Alex Rukletsov wrote:
>> Hi Oliver,
>>
>> you can get a TASK_LOST if import directives in your executor fail. Do you
>> have mesos python eggs installed or available through PYTHONPATH? Could you
>> please also paste the output of stderr and stdout of the lost task (you can
>> access them via mesos webUI → sandbox)?
> I do not see the task at all on webUI. Python eggs are available from
> PYTHONPATH. My eggs are in MESOS_BUILD_DIR.
> If I execute directly my executor, I have no "python" error, only a
> MISSING SLAVE ID (but this is correct as mesos adds this env at runtime).
>
> I see that task is lost because, in my scheduler, in the statusUpdate
> method, I print the task status (value = 5). Message is empty.
>
> nothing in webUI, nothing in console logs.... as my executor is not
> executed, it means that mesos (master or slave) give me this error
> status, but I have no additional info about the reason.
>
> I have used and adapted the examples given with sources
> (src/examples/python).
Taking the Python code in src/examples/python as an example, I made a
little progress.

Though there is no additional error log, I found an issue with setting
the "command" parameter.

If I comment out the "command" parameter, my executor is executed (it
fails, but that's fine for the moment).

In my task, I was setting: task.command.value = "something to execute on
node"

Setting command causes a silent error.

My TaskInfo looked like:
.....
executor {
  executor_id {
    value: "default"
  }
  command {
    value: "....../test-executor"
  }
  name: "Test Executor (Python)"
  source: "python_test"
}
command {
  value: "ls -l"
}

So I wonder:

1) why is the error silent on the master side?

2) how do I set the command to execute in the TaskInfo object?



Re: how to debug task lost in custom scheduler?

Posted by Olivier Sallou <ol...@irisa.fr>.
On 10/18/2014 12:55 PM, Alex Rukletsov wrote:
> Hi Oliver,
>
> you can get a TASK_LOST if import directives in your executor fail. Do you
> have mesos python eggs installed or available through PYTHONPATH? Could you
> please also paste the output of stderr and stdout of the lost task (you can
> access them via mesos webUI → sandbox)?
I do not see the task at all in the web UI. The Python eggs are available
via PYTHONPATH; my eggs are in MESOS_BUILD_DIR.
If I run my executor directly, I get no Python error, only a
MISSING SLAVE ID (which is expected, as Mesos sets this env at runtime).

I know the task is lost because my scheduler's statusUpdate method
prints the task status (value = 5). The message is empty.
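
Printing the state name rather than the raw number makes such output easier
to read. A minimal Python sketch follows, assuming the TaskState numbering
as I recall it from mesos.proto at the time; please verify the values
against your own mesos.proto. The describe helper is hypothetical. With the
generated protobuf bindings, something like
mesos_pb2.TaskState.Name(update.state) should give the same name, but that
is an assumption about your installed version.

```python
# TaskState values as recalled from mesos.proto (verify for your version).
TASK_STATES = {
    0: "TASK_STARTING",
    1: "TASK_RUNNING",
    2: "TASK_FINISHED",
    3: "TASK_FAILED",
    4: "TASK_KILLED",
    5: "TASK_LOST",
    6: "TASK_STAGING",
}

def describe(state, message=""):
    """Format a status update as 'NAME: message' for logging."""
    name = TASK_STATES.get(state, "UNKNOWN(%d)" % state)
    return "%s: %s" % (name, message or "<no message>")

# The update described in this message: state 5 with an empty message.
print(describe(5))
```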

There is nothing in the web UI and nothing in the console logs. Since my
executor is never run, Mesos itself (master or slave) must be producing
this status, but I have no additional information about the reason.

I have used and adapted the examples shipped with the sources
(src/examples/python).

Olivier




Re: how to debug task lost in custom scheduler?

Posted by Alex Rukletsov <al...@mesosphere.io>.
Hi Olivier,

You can get a TASK_LOST if import statements in your executor fail. Do you
have the Mesos Python eggs installed or available through PYTHONPATH? Could
you please also paste the stderr and stdout of the lost task (you can
access them via the Mesos web UI → sandbox)?


Re: how to debug task lost in custom scheduler?

Posted by Vinod Kone <vi...@gmail.com>.
Can you grep for TASK_LOST in master and slave logs and paste the output
here?
