Posted to user@mesos.apache.org by "Chawla,Sumit " <su...@gmail.com> on 2017/05/19 04:44:17 UTC

Mesos Executor Failing

Hi

I am facing a peculiar issue on one of the slave nodes of our cluster.  I
have a Spark cluster with 40+ nodes.  On one of the nodes, all tasks fail
with exit code 0.

ExecutorLostFailure (executor e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S76
exited caused by one of the running tasks) Reason: Unknown executor exit
code (0)


I cannot seem to find anything in mesos-slave.logs, and there is nothing
being written to stdout/stderr.  Are there any debugging utilities that I
can use to find out what is going wrong on that particular slave?

I tried running the following, but got stuck at:


/mesos-containerizer launch
--command='{"environment":{},"shell":true,"value":"ls -ltr"}'
--directory=/var/tmp/mesos/slaves/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S77/frameworks/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-0312/executors/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S77/runs/45aa784c-f485-46a6-aeb8-997e82b80c4f
--help=false --pipe_read=0 --pipe_write=0 --user=smi

Failed to synchronize with slave (it's probably exited)


I would appreciate pointers to any debugging methods/documentation to
diagnose these kinds of problems.

Regards
Sumit Chawla

Re: Mesos Executor Failing

Posted by "Chawla,Sumit " <su...@gmail.com>.

Hi Joseph

The error code is being reported as 0, and there is not much else in the
logs.

Regards
Sumit Chawla


On Wed, May 24, 2017 at 12:21 AM, Joseph Wu <jo...@mesosphere.io> wrote:

> There isn't a tool for this.  Can you check if the Mesos agent is being
> restarted (or crashing) when you launch a task?  And perhaps upload some
> logs around the time of the task launch.
>
> There is a mismatch between the exit codes you've reported though.  When
> you see that log line in the sandbox logs, the exit code will be "1"
> (failure), rather than "0" (success).
>
> On Mon, May 22, 2017 at 9:30 PM, Chawla,Sumit <su...@gmail.com>
> wrote:
>
>> Hi Joseph
>>
>> I am using 0.27.0.  Is there any diagnostic tool or command line that I
>> can run to ascertain why it's happening?
>>
>> Regards
>> Sumit Chawla
>>
>>
>> On Fri, May 19, 2017 at 2:31 PM, Joseph Wu <jo...@mesosphere.io> wrote:
>>
>>> What version of Mesos are you using?  (Just based on the word "slave" in
>>> that error message, I'm guessing 0.28 or older.)
>>>
>>> The "Failed to synchronize" error is something that can occur while the
>>> agent is launching the executor.  During the launch, the agent will create
>>> a pipe to the executor subprocess; and the executor makes a blocking read
>>> on this pipe.  The agent will write a value to the pipe to signal the
>>> executor to proceed.  If the agent restarts or the pipe breaks at this
>>> point in the launch, then you'll see this error message.
>>>
>>> On Thu, May 18, 2017 at 9:44 PM, Chawla,Sumit <su...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> I am facing a peculiar issue on one of the slave nodes of our cluster.
>>>> I have a Spark cluster with 40+ nodes.  On one of the nodes, all tasks fail
>>>> with exit code 0.
>>>>
>>>> ExecutorLostFailure (executor e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S76
>>>> exited caused by one of the running tasks) Reason: Unknown executor
>>>> exit code (0)
>>>>
>>>>
>>>> I cannot seem to find anything in mesos-slave.logs, and there is
>>>> nothing being written to stdout/stderr.  Are there any debugging utilities
>>>> that I can use to find out what is going wrong on that particular slave?
>>>>
>>>>
>>>> I tried running the following, but got stuck at:
>>>>
>>>>
>>>> /mesos-containerizer launch
>>>> --command='{"environment":{},"shell":true,"value":"ls -ltr"}'
>>>> --directory=/var/tmp/mesos/slaves/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S77/frameworks/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-0312/executors/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S77/runs/45aa784c-f485-46a6-aeb8-997e82b80c4f
>>>> --help=false --pipe_read=0 --pipe_write=0 --user=smi
>>>>
>>>> Failed to synchronize with slave (it's probably exited)
>>>>
>>>>
>>>> I would appreciate pointers to any debugging methods/documentation to
>>>> diagnose these kinds of problems.
>>>>
>>>> Regards
>>>> Sumit Chawla
>>>>
>>>>
>>>
>>
>

Re: Mesos Executor Failing

Posted by Joseph Wu <jo...@mesosphere.io>.

There isn't a tool for this.  Can you check if the Mesos agent is being
restarted (or crashing) when you launch a task?  And perhaps upload some
logs around the time of the task launch.

There is a mismatch between the exit codes you've reported though.  When
you see that log line in the sandbox logs, the exit code will be "1"
(failure), rather than "0" (success).

On Mon, May 22, 2017 at 9:30 PM, Chawla,Sumit <su...@gmail.com>
wrote:

> Hi Joseph
>
> I am using 0.27.0.  Is there any diagnostic tool or command line that I can
> run to ascertain why it's happening?
>
> Regards
> Sumit Chawla
>
>
> On Fri, May 19, 2017 at 2:31 PM, Joseph Wu <jo...@mesosphere.io> wrote:
>
>> What version of Mesos are you using?  (Just based on the word "slave" in
>> that error message, I'm guessing 0.28 or older.)
>>
>> The "Failed to synchronize" error is something that can occur while the
>> agent is launching the executor.  During the launch, the agent will create
>> a pipe to the executor subprocess; and the executor makes a blocking read
>> on this pipe.  The agent will write a value to the pipe to signal the
>> executor to proceed.  If the agent restarts or the pipe breaks at this
>> point in the launch, then you'll see this error message.
>>
>> On Thu, May 18, 2017 at 9:44 PM, Chawla,Sumit <su...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> I am facing a peculiar issue on one of the slave nodes of our cluster.
>>> I have a Spark cluster with 40+ nodes.  On one of the nodes, all tasks fail
>>> with exit code 0.
>>>
>>> ExecutorLostFailure (executor e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S76
>>> exited caused by one of the running tasks) Reason: Unknown executor
>>> exit code (0)
>>>
>>>
>>> I cannot seem to find anything in mesos-slave.logs, and there is nothing
>>> being written to stdout/stderr.  Are there any debugging utilities that I
>>> can use to find out what is going wrong on that particular slave?
>>>
>>> I tried running the following, but got stuck at:
>>>
>>>
>>> /mesos-containerizer launch
>>> --command='{"environment":{},"shell":true,"value":"ls -ltr"}'
>>> --directory=/var/tmp/mesos/slaves/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S77/frameworks/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-0312/executors/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S77/runs/45aa784c-f485-46a6-aeb8-997e82b80c4f
>>> --help=false --pipe_read=0 --pipe_write=0 --user=smi
>>>
>>> Failed to synchronize with slave (it's probably exited)
>>>
>>>
>>> I would appreciate pointers to any debugging methods/documentation to
>>> diagnose these kinds of problems.
>>>
>>> Regards
>>> Sumit Chawla
>>>
>>>
>>
>

Re: Mesos Executor Failing

Posted by "Chawla,Sumit " <su...@gmail.com>.

Hi Joseph

I am using 0.27.0.  Is there any diagnostic tool or command line that I can
run to ascertain why it's happening?

Regards
Sumit Chawla


On Fri, May 19, 2017 at 2:31 PM, Joseph Wu <jo...@mesosphere.io> wrote:

> What version of Mesos are you using?  (Just based on the word "slave" in
> that error message, I'm guessing 0.28 or older.)
>
> The "Failed to synchronize" error is something that can occur while the
> agent is launching the executor.  During the launch, the agent will create
> a pipe to the executor subprocess; and the executor makes a blocking read
> on this pipe.  The agent will write a value to the pipe to signal the
> executor to proceed.  If the agent restarts or the pipe breaks at this
> point in the launch, then you'll see this error message.
>
> On Thu, May 18, 2017 at 9:44 PM, Chawla,Sumit <su...@gmail.com>
> wrote:
>
>> Hi
>>
>> I am facing a peculiar issue on one of the slave nodes of our cluster.  I
>> have a Spark cluster with 40+ nodes.  On one of the nodes, all tasks fail
>> with exit code 0.
>>
>> ExecutorLostFailure (executor e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S76
>> exited caused by one of the running tasks) Reason: Unknown executor exit
>> code (0)
>>
>>
>> I cannot seem to find anything in mesos-slave.logs, and there is nothing
>> being written to stdout/stderr.  Are there any debugging utilities that I
>> can use to find out what is going wrong on that particular slave?
>>
>> I tried running the following, but got stuck at:
>>
>>
>> /mesos-containerizer launch
>> --command='{"environment":{},"shell":true,"value":"ls -ltr"}'
>> --directory=/var/tmp/mesos/slaves/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S77/frameworks/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-0312/executors/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S77/runs/45aa784c-f485-46a6-aeb8-997e82b80c4f
>> --help=false --pipe_read=0 --pipe_write=0 --user=smi
>>
>> Failed to synchronize with slave (it's probably exited)
>>
>>
>> I would appreciate pointers to any debugging methods/documentation to
>> diagnose these kinds of problems.
>>
>> Regards
>> Sumit Chawla
>>
>>
>

Re: Mesos Executor Failing

Posted by Joseph Wu <jo...@mesosphere.io>.

What version of Mesos are you using?  (Just based on the word "slave" in
that error message, I'm guessing 0.28 or older.)

The "Failed to synchronize" error is something that can occur while the
agent is launching the executor.  During the launch, the agent will create
a pipe to the executor subprocess; and the executor makes a blocking read
on this pipe.  The agent will write a value to the pipe to signal the
executor to proceed.  If the agent restarts or the pipe breaks at this
point in the launch, then you'll see this error message.
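
A minimal sketch of the handshake described above, written in Python rather
than the C++ that Mesos itself is written in.  It is not the Mesos
implementation: the function name `launch`, the `agent_signals` flag, and the
exact message text are made up for illustration.  The parent plays the agent
and the forked child plays the executor launcher.  If the parent closes the
pipe without writing, the child's blocking read returns end-of-file and the
child exits with code 1, which matches the exit code expected alongside the
"Failed to synchronize" message.

```python
import os


def launch(agent_signals: bool) -> int:
    """Fork a child that waits on a pipe; return the child's exit code.

    Hypothetical helper for illustration only (POSIX systems).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()

    if pid == 0:
        # Child ("executor" side).
        exit_code = 1
        try:
            # Close our copy of the write end so that EOF is visible once
            # the parent closes its copy.
            os.close(write_fd)
            # Blocking read: returns one byte if the parent signalled,
            # or b"" (EOF) if every write end was closed without a write.
            if os.read(read_fd, 1):
                exit_code = 0  # A real launcher would exec the command here.
            else:
                os.write(2, b"Failed to synchronize with parent\n")
        finally:
            # Always leave via _exit so the child never returns to the caller.
            os._exit(exit_code)

    # Parent ("agent" side).
    os.close(read_fd)
    if agent_signals:
        os.write(write_fd, b"\0")  # Signal the child to proceed.
    os.close(write_fd)  # Without a prior write, this is a "broken" pipe.

    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    print("signalled:    ", launch(True))
    print("not signalled:", launch(False))
```

Here `launch(True)` returns 0, while `launch(False)` returns 1 and writes the
failure message to stderr.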

On Thu, May 18, 2017 at 9:44 PM, Chawla,Sumit <su...@gmail.com>
wrote:

> Hi
>
> I am facing a peculiar issue on one of the slave nodes of our cluster.  I
> have a Spark cluster with 40+ nodes.  On one of the nodes, all tasks fail
> with exit code 0.
>
> ExecutorLostFailure (executor e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S76
> exited caused by one of the running tasks) Reason: Unknown executor exit
> code (0)
>
>
> I cannot seem to find anything in mesos-slave.logs, and there is nothing
> being written to stdout/stderr.  Are there any debugging utilities that I
> can use to find out what is going wrong on that particular slave?
>
> I tried running the following, but got stuck at:
>
>
> /mesos-containerizer launch
> --command='{"environment":{},"shell":true,"value":"ls -ltr"}'
> --directory=/var/tmp/mesos/slaves/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S77/frameworks/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-0312/executors/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S77/runs/45aa784c-f485-46a6-aeb8-997e82b80c4f
> --help=false --pipe_read=0 --pipe_write=0 --user=smi
>
> Failed to synchronize with slave (it's probably exited)
>
>
> I would appreciate pointers to any debugging methods/documentation to
> diagnose these kinds of problems.
>
> Regards
> Sumit Chawla
>
>
