Posted to user@mesos.apache.org by Gerard Maas <ge...@gmail.com> on 2014/12/01 23:43:37 UTC

Fwd: Mesos killing Spark Driver

Hi,

Sorry if this has been discussed before. I'm new to the list.

We are currently running our Spark + Spark Streaming jobs on Mesos,
submitting our jobs through Marathon.
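For context, we submit the driver as a plain Marathon app through its REST
API. A simplified sketch of what that looks like (the app id, host name and
resource numbers below are illustrative, not our exact definition, and in
practice the script and jar would be fetched via "uris" or pre-installed on
the slaves):

  # register the driver as a Marathon app, one instance
  curl -X POST http://marathon-host:8080/v2/apps \
       -H 'Content-Type: application/json' \
       -d '{
             "id": "spark-streaming-driver",
             "cmd": "sh ./run-mesos.sh application-ts.conf",
             "cpus": 1.0,
             "mem": 1000,
             "instances": 1
           }'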

We see with some regularity that the Spark Streaming driver gets killed by
Mesos and then restarted on some other node by Marathon.

I've no clue why Mesos is killing the driver, and looking at both the Mesos
and Spark logs didn't make me any wiser.

In the Spark Streaming driver logs, I find this entry of Mesos "signing
off" my driver:

Shutting down
> Sending SIGTERM to process tree at pid 17845
> Killing the following process trees:
> [
> -+- 17845 sh -c sh ./run-mesos.sh application-ts.conf
>  \-+- 17846 sh ./run-mesos.sh application-ts.conf
>    \--- 17847 java -cp core-compute-job.jar
> -Dconfig.file=application-ts.conf com.compute.job.FooJob 31326
> ]
> Command terminated with signal Terminated (pid: 17845)


What would be the reasons for Mesos to kill an executor?
Has anybody seen something similar? Any hints on where to start digging?

-kr, Gerard.

Re: Mesos killing Spark Driver

Posted by Jing Dong <ji...@qubitproducts.com>.
When the executor died, did you see any exceptions from Mesos or Spark?



Re: Mesos killing Spark Driver

Posted by Gerard Maas <ge...@gmail.com>.
Hi,

This issue is on prod, running Marathon 0.6 - we are currently testing
0.7.5 on dev, but I have no results on this behavior yet.
I saw your post when searching the Marathon group, but didn't think it
would apply to my case since I don't see the NPE.
The warning about the version mismatch between Mesos 0.20 and Marathon 0.6
is indeed important.

Thanks,  Gerard.


Re: Mesos killing Spark Driver

Posted by Shijun Kong <sk...@investoranalytics.com>.
Hi Gerard,

What version of Marathon are you running? I ran into similar behavior some time back. My problem seemed to be a compatibility issue between Marathon and Mesos: https://github.com/mesosphere/marathon/issues/595



Regards,
Shijun


Re: Mesos killing Spark Driver

Posted by Gerard Maas <ge...@gmail.com>.
Hi again,

I finally found a clue to this issue. It looks like Marathon is the one
behind the job-killing spree. I still don't know *why*, but it looks like
Marathon's task reconciliation finds a discrepancy with Mesos and decides
to kill the instance.

INFO|2015-01-08
10:05:35,491|pool-1-thread-1173|MarathonScheduler.scala:299|Requesting task
reconciliation with the Mesos master
 INFO|2015-01-08
10:05:35,493|Thread-188479|MarathonScheduler.scala:138|Received status
update for task
core-compute-jobs-actualvalues-st.be0e36cc-9714-11e4-9e7c-3e6ce77341aa:
TASK_RUNNING (Reconciliation: Latest task state)
INFO|2015-01-08
10:05:35,494|pool-1-thread-1171|MarathonScheduler.scala:338|Need to scale
core-compute-jobs-actualvalues-st from 0 up to 1 instances

#### According to Mesos, at this point there's already an instance of this
job running, so it's actually scaling from 1 to 2 and not from 0 to 1 as
the logs say ####

INFO|2015-01-08 10:05:35,878|Thread-188483|TaskBuilder.scala:38|No matching
offer for core-compute-jobs-actualvalues-st (need 1.0 CPUs, 1000.0 mem, 0.0
disk, 1 ports)
(... offers ...)
...
#### Killing ####
 INFO|2015-01-08
10:06:05,494|pool-1-thread-1172|MarathonScheduler.scala:353|Scaling
core-compute-jobs-actualvalues-st from 2 down to 1 instances
 INFO|2015-01-08
10:06:05,494|pool-1-thread-1172|MarathonScheduler.scala:357|Killing tasks:
Set(core-compute-jobs-actualvalues-st.e458118d-971d-11e4-9e7c-3e6ce77341aa)

Any ideas why this happens and how to fix it?
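
To compare the two views by hand when this happens, Marathon and the Mesos
master can each be queried directly; something along these lines (the host
names below are placeholders for our own):

  # Marathon's view of the app's tasks
  curl http://marathon-host:8080/v2/apps/core-compute-jobs-actualvalues-st/tasks

  # the Mesos master's view of running tasks
  curl http://mesos-master:5050/master/state.json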

-kr, Gerard.



Re: Mesos killing Spark Driver

Posted by Gerard Maas <ge...@gmail.com>.
Thanks! I'll try that and report back once I have some interesting evidence.

-kr, Gerard.


Re: Mesos killing Spark Driver

Posted by Tim Chen <ti...@mesosphere.io>.
Hi Gerard,

I see. What would help diagnose your problem is if you can enable verbose
logging (GLOG_v=1) before running the slave, and share the slave logs when
it happens.
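
Something like this before starting the slave should do it (the master URL
and log dir below are just placeholders for your own setup):

  # run the slave with glog verbose logging enabled
  export GLOG_v=1
  mesos-slave --master=zk://zk-host:2181/mesos --log_dir=/var/log/mesos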

Tim


Re: Mesos killing Spark Driver

Posted by Gerard Maas <ge...@gmail.com>.
Hi Tim,

It's quite hard to reproduce. It just "happens"... sometimes worse than
others, mostly when the system is under load. We notice it because the
framework starts 'jumping' from one slave to another, but so far we have no
clue why this is happening.

What I'm currently looking for is a list of potential conditions that could
cause Mesos to kill the executor (not the task), so I can check whether any
of those conditions apply to our case and narrow the problem down to some
reproducible subset.

-kr, Gerard.



Re: Mesos killing Spark Driver

Posted by Tim Chen <ti...@mesosphere.io>.
There are different reasons, but the most common one is when the framework
asks to kill the task.
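
In your setup the framework that owns the driver task is Marathon, so for
instance anything that scales the app down will produce the kind of SIGTERM
to the command executor's process tree that you posted. You can trigger the
same thing by hand with something like this (host and app id below are
placeholders):

  # scale the app down to zero instances; Marathon asks Mesos to kill its task
  curl -X PUT http://marathon-host:8080/v2/apps/my-spark-driver \
       -H 'Content-Type: application/json' \
       -d '{"instances": 0}'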

Can you provide some easy repro steps/artifacts? I've been working on Spark
on Mesos these days and can help try this out.

Tim
