Posted to user@spark.apache.org by besil <sb...@beintoo.com> on 2015/07/09 12:32:09 UTC

Spark Mesos task rescheduling

Hi,

We are experiencing scheduling errors caused by Mesos slaves failing.
This appears to be an open bug; more information can be found here:

https://issues.apache.org/jira/browse/SPARK-3289

According to this link
<https://mail-archives.apache.org/mod_mbox/mesos-user/201310.mbox/%3CCAAkWvAxPRRNRCdLAZcybnmk1_9eLyhEOdAf8urf8ssrLBAcx8g@mail.gmail.com%3E>
from the mail archive, it seems that Spark doesn't reschedule LOST tasks to
active executors, but keeps trying to reschedule them on the failed host.

We would like to dynamically resize our Mesos cluster (adding or removing
machines - using an AWS autoscaling group), but this bug kills our running
applications if a Mesos slave running a Spark executor is shut down.

Is there any known workaround?

Thank you



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Mesos-task-rescheduling-tp23740.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Spark Mesos task rescheduling

Posted by Silvio Bernardinello <sb...@beintoo.com>.
Hi,

Thank you for confirming my doubts and for the link.
Yes, we actually run in fine-grained mode, because we would like to
dynamically resize our cluster as needed (thank you for the dynamic
allocation link).

However, we also tried coarse-grained mode, and Mesos does not seem to
relaunch the failed task. Perhaps there is a timeout before it is
relaunched, but I'm not aware of one.
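For anyone else comparing the two modes: they are toggled by a single
configuration key. A minimal sketch (Python here purely for illustration;
it assumes PySpark is installed, and the Mesos master URL is hypothetical):

```python
from pyspark import SparkConf

# spark.mesos.coarse selects between the two Mesos scheduler backends:
#   "false" -> fine-grained: each Spark task runs as its own Mesos task,
#              so cluster usage can grow and shrink with load
#   "true"  -> coarse-grained: one long-running Mesos task per executor
conf = (SparkConf()
        .setAppName("mesos-mode-example")
        .setMaster("mesos://zk://host1:2181/mesos")  # hypothetical master URL
        .set("spark.mesos.coarse", "false"))
```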



On Thu, Jul 9, 2015 at 5:13 PM, Iulian Dragoș <iu...@typesafe.com>
wrote:

>
>
> On Thu, Jul 9, 2015 at 12:32 PM, besil <sb...@beintoo.com> wrote:
>
>> Hi,
>>
>> We are experiencing scheduling errors caused by Mesos slaves failing.
>> This appears to be an open bug; more information can be found here:
>>
>> https://issues.apache.org/jira/browse/SPARK-3289
>>
>> According to this link
>> <
>> https://mail-archives.apache.org/mod_mbox/mesos-user/201310.mbox/%3CCAAkWvAxPRRNRCdLAZcybnmk1_9eLyhEOdAf8urf8ssrLBAcx8g@mail.gmail.com%3E
>> >
>> from the mail archive, it seems that Spark doesn't reschedule LOST tasks to
>> active executors, but keeps trying to reschedule them on the failed host.
>>
>
> Are you running in fine-grained mode? In coarse-grained mode, it seems that
> Spark will notice a slave that fails repeatedly and will stop accepting
> offers on that slave:
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala#L188
>
>
>>
>> We would like to dynamically resize our Mesos cluster (adding or removing
>> machines - using an AWS autoscaling group), but this bug kills our running
>> applications if a Mesos slave running a Spark executor is shut down.
>>
>
> I think what you need is dynamic allocation, which should be available
> soon (PR: 4984 <https://github.com/apache/spark/pull/4984>).
>
>
>> Is there any known workaround?
>>
>> Thank you
>>
>>
>>
>>
>>
>
>
>
>

Re: Spark Mesos task rescheduling

Posted by Iulian Dragoș <iu...@typesafe.com>.
On Thu, Jul 9, 2015 at 12:32 PM, besil <sb...@beintoo.com> wrote:

> Hi,
>
> We are experiencing scheduling errors caused by Mesos slaves failing.
> This appears to be an open bug; more information can be found here:
>
> https://issues.apache.org/jira/browse/SPARK-3289
>
> According to this link
> <
> https://mail-archives.apache.org/mod_mbox/mesos-user/201310.mbox/%3CCAAkWvAxPRRNRCdLAZcybnmk1_9eLyhEOdAf8urf8ssrLBAcx8g@mail.gmail.com%3E
> >
> from the mail archive, it seems that Spark doesn't reschedule LOST tasks to
> active executors, but keeps trying to reschedule them on the failed host.
>

Are you running in fine-grained mode? In coarse-grained mode, it seems that
Spark will notice a slave that fails repeatedly and will stop accepting
offers on that slave:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala#L188
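The idea behind the linked code can be paraphrased as a small per-slave
failure counter (a Python sketch for illustration only: the class name,
method names, and the failure limit below are illustrative assumptions,
not Spark's actual code):

```python
from collections import defaultdict

# Count executor failures per slave, and stop accepting resource offers
# from any slave that has failed too many times.
class SlaveFailureTracker:
    def __init__(self, max_failures=2):
        self.max_failures = max_failures
        self.failures_by_slave = defaultdict(int)

    def record_failure(self, slave_id):
        """Called when an executor on this slave is reported lost/failed."""
        self.failures_by_slave[slave_id] += 1

    def should_accept_offer(self, slave_id):
        """Decline offers from slaves that have hit the failure limit."""
        return self.failures_by_slave[slave_id] < self.max_failures
```

With a limit of 2, a slave's offers keep being accepted after its first
failure, but are declined from the second failure onward, while other
slaves are unaffected.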


>
> We would like to dynamically resize our Mesos cluster (adding or removing
> machines - using an AWS autoscaling group), but this bug kills our running
> applications if a Mesos slave running a Spark executor is shut down.
>

I think what you need is dynamic allocation, which should be available soon
(PR: 4984 <https://github.com/apache/spark/pull/4984>).
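Once that lands, enabling it should mostly be a configuration matter; a
hedged sketch (Python for illustration; assumes PySpark, that the standard
dynamic-allocation keys apply on Mesos after the PR is merged, and the
min/max bounds are made-up values):

```python
from pyspark import SparkConf

# Dynamic allocation grows and shrinks the executor set with load. It
# requires the external shuffle service, so shuffle files survive when
# an executor is removed.
conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "2")
        .set("spark.dynamicAllocation.maxExecutors", "20"))
```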


> Is there any known workaround?
>
> Thank you
>
>
>
>
>


--
Iulian Dragos

------
Reactive Apps on the JVM
www.typesafe.com