Posted to user@spark.apache.org by Sadhan Sood <sa...@gmail.com> on 2015/08/26 19:45:28 UTC

Spark cluster multi tenancy

Hi All,

We've set up our Spark cluster on AWS running on YARN (Hadoop 2.3) with
fair scheduling and preemption turned on. The cluster is shared between
prod and dev work, where prod runs with a higher fair share and can
preempt dev jobs if there are not enough resources available for it.
It appears that dev jobs which get preempted often become unstable after
losing some executors: the whole job either gets stuck (without making any
progress) or ends up getting restarted (and hence loses all the work done).
Has anyone encountered this before? Is the solution just to set
spark.task.maxFailures to a really high value to recover from task failures
in such scenarios? Are there other approaches people have taken for Spark
multi-tenancy that work better in this situation?
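
For context, here is roughly the shape of the fair-scheduler allocation we
mean (queue names, weights and timeouts below are illustrative, not our
exact values), with preemption switched on via
yarn.scheduler.fair.preemption=true in yarn-site.xml:

  <?xml version="1.0"?>
  <allocations>
    <!-- prod gets 3x the fair share of dev and can preempt it -->
    <queue name="prod">
      <weight>3.0</weight>
    </queue>
    <queue name="dev">
      <weight>1.0</weight>
    </queue>
    <!-- seconds a starved queue waits before preempting containers -->
    <fairSharePreemptionTimeout>60</fairSharePreemptionTimeout>
    <defaultMinSharePreemptionTimeout>60</defaultMinSharePreemptionTimeout>
  </allocations>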

Thanks,
Sadhan

Re: Spark cluster multi tenancy

Posted by Jerrick Hoang <je...@gmail.com>.
Would be interested to know the answer too.


Re: Spark cluster multi tenancy

Posted by Sadhan Sood <sa...@gmail.com>.
Interestingly, if there is nothing running on the dev spark-shell, it
recovers successfully and regains the lost executors. Attaching the log for
that. Notice the "Registering block manager ..." statements at the very
end, after all executors were lost.
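
A quick way to watch this from the shell itself (plain SparkContext API,
nothing specific to our setup) is to count the registered block managers:

  // One entry per registered block manager, including the driver,
  // so a healthy app with N executors reports N + 1 here.
  sc.getExecutorMemoryStatus.size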


Re: Spark cluster multi tenancy

Posted by Sadhan Sood <sa...@gmail.com>.
Attaching the log for when the dev job gets stuck (once all its executors
are lost due to preemption). This is a spark-shell job running in
yarn-client mode.
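
For reference, a sketch of how the shell could be launched with
spark.task.maxFailures (mentioned in the original post) and YARN's
executor-failure limit raised; the values here are illustrative, not what
we actually run:

  spark-shell \
    --master yarn-client \
    --queue dev \
    --conf spark.task.maxFailures=20 \
    --conf spark.yarn.max.executor.failures=100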
