Posted to dev@spark.apache.org by Ashwin Shankar <as...@gmail.com> on 2014/10/22 20:47:21 UTC

Multitenancy in Spark - within/across spark context

Hi Spark devs/users,
One of the things we are investigating here at Netflix is whether Spark would
suit our ETL needs, and one of the requirements is multi-tenancy.
I did read the official doc
<http://spark.apache.org/docs/latest/job-scheduling.html> and the book, but
I'm still not clear on certain things.

Here are my questions:
1. *Sharing a Spark context* : How exactly can multiple users share the
    cluster using the same Spark context? Say UserA wants to run AppA and
    UserB wants to run AppB. How do they talk to the same context? How
    exactly are each of their jobs scheduled and run in the same context?
    Is preemption supported in this scenario? How are user names passed on
    to the Spark context?

2. *Different Spark contexts in YARN*: assuming I have a YARN cluster with
    queues and preemption configured. Are there problems if
    executors/containers of a Spark app are preempted to allow a
    high-priority Spark app to execute? Would the preempted app get stuck,
    or would it continue to make progress? How are user names passed on
    from Spark to YARN (say I'm using the nested user queues feature in the
    fair scheduler)?

3. Can RDDs be shared in scenarios 1 and 2 above?

4. Is there anything else I should know about user/job isolation?

I know I'm asking a lot of questions. Thanks in advance! :)

-- 
Thanks,
Ashwin
Netflix

Re: Multitenancy in Spark - within/across spark context

Posted by RJ Nowling <rn...@gmail.com>.
Ashwin,

What is your motivation for needing to share RDDs between jobs? Optimizing
for reusing data across jobs?

If so, you may want to look into Tachyon. My understanding is that Tachyon
acts like a caching layer: you can designate which data will be reused in
multiple jobs so that it knows to keep that data in memory or on local disk
for faster access. But my knowledge of Tachyon is secondhand, so forgive me
if I have it wrong :)
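
Something like this is the pattern I have in mind -- a sketch only, assuming
a Tachyon master at tachyon://tachyon-master:19998 and the Tachyon client jar
on the Spark classpath (the host, port, and paths are all made up):

  // App A: write a dataset where other applications can find it.
  // "sc" is the application's existing SparkContext.
  val errors = sc.textFile("hdfs:///logs/2014-10-22").filter(_.contains("ERROR"))
  errors.saveAsTextFile("tachyon://tachyon-master:19998/shared/errors")

  // App B (a separate SparkContext): read the same data back, served from
  // Tachyon's memory/local-disk tiers instead of going to HDFS again.
  val shared = sc.textFile("tachyon://tachyon-master:19998/shared/errors")
  println(shared.count())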

RJ

On Friday, October 24, 2014, Evan Chan <ve...@gmail.com> wrote:

> Ashwin,
>
> I would say the strategies in general are:
>
> 1) Have each user submit a separate Spark app (each with its own Spark
> Context) and its own resource settings, and share data through HDFS
> or something like Tachyon for speed.
>
> 2) Share a single Spark context amongst multiple users, using the fair
> scheduler.  This is sort of like having a Hadoop resource pool.  It
> has some obvious HA/SPOF issues, namely that if the context dies then
> every user using it is dead too.  Also, sharing RDDs in cached
> memory has the same resiliency problem: if any executor
> dies then Spark must recompute / rebuild the RDD (it tries to only
> rebuild the missing part, but sometimes it must rebuild everything).
>
> The job server can help with 1 or 2 -- 2 in particular.  If you have any
> questions about the job server, feel free to ask on the spark-jobserver
> Google group.  I am the maintainer.
>
> -Evan
>

-- 
em rnowling@gmail.com
c 954.496.2314

Re: Multitenancy in Spark - within/across spark context

Posted by Evan Chan <ve...@gmail.com>.
Ashwin,

I would say the strategies in general are:

1) Have each user submit a separate Spark app (each with its own Spark
Context) and its own resource settings, and share data through HDFS
or something like Tachyon for speed.

2) Share a single Spark context amongst multiple users, using the fair
scheduler (see the sketch below).  This is sort of like having a Hadoop
resource pool.  It has some obvious HA/SPOF issues, namely that if the
context dies then every user using it is dead too.  Also, sharing RDDs in
cached memory has the same resiliency problem: if any executor dies then
Spark must recompute / rebuild the RDD (it tries to only rebuild the
missing part, but sometimes it must rebuild everything).
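
For the scheduling half of 2), roughly (a sketch built on Spark's documented
fair-scheduler settings; the per-user pool wrapper, the allocation-file path,
and the data paths are made up, and a real shared-driver service needs a lot
more around it -- auth, a submission API, etc.):

  import org.apache.spark.{SparkConf, SparkContext}

  // One long-lived driver/context shared by several users.
  val conf = new SparkConf()
    .setAppName("shared-context")
    .set("spark.scheduler.mode", "FAIR")  // fair scheduling across concurrent jobs
    .set("spark.scheduler.allocation.file", "/etc/spark/fairscheduler.xml")  // optional pool config
  val sc = new SparkContext(conf)

  // Run each user's work on its own thread, tagged with a per-user pool so
  // one user's long job doesn't starve everyone else's.
  def runAs(user: String)(body: => Unit): Unit = {
    new Thread(new Runnable {
      def run(): Unit = {
        sc.setLocalProperty("spark.scheduler.pool", user)  // pool assignment is per-thread
        try body finally sc.setLocalProperty("spark.scheduler.pool", null)
      }
    }).start()
  }

  runAs("userA") { sc.textFile("hdfs:///data/a").count() }
  runAs("userB") { sc.textFile("hdfs:///data/b").count() }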

The job server can help with 1 or 2 -- 2 in particular.  If you have any
questions about the job server, feel free to ask on the spark-jobserver
Google group.  I am the maintainer.

-Evan


On Thu, Oct 23, 2014 at 1:06 PM, Marcelo Vanzin <va...@cloudera.com> wrote:
> You may want to take a look at https://issues.apache.org/jira/browse/SPARK-3174.

Re: Multitenancy in Spark - within/across spark context

Posted by Marcelo Vanzin <va...@cloudera.com>.
You may want to take a look at https://issues.apache.org/jira/browse/SPARK-3174.
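
That JIRA covers dynamic executor allocation on YARN, i.e. growing and
shrinking an app's executor count with its load. Once it lands, enabling it
should look something like the following -- a sketch only; treat the exact
property names and the numbers as assumptions, since the feature is still
being finalized:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("elastic-etl")
    .set("spark.dynamicAllocation.enabled", "true")     // add/remove executors with load
    .set("spark.dynamicAllocation.minExecutors", "2")   // floor while the app is idle
    .set("spark.dynamicAllocation.maxExecutors", "50")  // cap so one app can't hog the queue
    .set("spark.shuffle.service.enabled", "true")       // external shuffle service, so shuffle
                                                        // output survives executor removal
  val sc = new SparkContext(conf)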

On Thu, Oct 23, 2014 at 2:56 AM, Jianshi Huang <ji...@gmail.com> wrote:
> Upvote for the multitenancy requirement.
>
> I'm also building a data analytics platform, and there'll be multiple users
> running queries and computations simultaneously. One of the pain points is
> control of resource size. Users don't really know how many nodes they need,
> so they always use as much as possible... The result is a lot of wasted
> resources in our YARN cluster.
>
> A way to 1) allow multiple Spark contexts to share the same resources or 2)
> add dynamic resource management for YARN mode is very much wanted.
>
> Jianshi
>
> On Thu, Oct 23, 2014 at 5:36 AM, Marcelo Vanzin <va...@cloudera.com> wrote:
>>
>> On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar
>> <as...@gmail.com> wrote:
>> >> That's not something you might want to do usually. In general, a
>> >> SparkContext maps to a user application
>> >
>> > My question was basically this. In this page in the official doc, under
>> > "Scheduling within an application" section, it talks about multiuser and
>> > fair sharing within an app. How does multiuser within an application
>> > work(how users connect to an app,run their stuff) ? When would I want to
>> > use
>> > this ?
>>
>> I see. The way I read that page is that Spark supports all those
>> scheduling options; but Spark doesn't give you the means to actually
>> be able to submit jobs from different users to a running SparkContext
>> hosted on a different process. For that, you'll need something like
>> the job server that I referenced before, or write your own framework
>> for supporting that.
>>
>> Personally, I'd use the information on that page when dealing with
>> concurrent jobs in the same SparkContext, but still restricted to the
>> same user. I'd avoid trying to create any application where a single
>> SparkContext is trying to be shared by multiple users in any way.
>>
>> >> As far as I understand, this will cause executors to be killed, which
>> >> means that Spark will start retrying tasks to rebuild the data that
>> >> was held by those executors when needed.
>> >
>> > I basically wanted to find out if there were any "gotchas" related to
>> > preemption on Spark. Things like say half of an application's executors
>> > got
>> > preempted say while doing reduceByKey, will the application progress
>> > with
>> > the remaining resources/fair share ?
>>
>> Jobs should still make progress as long as at least one executor is
>> available. The gotcha would be the one I mentioned, where Spark will
>> fail your job after "x" executors failed, which might be a common
>> occurrence when preemption is enabled. That being said, it's a
>> configurable option, so you can set "x" to a very large value and your
>> job should keep on chugging along.
>>
>> The options you'd want to take a look at are: spark.task.maxFailures
>> and spark.yarn.max.executor.failures
>>
>> --
>> Marcelo
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/



-- 
Marcelo

Re: Multitenancy in Spark - within/across spark context

Posted by Jianshi Huang <ji...@gmail.com>.
Upvote for the multitenancy requirement.

I'm also building a data analytics platform, and there'll be multiple users
running queries and computations simultaneously. One of the pain points is
control of resource size. Users don't really know how many nodes they need,
so they always use as much as possible... The result is a lot of wasted
resources in our YARN cluster.

A way to 1) allow multiple Spark contexts to share the same resources or 2)
add dynamic resource management for YARN mode is very much wanted.

Jianshi

-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/

Re: Multitenancy in Spark - within/across spark context

Posted by Marcelo Vanzin <va...@cloudera.com>.
On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar
<as...@gmail.com> wrote:
>> That's not something you might want to do usually. In general, a
>> SparkContext maps to a user application
>
> My question was basically this. In this page in the official doc, under the
> "Scheduling within an application" section, it talks about multiuser and
> fair sharing within an app. How does multiuser within an application
> work (how do users connect to an app and run their stuff)? When would I want
> to use this?

I see. The way I read that page is that Spark supports all those
scheduling options, but Spark doesn't give you the means to actually
submit jobs from different users to a running SparkContext hosted in a
different process. For that, you'll need something like the job server
that I referenced before, or you'd have to write your own framework for
supporting that.

Personally, I'd use the information on that page when dealing with
concurrent jobs in the same SparkContext, but still restricted to the
same user. I'd avoid trying to build any application where a single
SparkContext is shared by multiple users in any way.

>> As far as I understand, this will cause executors to be killed, which
>> means that Spark will start retrying tasks to rebuild the data that
>> was held by those executors when needed.
>
> I basically wanted to find out if there are any "gotchas" related to
> preemption on Spark. For example, if half of an application's executors got
> preempted, say while doing a reduceByKey, would the application progress with
> the remaining resources/fair share?

Jobs should still make progress as long as at least one executor is
available. The gotcha would be the one I mentioned, where Spark will
fail your job after "x" executor failures, which might be a common
occurrence when preemption is enabled. That being said, it's a
configurable option, so you can set "x" to a very large value and your
job should keep on chugging along.

The options you'd want to take a look at are: spark.task.maxFailures
and spark.yarn.max.executor.failures
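
For example (a sketch -- the numbers are arbitrary placeholders, not
recommendations):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("preemption-tolerant-etl")
    .set("spark.task.maxFailures", "16")             // per-task retry limit before the job fails
    .set("spark.yarn.max.executor.failures", "200")  // executor-loss limit before the app gives up
  val sc = new SparkContext(conf)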

-- 
Marcelo

Re: Multitenancy in Spark - within/across spark context

Posted by Ashwin Shankar <as...@gmail.com>.
Thanks Marcelo, that was helpful! I have some follow-up questions:

> That's not something you might want to do usually. In general, a
> SparkContext maps to a user application

My question was basically this. In this
<http://spark.apache.org/docs/latest/job-scheduling.html> page in the
official doc, under the "Scheduling within an application" section, it talks
about multiuser and fair sharing within an app. How does multiuser within
an application work (how do users connect to an app and run their stuff)?
When would I want to use this?

> As far as I understand, this will cause executors to be killed, which
> means that Spark will start retrying tasks to rebuild the data that
> was held by those executors when needed.

I basically wanted to find out if there are any "gotchas" related to
preemption on Spark. For example, if half of an application's executors got
preempted, say while doing a reduceByKey, would the application progress with
the remaining resources/fair share?

I'm new to Spark, so sorry if I'm asking something very obvious :).

Thanks,
Ashwin


Re: Multitenancy in Spark - within/across spark context

Posted by Marcelo Vanzin <va...@cloudera.com>.
Hi Ashwin,

Let me try to answer to the best of my knowledge.

On Wed, Oct 22, 2014 at 11:47 AM, Ashwin Shankar
<as...@gmail.com> wrote:
> Here are my questions :
> 1. Sharing a Spark context : How exactly can multiple users share the cluster
> using the same Spark
>     context ?

That's not something you might want to do usually. In general, a
SparkContext maps to a user application, so each user would submit
their own job which would create its own SparkContext.

If you want to go outside of Spark, there are projects which allow you
to manage SparkContext instances outside of applications and
potentially share them, such as
https://github.com/spark-jobserver/spark-jobserver. But be sure you
actually need it - since you haven't really explained the use case,
it's not very clear.

> 2. Different spark context in YARN: assuming I have a YARN cluster with
> queues and preemption
>     configured. Are there problems if executors/containers of a spark app
> are preempted to allow a
>     high priority spark app to execute ?

As far as I understand, this will cause executors to be killed, which
means that Spark will start retrying tasks to rebuild the data that
was held by those executors when needed. Yarn mode does have a
configurable upper limit on the number of executor failures, so if
your job keeps getting preempted it will eventually fail (unless you
tweak the settings).

I don't recall whether Yarn has an API to cleanly allow clients to
stop executors when preempted, but even if it does, I don't think
that's supported in Spark at the moment.

> How are user names passed on from Spark to YARN (say I'm
> using the nested user queues feature in the fair scheduler) ?

Spark will try to run the job as the requesting user; if you're not
using Kerberos, that means the processes themselves will be run as
whatever user runs the Yarn daemons, but the Spark app will be run
inside a "UserGroupInformation.doAs()" call as the requesting user. So
technically nested queues should work as expected.
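
Roughly what that pattern looks like, as a sketch (this is not Spark's actual
submission code; the user name and queue are made up, and proxying requires
the launching user to be configured as a Hadoop proxy user):

  import java.security.PrivilegedExceptionAction
  import org.apache.hadoop.security.UserGroupInformation

  val requestingUser = "ashwin"  // hypothetical
  val proxyUgi = UserGroupInformation.createProxyUser(
    requestingUser, UserGroupInformation.getCurrentUser)

  // Everything inside run() -- including the YARN application submission --
  // executes as the requesting user, so YARN sees "ashwin" and can place the
  // app with, e.g., the fair scheduler's nestedUserQueue rule.
  proxyUgi.doAs(new PrivilegedExceptionAction[Unit] {
    override def run(): Unit = {
      // launch the YARN app / create the SparkContext here
    }
  })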

> 3. Sharing RDDs in 1 and 2 above ?

I'll assume you don't mean actually sharing RDDs in the same context,
but between different SparkContext instances. You might (big might
here) be able to checkpoint an RDD from one context and load it from
another context; that's actually how some HA-like features for Spark
drivers are being addressed.
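
The low-tech version of that is just materializing the RDD somewhere durable
so a second context can reload it -- a sketch, with made-up paths, "sc" being
one application's context and "sc2" standing in for the other's:

  import org.apache.spark.SparkContext._  // pair-RDD functions like reduceByKey

  // Context 1: compute something and persist its contents durably.
  val counts = sc.textFile("hdfs:///etl/input")
    .map(line => (line.split("\t")(0), 1L))
    .reduceByKey(_ + _)
  counts.saveAsObjectFile("hdfs:///shared/counts")  // plain files; Tachyon or a table work too

  // Context 2, in a different application: load it back.
  val reloaded = sc2.objectFile[(String, Long)]("hdfs:///shared/counts")
  println(reloaded.count())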

The job server I mentioned before, which allows different apps to
share the same Spark context, also has a feature to share RDDs by name,
without having to resort to checkpointing.

Hope this helps!

-- 
Marcelo
